StringRegExperts


JohnOne

I really just have no clue how it works, even after several hours trying to study the helpfile examples.

I've whittled the relevant part of the webpage source code down to

17/12/2009</span>

<a href="/news/archive/40698/europa-league-draw.html">EUROPA LEAGUE DRAW

I'm using _INetGetSource() and _StringBetween(), but my head is battered trying to extract the text I want, which is "17/12/2009", "/news/archive/40698/europa-league-draw.html" and "EUROPA LEAGUE DRAW".

If someone has the time, I would really appreciate a pattern, with a quick what-and-why explanation.

Even knowing I will feel quite the fool after understanding this, I had to post.

Any help appreciated.

AutoIt Absolute Beginners    Require a serial    Pause Script    Video Tutorials by Morthawt   ipify 

Monkey's are, like, natures humans.


A regular expression such as

'>([0-9]{1,2}/[0-9]{1,2}/[0-9]{4})</span>\s*<a href="([^"]*)">([^<]*)<'
should do it.

I assumed that a > comes right before the date and a < comes right after the word DRAW.

Apart from that it's pretty simple. Start a group, match any character from 0 to 9, one or two times. Then a slash, another one- or two-digit number, another slash, and a four-digit year. End group. That is the date, captured in the first group. Next come the closing </span>, any amount of whitespace, and the start of the link tag. Inside the quotes, read characters up to the next " and end the second group. Finally, read everything up to the start of the next HTML tag as the third group.
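To make the walk-through above easier to follow, here is a minimal sketch that assembles the same pattern from labelled pieces. The variable names and the sample text (with the assumed leading > and trailing <) are only for illustration.

```autoit
; Sample text, assuming a '>' before the date and a '<' after the title.
Local $sText = '>17/12/2009</span>' & @CRLF & _
        '<a href="/news/archive/40698/europa-league-draw.html">EUROPA LEAGUE DRAW<'

; The pattern, split into the pieces described above (names are hypothetical).
Local $sDate  = '([0-9]{1,2}/[0-9]{1,2}/[0-9]{4})' ; group 1: d/m/yyyy
Local $sGap   = '</span>\s*<a href="'              ; closing span, whitespace, link start
Local $sLink  = '([^"]*)'                          ; group 2: everything up to the next "
Local $sTitle = '">([^<]*)<'                       ; group 3: everything up to the next <

Local $sPattern = '>' & $sDate & $sGap & $sLink & $sTitle

; Flag 3 returns an array of all captured groups.
Local $aParts = StringRegExp($sText, $sPattern, 3)
If @error Then
    ConsoleWrite("No match, @error = " & @error & @CRLF)
Else
    For $i = 0 To UBound($aParts) - 1
        ConsoleWrite("Group " & ($i + 1) & ": " & $aParts[$i] & @CRLF)
    Next
EndIf
```

Joining the pieces gives exactly the one-line pattern quoted earlier, so the split is purely for readability.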

Edited by Richard Robertson

This is just :- carrying the ball over the line; driving the nail all the way home; or, adding the finishing touches.

It uses a very slightly modified Richard Robertson regular expression pattern.

#include <Array.au3>

Local $sStr = "17/12/2009</span>" & @CRLF & _
        '<a href="/news/archive/40698/europa-league-draw.html">EUROPA LEAGUE DRAW'

ConsoleWrite(StringRegExpReplace($sStr, _
        '(\d{1,2}/\d{1,2}/\d{4})</span>\s*<a href="([^"]*)">([^<]*)', _
        "\1" & @CRLF & "\2" & @CRLF & "\3") & @CRLF)


Local $aArr = StringRegExp($sStr, _
        '(\d{1,2}/\d{1,2}/\d{4})</span>\s*<a href="([^"]*)">([^<]*)', 3)
_ArrayDisplay($aArr)

Thank you very much for your time, gents.

I'm still having problems though.

Although Malkey's example works a treat on the string in it, neither pattern works on the string returned by _StringBetween().

It's not as it appears in my quote, so that could be the problem.

Here's how it appears in a MsgBox (I changed the _StringBetween() code to add > before the date and < at the end of the string).

Posted Image

Noted that \d{1,2} is the same as [0-9]{1,2}.

Not fully grasping the last part yet, but am I correct thinking the red is ignored and the green is matched ?

'>([0-9]{1,2}/[0-9]{1,2}/[0-9]{4})</span>\s*<a href="([^"]*)">([^<]*)<'

EDIT: just noticed the string changed on account of the source changing.

EDIT2: using flag 3 the error is 1 "Array is invalid. No matches."

Edited by JohnOne


Not fully grasping the last part yet, but am I correct thinking the red is ignored and the green is matched?

'>([0-9]{1,2}/[0-9]{1,2}/[0-9]{4})</span>\s*<a href="([^"]*)">([^<]*)<'

No, a match requires something to match each part:

'>'

...followed by something that matches the group '([0-9]{1,2}/[0-9]{1,2}/[0-9]{4})'

...followed by something that matches '</span>\s*<a href="'

...followed by something that matches the group '([^"]*)'

...followed by '">'

...followed by something that matches the group '([^<]*)'

...followed by '<'

The leading '>' and trailing '<' are not in your original string. This works with your original string:

#include <Array.au3>

Global $sString = '17/12/2009</span>' & @CRLF & '<a href="/news/archive/40698/europa-league-draw.html">EUROPA LEAGUE DRAW'
Global $sRegExp = '([0-9]{1,2}/[0-9]{1,2}/[0-9]{4})</span>\s*<a href="([^"]*)">([^<]*)'
Global $RET = StringRegExp($sString, $sRegExp, 3)
If @error Then
    ConsoleWrite("@error = " & @error & @LF)
Else
    _ArrayDisplay($RET, "$RET")
EndIf

;)

Edited by PsaltyDS
Valuater's AutoIt 1-2-3, Class... Is now in Session!For those who want somebody to write the script for them: RentACoder"Any technology distinguishable from magic is insufficiently advanced." -- Geek's corollary to Clarke's law

I am starting to understand the pattern now, but I still can't grasp why the extracted string won't match.

The two strings look different in the MsgBox.

This works

#include <Inet.au3>
#include <String.au3>
#include <Array.au3>
;Global $Url = "http://www.evertonfc.com/news/news-archive.html"
;Global $sFile = _INetGetSource($Url)
Global $sString = '17/12/2009</span>' & @CRLF & '<a href="/news/archive/40698/europa-league-draw.html">EUROPA LEAGUE DRAW'
MsgBox(0,"String",$sString)
Global $sRegExp = '([0-9]{1,2}/[0-9]{1,2}/[0-9]{4})</span>\s*<a href="([^"]*)">([^<]*)'
Global $RET = StringRegExp($sString, $sRegExp, 3)
If @error Then
    ConsoleWrite("@error = " & @error & @LF)
Else
    _ArrayDisplay($RET, "$RET")
EndIf

This does not

#include <Inet.au3>
#include <String.au3>
#include <Array.au3>
;#cs
Global $Url = "http://www.evertonfc.com/news/news-archive.html"
Global $sFile = _INetGetSource($Url)
Global $pattern = '(\d{1,2}/\d{1,2}/\d{4})</span> \s* <a href="([^"]*)">([^<]*)'
Global $sString = _StringBetween($sFile,'<span class="date"','/a>',-1)
MsgBox(0,"",$sString[0])
ConsoleWrite(@CRLF & $sString[0])
Global $sString1 = StringRegExp($sString,$pattern,3)
If @error Then
MsgBox(0,"",@error)
Else
    _ArrayDisplay($sString1)
EndIf

My head's hurting now.

Edit

This doesn't work either

#include <Inet.au3>
#include <String.au3>
#include <Array.au3>
;Global $Url = "http://www.evertonfc.com/news/news-archive.html"
;Global $sFile = _INetGetSource($Url)
Global $sString = '17/12/2009</span>' & @CRLF & '<a' & @CRLF & 'href="/news/archive/40698/europa-league-draw.html">EUROPA LEAGUE DRAW'
MsgBox(0,"String",$sString)
Global $sRegExp = '([0-9]{1,2}/[0-9]{1,2}/[0-9]{4})</span>\s*<a href="([^"]*)">([^<]*)'
Global $RET = StringRegExp($sString, $sRegExp, 3)
If @error Then
    ConsoleWrite("@error = " & @error & @LF)
Else
    _ArrayDisplay($RET, "$RET")
EndIf
Edited by JohnOne


In the first of your failing examples, we don't get to see what $sString contains before the StringRegExp() is run. Post the failing content.

In the second example, you have a clearly non-matching string because of the @CRLF between '<a' and 'href="...'. The pattern is looking for a literal space there, not just any whitespace.
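As a sketch of that point, the script below runs the original pattern and a relaxed variant against the string with the line break. The relaxed pattern (`<a\s+href`) is my own suggestion, not something posted above. It accepts any run of whitespace between `<a` and `href`.

```autoit
; The failing string: a line break sits between '<a' and 'href'.
Local $sString = '17/12/2009</span>' & @CRLF & '<a' & @CRLF & _
        'href="/news/archive/40698/europa-league-draw.html">EUROPA LEAGUE DRAW'

; Original pattern: needs exactly one literal space between '<a' and 'href'.
Local $sStrict = '([0-9]{1,2}/[0-9]{1,2}/[0-9]{4})</span>\s*<a href="([^"]*)">([^<]*)'

; Relaxed variant: \s+ accepts one or more whitespace characters of any kind.
Local $sRelaxed = '([0-9]{1,2}/[0-9]{1,2}/[0-9]{4})</span>\s*<a\s+href="([^"]*)">([^<]*)'

; With flag 3, @error is 1 when nothing matched and 0 on success.
StringRegExp($sString, $sStrict, 3)
ConsoleWrite("strict  @error = " & @error & @CRLF)

StringRegExp($sString, $sRelaxed, 3)
ConsoleWrite("relaxed @error = " & @error & @CRLF)
```

The strict pattern should report no match here while the relaxed one succeeds. Whether the real page ever splits the tag this way is a separate question.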

;)


This is what $sString outputs in the console, but it looks completely different in the msgbox.

">18/12/2009</span>

<a href="/news/archive/ossie-eyes-prem-return.html">OSSIE EYES PREM RETURN<"

Without the double quotes, of course.

If I am correct, the apparent whitespace contains a {RETURN} of some sort and a {TAB}, along with what seem to be spaces.

And it's not showing here the way it shows in the console or MsgBox.


Still works for me with lots of misc whitespace inserted:
#include <Array.au3>

Global $sString = '>18/12/2009</span> ' & @TAB & '  ' & @CR & '   ' & @LF & ' ' & @CRLF & _
        ' <a href="/news/archive/ossie-eyes-prem-return.html">OSSIE EYES PREM RETURN<'
Global $sRegExp = '([0-9]{1,2}/[0-9]{1,2}/[0-9]{4})</span>\s*<a href="([^"]*)">([^<]*)'
Global $RET = StringRegExp($sString, $sRegExp, 3)
If @error Then
    ConsoleWrite("@error = " & @error & @LF)
Else
    _ArrayDisplay($RET, "$RET")
EndIf

;)


Get $bString = StringToBinary($sString) and check out the results to see if there is something odd in there.

;)


No joy, No output from $bString in console.

Nothing ever goes smoothly for me, ever.

#include <Inet.au3>
#include <String.au3>
#include <Array.au3>
Global $Url = "http://www.evertonfc.com/news/news-archive.html"
Global $sFile = _INetGetSource($Url)
Global $sString = _StringBetween($sFile,'<span class="date">','</a>',-1)
Global $sRegExp = '([0-9]{1,2}/[0-9]{1,2}/[0-9]{4})</span>\s*<a href="([^"]*)">([^<]*)'
ConsoleWrite($sString[0])
$bString = StringToBinary($sString[0])
ConsoleWrite($bString)

Console output

>Running:(3.3.0.0):F:\Program Files\AutoIt3\autoit3.exe "F:\Test1\test.au3"    
18/12/2009</span>
                        <a href="/news/archive/ossie-eyes-prem-return.html">OSSIE EYES PREM RETURN
18/12/2009</span>
                        <a href="/news/archive/ossie-eyes-prem-return.html">OSSIE EYES PREM RETURN+>21:14:22 AutoIT3.exe ended.rc:0
+>21:14:23 AutoIt3Wrapper Finished
>Exit code: 0    Time: 11.668

EDIT for clarity

Fixed code and output has changed

Edited by JohnOne


Doh! ;)

$sString is an array from _StringBetween().

You'll have to use $sString[0], not just $sString. Did you make the same mistake passing it into StringRegExp()?

:evil:

Hint: Yes, you did, in the second script in post #7. :evil:
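A minimal sketch of handling the array properly, under some loud assumptions: `$sFile` below is a hand-made stand-in for the downloaded page, not the real source, and the loop plus the `$aItems` and `$aMatch` names are only illustrative.

```autoit
#include <String.au3>

; Stand-in for the page source (assumption: two news items in this layout).
Local $sFile = '<span class="date">18/12/2009</span>' & @CRLF & @TAB & _
        '<a href="/news/archive/ossie-eyes-prem-return.html">OSSIE EYES PREM RETURN</a>' & @CRLF & _
        '<span class="date">17/12/2009</span>' & @CRLF & @TAB & _
        '<a href="/news/archive/40698/europa-league-draw.html">EUROPA LEAGUE DRAW</a>'
Local $sRegExp = '([0-9]{1,2}/[0-9]{1,2}/[0-9]{4})</span>\s*<a href="([^"]*)">([^<]*)'

; _StringBetween() returns an array of every section found, not a single string.
Local $aItems = _StringBetween($sFile, '<span class="date">', '</a>')
If Not IsArray($aItems) Then
    ConsoleWrite("_StringBetween() found nothing" & @CRLF)
    Exit
EndIf

; Pass one element at a time to StringRegExp(), never the array itself.
For $i = 0 To UBound($aItems) - 1
    Local $aMatch = StringRegExp($aItems[$i], $sRegExp, 3)
    If @error Then ContinueLoop ; skip sections that do not fit the pattern
    ConsoleWrite($aMatch[0] & " | " & $aMatch[1] & " | " & $aMatch[2] & @CRLF)
Next
```

Checking IsArray() first also guards against the case where _StringBetween() finds nothing at all.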

Edited by PsaltyDS

Oops, I changed the wrong line in the code. It is now

(I was trying different flags)

$bString = StringToBinary($sString[0])

Output

>Running:(3.3.0.0):F:\Program Files\AutoIt3\autoit3.exe "F:\Test1\test.au3"    
18/12/2009</span>
                        <a href="/news/archive/ossie-eyes-prem-return.html">OSSIE EYES PREM RETURN
18/12/2009</span>
                        <a href="/news/archive/ossie-eyes-prem-return.html">OSSIE EYES PREM RETURN+>21:14:22 AutoIT3.exe ended.rc:0
+>21:14:23 AutoIt3Wrapper Finished
>Exit code: 0    Time: 11.668

Well, that's certainly not right, it's exactly the same ;)

EDIT2 edited post 13 to correct code and output

Edited by JohnOne


Post exactly what you're running now. Because this:

Global $sString[1] = ['18/12/2009</span>' & @CRLF & _
        '<a href="/news/archive/ossie-eyes-prem-return.html">OSSIE EYES PREM RETURN18/12/2009</span>']

Global $bString = StringToBinary($sString[0])
ConsoleWrite("$bString = " & $bString & @LF)

Outputs this:

>Running:(3.3.2.0):C:\Program Files\AutoIt3\autoit3.exe "C:\Program Files\AutoIt3\Testing\Test1.au3"    
$bString = 0x31382F31322F323030393C2F7370616E3E0D0A3C6120687265663D222F6E6577732F617263686976652F6F737369652D657965732D7072656D2D72657475726E2E68746D6C223E4F535349452045594553205052454D2052455455524E31382F31322F323030393C2F7370616E3E

;)


Just keeps getting weirder

1 minute ago

#include <Inet.au3>
#include <String.au3>
#include <Array.au3>
Global $Url = "http://www.evertonfc.com/news/news-archive.html"
Global $sFile = _INetGetSource($Url)
Global $sString = _StringBetween($sFile,'<span class="date">','</a>',-1)
Global $sRegExp = '([0-9]{1,2}/[0-9]{1,2}/[0-9]{4})</span>\s*<a href="([^"]*)">([^<]*)'
ConsoleWrite($sString[0] & @CRLF)
$bString = StringToBinary($sString[0])
ConsoleWrite($bString)

$blah = "abcdefg"
$blahblah = StringToBinary($blah)
ConsoleWrite($blahblah)

>"F:\Program Files\AutoIt3\SciTE\AutoIt3Wrapper\AutoIt3Wrapper.exe" /run /prod /ErrorStdOut /in "F:\Test1\test.au3" /autoit3dir "F:\Program Files\AutoIt3" /UserParams    
+>21:21:37 Starting AutoIt3Wrapper v.2.0.0.1    Environment(Language:0409  Keyboard:00000809  OS:WIN_VISTA/  CPU:X86 OS:X86)
>Running AU3Check (1.54.14.0)  from:F:\Program Files\AutoIt3
+>21:21:37 AU3Check ended.rc:0
>Running:(3.3.0.0):F:\Program Files\AutoIt3\autoit3.exe "F:\Test1\testp.au3"    
18/12/2009</span>
                        <a href="/news/archive/ossie-eyes-prem-return.html">OSSIE EYES PREM RETURN
18/12/2009</span>
                        <a href="/news/archive/ossie-eyes-prem-return.html">OSSIE EYES PREM RETURNabcdefg+>21:21:42 AutoIT3.exe ended.rc:0
+>21:21:43 AutoIt3Wrapper Finished
>Exit code: 0    Time: 6.446


Changed ConsoleWrite($bString) to ConsoleWrite($bString & @CRLF)

Now outputting

>Running:(3.3.0.0):F:\Program Files\AutoIt3\autoit3.exe "F:\Test1\test.au3"    
18/12/2009</span>
                        <a href="/news/archive/ossie-eyes-prem-return.html">OSSIE EYES PREM RETURN
0x31382F31322F323030393C2F7370616E3E0D0A2020202020202020202020202020202020202020093C6120687265663D222F6E6577732F617263686976652F6F737369652D657965732D7072656D2D72657475726E2E68746D6C223E4F535349452045594553205052454D2052455455524E
abcdefg

+>21:32:07 AutoIT3.exe ended.rc:0
+>21:32:08 AutoIt3Wrapper Finished
>Exit code: 0    Time: 6.744
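A possible explanation for the change in output, offered as an assumption rather than a confirmed diagnosis: as far as I recall from the ConsoleWrite() help page, binary data is written as-is rather than as hex text, and appending @CRLF turns the value into a string first. A small sketch that makes the conversion explicit with String():

```autoit
; A short test string with a line break and a tab at the end.
Local $bString = StringToBinary("abc" & @CRLF & @TAB)

; String() casts the binary value to its hex text form before it is written.
ConsoleWrite(String($bString) & @CRLF) ; hex text: 0x6162630D0A09
```

With the explicit cast the hex dump no longer depends on what else is concatenated onto the value.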


That section of white spaces is: 0x3E0D0A2020202020202020202020202020202020202020093C

That's just @CRLF, some spaces, and one @TAB between the > and <. Nothing strange there.
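To double-check that reading, here is a minimal sketch that rebuilds the same gap (@CRLF, 20 spaces, one @TAB) and runs the pattern from the earlier posts against it. The string is reconstructed by hand from the hex dump, so it is an approximation of the real page content.

```autoit
#include <String.au3>

; Rebuild the gap from the hex dump: 0D0A, twenty 20s, 09.
Local $sGap = @CRLF & _StringRepeat(" ", 20) & @TAB
ConsoleWrite(String(StringToBinary($sGap)) & @CRLF)

; Returns 1 if the whole gap consists of characters that \s matches.
ConsoleWrite("Gap is all whitespace: " & StringRegExp($sGap, '\A\s+\z') & @CRLF)

; The extracted string as shown in the console output, with the gap spliced in.
Local $sString = '18/12/2009</span>' & $sGap & _
        '<a href="/news/archive/ossie-eyes-prem-return.html">OSSIE EYES PREM RETURN'
Local $sRegExp = '([0-9]{1,2}/[0-9]{1,2}/[0-9]{4})</span>\s*<a href="([^"]*)">([^<]*)'

Local $aRet = StringRegExp($sString, $sRegExp, 3)
If @error Then
    ConsoleWrite("No match, @error = " & @error & @CRLF)
Else
    ConsoleWrite($aRet[0] & " | " & $aRet[1] & " | " & $aRet[2] & @CRLF)
EndIf
```

If this matches, the whitespace itself is not the obstacle, and the remaining suspect is what gets passed into StringRegExp().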

;)

Edited by PsaltyDS

Yup, just been looking myself

I got it to 0D0A202020202020202020202020202020202020202009 = @CRLF, 20 spaces, and one TAB.

That just makes it worse, I really am cracking up now ;)

EDIT: Going to try a different O/S

Thanks again mate, much obliged.

Edited by JohnOne
