Marc

[solved] StringRegExp command is kidding me


Posted (edited)

Hi all,

I'm confused. (Even more than usual, that is)

I am trying to capture the URL of the translated German Dilbert comic.

If I put the code from the web page into the RegExpQuickTester, I get the desired URL back.

Using the very same regex in my script, the URL is not matched.

Most likely it's a very stupid error I made, but I can't figure it out. New year really starts great 🤣

 

HttpSetUserAgent('Mozilla / 5.0')

; Tag in Sourcecode of Webpage is: ; src="https://www.ingenieur.de/wp-content/uploads/2020/12/Dilbert_d_3054_2021-01-01_F-980x305.jpg"
; so the complete source code of the webpage should be replaced with the match $1 of the regex. Result should be
; https://www.ingenieur.de/wp-content/uploads/2020/12/Dilbert_d_3054_2021-01-01_F-980x305.jpg

Local $source = _INetGetSourceEx('http://www.ingenieur.de/Spiel-Spass/Dilbert')
Local $suche = '(?is).* src="(https://www.ingenieur.de/wp-content/uploads/.+?/Dilbert_._.*?_2021-01-01_.*?\.jpg).*'
Local $url = ""

If StringRegExp($source, $suche) Then
    $url = StringRegExpReplace($source, $suche, "$1")
    MsgBox(0,"", $url)
Else
    MsgBox(16,"oops", "RegEx problem, @error = " & @error)
    ClipPut($suche & @CRLF & $source)
EndIf

Func _INetGetSourceEx($s_URL, $bString = True)
    ; https://www.autoitscript.com/forum/topic/107500-inetgetsource-utf-8-problem/
    Local $sString = InetRead($s_URL, 1)
    Local $nError = @error, $nExtended = @extended
    If $bString Then $sString = BinaryToString($sString, 4)
    Return SetError($nError, $nExtended, $sString)
EndFunc   ;==>_INetGetSourceEx

best regards,

Marc

Edited by Marc
improved demo source code for more clarity

It's my job to comfort the disturbed and to disturb the comfortable.
My Projects: Profiler, MakeSFX, UserInfo, Simple Robocopy Progressbar

Posted (edited)

I'm trying to get the very specific Image-URL out of the complete source code of the web page, so this regex would be a little bit too unspecific.

No idea why the regex works in the regex tool and in regex101, but not in my script... 

Update: If the complete source code is pasted into the regex tester tool, it does not find the URL either.

If the text before line 567 is removed, it works.

What am I missing here? (the point, obviously)

source.txt

Edited by Marc


Hi Marc,
It works for me with (?i) but not with (?is).
With (?is), it fails as soon as the first .*? in the pattern is reached.

[Screenshot: Dilbert.png]

Result (truncated in the screenshot):
0: https://www.ingenieur.de/wp-content/uploads/2020/12/Dilbert_d_3054_2021-01-01_F-980x305.jpg
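
For reference, the only thing the s in (?is) changes is whether the dot may match line breaks. A minimal sketch with a made-up two-line string ($sText is an assumption for illustration, not part of the page source):

```autoit
; Hypothetical two-line sample, only for illustration
Local $sText = 'first line' & @LF & 'src="x.jpg"'

; Without (?s), the dot stops at a line break, so .* cannot get from "first" to "src"
ConsoleWrite(StringRegExp($sText, '(?i)first.*src') & @LF) ; prints 0

; With (?s), the dot also matches line breaks, so .* can span lines
ConsoleWrite(StringRegExp($sText, '(?is)first.*src') & @LF) ; prints 1
```

In a pattern that starts and ends with .*, this decides whether a match covers a single line or the whole page source.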


What is it you are looking for? Picture of the day? Last picture on the page? You didn't say what, only that it does not work!

Posted (edited)

@Nine: Indeed, I am trying to catch the current comic of the day. If available, which is not guaranteed.

@pixelsearch: hm, using (?i) matches, but the result of the StringRegExpReplace is the whole source code of the page
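
That behaviour can be reproduced on a small scale. The sketch below uses a made-up three-line string in place of the real page source (the example.com URL and $sSource are assumptions): without (?s), only the matching line is replaced and everything else comes back unchanged.

```autoit
; Hypothetical three-line stand-in for the page source
Local $sSource = '<html>' & @LF & '<img src="https://example.com/a.jpg">' & @LF & '</html>'

; (?i): .* cannot cross line breaks, so the match covers only the <img> line.
; Only that line is replaced by $1; the other two lines are returned unchanged.
ConsoleWrite(StringRegExpReplace($sSource, '(?i).* src="(.+?\.jpg).*', "$1") & @LF)
ConsoleWrite('----' & @LF)

; (?is): .* spans lines, so the match covers the whole text
; and the result is reduced to the captured URL.
ConsoleWrite(StringRegExpReplace($sSource, '(?is).* src="(.+?\.jpg).*', "$1") & @LF)
```

On a string this small the (?is) variant works as intended; the sketch says nothing about why it fails on the full page source.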

In older versions of my script, I worked with StringInStr to get the position, then switched to regex to be more flexible and have a shorter syntax, because the site is sometimes doing funny things like storing the image in a different folder.

For today's comic (2021-01-02), the right URL is

https://www.ingenieur.de/wp-content/uploads/2020/12/Dilbert_d_3055_2021-01-02_F-980x305.jpg

To get the URL, this one works:

Local $suche = '(?is).* src="(https://www.ingenieur.de/wp-content/uploads/2020/12/Dilbert_._.*?_' & @YEAR & '-' & @MON & '-' & @MDAY & '.*?\.jpg).*'

But as you can see, I had to change the path of the image to "2020/12" instead of "2021/01". So I wanted to skip the subfolder-part with wildcards, but surprisingly, this one does not match:

Local $suche = '(?is).* src="(https://www.ingenieur.de/wp-content/uploads/.+?/Dilbert_._.*?_' & @YEAR & '-' & @MON & '-' & @MDAY & '.*?\.jpg).*'

Hmm.

Edited by Marc

Posted (edited)

This works for me using the provided source file

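$source = FileRead("source.txt")
; @MDAY-1 selects yesterday's date (note: it yields "00" on the 1st of a month)
$date = @YEAR & '-' & @MON & '-' & StringFormat("%02i", @MDAY-1)
; No leading or trailing .* and no (?s): the pattern only describes the URL itself
$pattern = '(?i)src="(https://www.ingenieur.de/wp-content/uploads/.+?/Dilbert.+?' & $date & '.+?\.jpg)'
; Flag 1 returns an array of the captured groups; without a match the result is not an array
$res = StringRegExp($source, $pattern, 1)
MsgBox(0,"", IsArray($res) ? $res[0] : "nothing today")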

 

Edited by mikell

Posted (edited)

@pixelsearch: after some thinking, you're right.

@mikell: yes, that works :) 

@all: seems I tricked myself by trying to replace the whole source code of the site with the match instead of just keeping the resulting match.
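
Keeping only the captured group for today's date might look like the sketch below. The sample source, the comic number 0000 and the use of [^"] in place of the dot are assumptions for illustration, not taken from the thread; the real script would pass the fetched page source instead of $sSource.

```autoit
; Today's date in the form used in the image file name, e.g. 2021-01-02
Local $sDate = @YEAR & '-' & @MON & '-' & @MDAY

; Hypothetical stand-in for the page source (comic number 0000 is made up)
Local $sSource = '<p>text</p>' & @LF & _
        '<img src="https://www.ingenieur.de/wp-content/uploads/2020/12/Dilbert_d_0000_' & $sDate & '_F-980x305.jpg">' & @LF & _
        '<p>more text</p>'

; [^"] keeps every wildcard inside the quoted attribute value,
; so no leading or trailing .* and no (?s) are needed
Local $sPattern = '(?i)src="(https://www\.ingenieur\.de/wp-content/uploads/[^"]+?/Dilbert_[^"]*?' & $sDate & '[^"]*?\.jpg)"'

; Flag 1 returns an array with the captured group; without a match the result is not an array
Local $aMatch = StringRegExp($sSource, $sPattern, 1)
ConsoleWrite((IsArray($aMatch) ? $aMatch[0] : "nothing today") & @LF)
```

Because the subfolder is matched by a wildcard, the same pattern would keep working if the image is stored under a different uploads folder.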

🤦‍♂️

See also: Self-Awareness (savagechickens.com)

Edited by Marc
