Can I make a Regular Expression start matching after "String1" and stop matching after "String2"?

hawkair · August 3, 2023

Hi

I have a text like this: ($txt=)

<div class="titlereference-overview-section">
        Directors:
        <ul class="ipl-inline-list">
            <li class="ipl-inline-list__item">
<a href="/name/nm8681530">John Smith</a>,
<a href="/name/nm8681530">Jim </a>,
<a href="/name/nm8681530">Jack</a>
            </li>
                <li class="ipl-inline-list__item">
    <a href="/title/tt8806524/fullcredits" class=>See more &raquo;</a>
                </li>
        </ul>
    </div>
    <div class="titlereference-overview-section">
        Writers:
        <ul class="ipl-inline-list">
            <li class="ipl-inline-list__item">
<a href="/name/nm8681530">Kirsten</a>,
<a href="/name/nm8681530">Jessica</a>,
<a href="/name/nm8681530">Maya</a>
            </li>
                <li class="ipl-inline-list__item">
    <a href="/title/tt8806524/fullcredits" class=>See more &raquo;</a>
                </li>
        </ul>
    </div>
    <div class="titlereference-overview-section">
        Stars:
        <ul class="ipl-inline-list">
            <li class="ipl-inline-list__item">
<a href="/name/nm0001772">Patrick Stewart</a>,
<a href="/name/nm0403335">Michelle Hurd</a>,
<a href="/name/nm0005394">Jeri Ryan</a>
            </li>
                <li class="ipl-inline-list__item">
    <a href="/title/tt8806524/fullcredits" class=>See more &raquo;</a>
                </li>
        </ul>

I want to get the writers.

This code

$aWriters = StringRegExp($txt, '<a href="/name/nm.*?">([^<]*)</a>', 3)

gets all names.

This code

#include <Array.au3> ; needed for _ArrayToString

;To check quickly, copy the text then run the code
$txt = ClipGet()
$txt = StringRegExpReplace($txt, "(?s)^.*Writers", "")
$txt = StringRegExpReplace($txt, "(?s)</ul>.*", "")
$aWriters = StringRegExp($txt, '<a href="/name/nm.*?">([^<]*)</a>', 3)
MsgBox(262144, "Writers", _ArrayToString($aWriters, ","))

deletes the text before "Writers" and everything after the end of the Writers section, then gets all the writers' names. Note that the "Stars" section may not always follow "Writers".

Can I do this with a single RegExp command?

mikell · August 3, 2023

You can strip all the unwanted parts with a single StringRegExpReplace (SRER)

$txt = Clipget()
$s = StringRegExpReplace($txt, '(?s)^.*Writers(.*?/nm\d+">)|\R<a(?1)|</a>|\s+</li>.*$', "")
MsgBox(0,"", $s)

Edit
It's a cute challenge but - IMHO - your multipart solution is somewhat more versatile

Edit2
Much nicer: here is how to get this in a 1D array (and BTW a better answer to the question in the title of the topic)

#include <Array.au3> ; needed for _ArrayDisplay

$txt = ClipGet()
$aWriters = StringRegExp($txt, '(?s)(?:.*?Writers|\G(?!</a>\s+</li>)).*?/nm\d+">([^<]*)', 3)
_ArrayDisplay($aWriters)

Edited August 3, 2023 by mikell

hawkair · August 4, 2023

mikell, thank you

It works exactly as I want.

I have no words...

Now I can merrily go off into my cave and have fun figuring out how it does it

Edit:

I used Google translate and the Autoit help file as dictionary and got the following:

StringRegExp($txt, '(?s)(?:.*?Writers|\G(?!</a>\s+</li>)).*?/nm\d+">([^<]*)', 3)

Skip all text up to Writers without capturing it, or, starting where the previous match ended (\G), continue only if the text at this position is not '</a>\s+</li>'. Then follows the actual pattern to match: '.*?/nm\d+">([^<]*)'

Edited August 4, 2023 by hawkair

mikell · August 4, 2023

4 hours ago, hawkair said:

Now I can merrily go off into my cave and have fun figuring out how it does it

Sorry I didn't comment on this \G magic
The definition of \G in the helpfile is not very clear; better look at this one, especially "\G matches at the end of the previous match"

The conditions in the title question 'start matching after "String1" and stop matching after "String2" ' are defined in both parts of the alternation

StringRegExp($txt, '(?sx) (?: ^.*?Writers | \G (?!</a>\s+</li>) ) .*?/nm\d+">([^<]*)', 3)

How it works :
- using the left part of the alternation and the final pattern, the regex runs up to 'Writers', searches and finds "Kirsten"
- then using the right part of the alternation, \G matches right after "Kirsten", the assertion is true so the regex restarts searching and finds "Jessica"
- \G matches right after "Jessica", in the same way the regex keeps on searching and finds "Maya"
- \G matches right after "Maya", but at this position the condition is not fulfilled any more, the regex fails and returns the result
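
As a minimal sketch of the steps above (not part of the original post): the subject below is a made-up, trimmed-down version of the OP's HTML, and the names $sTxt and $sPattern are only illustrative. The pattern is the same one, laid out over several lines with (?x) comments so that each part can be tied to a step.

```autoit
#include <Array.au3>

; Hypothetical, shortened subject: each list is closed by </a> + whitespace + </li>
Local $sTxt = 'Directors: <li>' & @CRLF & _
        '<a href="/name/nm1">John Smith</a>,' & @CRLF & _
        '<a href="/name/nm2">Jack</a>' & @CRLF & _
        '</li> Writers: <li>' & @CRLF & _
        '<a href="/name/nm3">Kirsten</a>,' & @CRLF & _
        '<a href="/name/nm4">Jessica</a>,' & @CRLF & _
        '<a href="/name/nm5">Maya</a>' & @CRLF & _
        '</li> Stars: <li>' & @CRLF & _
        '<a href="/name/nm6">Patrick Stewart</a>' & @CRLF & _
        '</li>'

; Same pattern as above; (?x) ignores the layout whitespace and treats
; everything from # to the end of the line as a comment.
Local $sPattern = '(?sx)' & @CRLF & _
        '(?:  ^.*?Writers          # first match: run up to the label' & @CRLF & _
        '  |  \G (?!</a>\s+</li>)  # later matches: resume at the previous end, unless the list closes here' & @CRLF & _
        ')' & @CRLF & _
        '.*?/nm\d+">               # lazily skip to the next name link' & @CRLF & _
        '([^<]*)                   # capture the name'

Local $aWriters = StringRegExp($sTxt, $sPattern, 3)
If @error Then
    ConsoleWrite("No match" & @CRLF)
Else
    ConsoleWrite(_ArrayToString($aWriters, "|") & @CRLF) ; Kirsten|Jessica|Maya
EndIf
```

The Directors and Stars names are left out because the first alternative only fires once (at the label) and the lookahead closes the door at the first </a> followed by whitespace and </li>.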


Edited August 4, 2023 by mikell
typo(s)

pixelsearch · August 4, 2023

@mikell very nice ! Yesterday, I really felt you'd come back with your Edit2 to suggest a solution with \G or similar . jchd wrote once that he should think more of this \G thing

As you wrote, the \G explanation isn't really clear in the help file, that's why it took me time (with your help) to achieve the "pseudo help file" in RegExp Quick Tester, especially it had to be a short one-liner explanation :

In the previous pic, as a writer can be named... "Writers" (found some guys named "Writers" on Google !), I added some tests in the left part of the alternation, e.g. writers:\s+ instead of writers, no big deal. Note that the order of the tests in the alternation is important (writers: first, on the left side of the alternation, \G on the right side). Now if you don't mind, I've got 2 questions :

1) In case "Writers:" isn't found in the subject, can we add something in the right part of the alternation, so the regex engine returns nothing ?
Because currently, if you change "Writers:" to "Wrs:" in the subject for example, then this is returned (with OP's subject and your pattern, or in my previous pic), and it would be better to avoid it :

John Smith
Jim 
Jack

I don't think we can add a positive look-behind (e.g. requiring "Writers:" to be found before each \G match) because look-behind doesn't work with variable-length patterns; maybe \K or something else ?
If nothing can be easily done, then question 2 may bring the answer :

2) A test shows that a positive lookahead (instead of the negative lookahead found in your pattern or in the pic above) solves this kind of situation... but I don't understand why :

(?is)(?:.*?writers:\s+|\G</a>(?=,)).*?/nm\d+">([^<]*)

With this positive lookahead, if "Writers:" isn't found in the subject, then nothing is returned (which is a good thing !). So the 2nd question is: when "Writers:" isn't found in the subject, why does the negative lookahead return results (which are confusing) when the positive lookahead doesn't return anything (which seems more correct) ?

Thanks for reading

mikell · August 4, 2023

2 hours ago, pixelsearch said:

In case "Writers:" isn't found in the subject

Ahhh yesss, I didn't pay attention to this

The answer to your questions is written in the definition of this nice \G spot :
"matches at the beginning of the subject string OR at the end of the previous match"

To solve the problem here we just have to make \G not match at the beginning of the string

StringRegExp($txt, '(?sx) (?: ^.*?Writers | \G (?!\A|</a>\s+</li>) ) .*?/nm\d+">([^<]*)', 3)

So the answer to the 2nd question becomes obvious now

BTW I still prefer the negative lookahead, which allows defining a limit as the OP asked for
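
A small sketch of the difference the \A guard makes, assuming a made-up subject that contains no "Writers" label at all. The names $sNoGuard, $sGuarded and the helper ShowResult are only illustrative and do not come from the thread.

```autoit
#include <Array.au3>

; Hypothetical subject WITHOUT any "Writers" label
Local $sTxt = 'Directors: <li>' & @CRLF & _
        '<a href="/name/nm1">John Smith</a>,' & @CRLF & _
        '<a href="/name/nm2">Jack</a>' & @CRLF & _
        '</li>'

; \G alone also matches at the very start of the subject...
Local $sNoGuard = '(?s)(?:^.*?Writers|\G(?!</a>\s+</li>)).*?/nm\d+">([^<]*)'
; ...unless the lookahead rejects that position with \A
Local $sGuarded = '(?s)(?:^.*?Writers|\G(?!\A|</a>\s+</li>)).*?/nm\d+">([^<]*)'

ShowResult("without \A", StringRegExp($sTxt, $sNoGuard, 3)) ; John Smith|Jack
ShowResult("with \A", StringRegExp($sTxt, $sGuarded, 3))    ; no match

Func ShowResult($sLabel, $vResult)
    ; StringRegExp returns an array only when something matched
    If IsArray($vResult) Then
        ConsoleWrite($sLabel & ": " & _ArrayToString($vResult, "|") & @CRLF)
    Else
        ConsoleWrite($sLabel & ": no match" & @CRLF)
    EndIf
EndFunc
```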

pixelsearch · August 4, 2023

Well done mikell, that \A| is really cool ! we'll have to remember to always use it to make sure no false result is ever returned when the "string to search" is not found in the subject :

(StringtoSearch|\G(?!\A|...))

If I'm not mistaken, what "saves" us in OP's subject is the fact that there is no comma after Maya (the last writer) but there is always a comma after each preceding writer (Kirsten & Jessica). If a comma followed Maya, then this would have been wrongly returned :

Kirsten
Jessica
Maya
Patrick Stewart
Michelle Hurd
Jeri Ryan

But well, in this case you sure would have found another working pattern

mikell · August 4, 2023

1 hour ago, pixelsearch said:

what "saves" us in OP's subject is the fact there is no comma after Maya

Not really. The purpose here was to find a correct way to define the limit to stop matching, and in this case it is defined by the whole subpattern </a>\s+</li>
But well, if you include in the pattern an optional comma, then you can add a comma after Maya (or remove the other commas) in the subject string and it will work

StringRegExp($txt, '(?sx) (?: ^.*?Writers | \G (?!\A|</a>,?\s+</li>) ) .*?/nm\d+">([^<]*)', 3)

Different requirements, different solutions
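
To illustrate the optional comma, here is a hedged sketch on a made-up subject where the last writer is also followed by a comma. The names $sStrict, $sOptComma and the helper ShowResult are only illustrative.

```autoit
#include <Array.au3>

; Hypothetical subject: a comma follows EVERY writer, including the last one
Local $sTxt = 'Writers: <li>' & @CRLF & _
        '<a href="/name/nm3">Kirsten</a>,' & @CRLF & _
        '<a href="/name/nm5">Maya</a>,' & @CRLF & _
        '</li> Stars: <li>' & @CRLF & _
        '<a href="/name/nm6">Patrick Stewart</a>' & @CRLF & _
        '</li>'

; Limit without the optional comma: "</a>," does not look like the end of the list
Local $sStrict = '(?s)(?:^.*?Writers|\G(?!\A|</a>\s+</li>)).*?/nm\d+">([^<]*)'
; Limit with the optional comma: "</a>" or "</a>," followed by whitespace and </li>
Local $sOptComma = '(?s)(?:^.*?Writers|\G(?!\A|</a>,?\s+</li>)).*?/nm\d+">([^<]*)'

ShowResult("strict limit", StringRegExp($sTxt, $sStrict, 3))     ; Kirsten|Maya|Patrick Stewart
ShowResult("optional comma", StringRegExp($sTxt, $sOptComma, 3)) ; Kirsten|Maya

Func ShowResult($sLabel, $vResult)
    ; StringRegExp returns an array only when something matched
    If IsArray($vResult) Then
        ConsoleWrite($sLabel & ": " & _ArrayToString($vResult, "|") & @CRLF)
    Else
        ConsoleWrite($sLabel & ": no match" & @CRLF)
    EndIf
EndFunc
```

With the strict limit the scan runs past Maya into the next list and only stops at the first </a> that is directly followed by whitespace and </li>.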

pixelsearch · August 4, 2023

@mikell Thx for the explanation. While you're still there, a complete explanation to my 2nd question from the post above could be (please correct me if I'm wrong) when Writers: isn't found in this subject :

<a href="/name/nm8681530">John Smith</a>,
<a href="/name/nm8681530">Jim </a>,
<a href="/name/nm8681530">Jack</a>
<a href="/name/nm8681530">Kirsten</a>,
<a href="/name/nm8681530">Jessica</a>,
<a href="/name/nm8681530">Maya</a>
<a href="/name/nm0001772">Patrick Stewart</a>,
<a href="/name/nm0403335">Michelle Hurd</a>,
<a href="/name/nm0005394">Jeri Ryan</a>

Pattern with negative lookahead 
(?is)(?:^.*?Writers:|\G(?!</a>\s+)).*?/nm\d+">([^<]*)

Result :
John Smith
Jim 
Jack

Negative lookahead : the left part of the alternation doesn't match (Writers: isn't found), so the engine goes back to the beginning of the string and tries the right part of the alternation. \G matches there and the negative lookahead is True, because there is no </a> at the very beginning of the string, so the end of the pattern (outside the alternation) is processed and John Smith is a match. For the next match the right part of the alternation is used again: \G matches right after John Smith and the negative lookahead is True again (the comma after </a> prevents \s+ from matching), so Jim is a match, then Jack in the same way. After Jack the negative lookahead is False (</a> is directly followed by whitespace, since there is no comma), the regex fails, and that's why there are 3 results.

Pattern with positive lookahead 
(?is)(?:^.*?Writers:|\G</a>(?=,)).*?/nm\d+">([^<]*)

Result :
None

Positive lookahead : it seems easier to explain. Same beginning : the left part of the alternation doesn't match (Writers: isn't found), so the engine goes back to the beginning of the string and tries the right part of the alternation. It fails immediately because there is no </a> at the very beginning of the string (the lookahead isn't even reached), so the regex fails: neither alternative matches and the end of the pattern is never processed.
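
The two cases above can be checked with a short sketch. The subject is a made-up, shortened version of the list in this post (no "Writers:" anywhere), the two patterns are the ones quoted above, and the names $sNegative, $sPositive and ShowResult are only illustrative.

```autoit
#include <Array.au3>

; Hypothetical subject: names only, no "Writers:" label
Local $sTxt = '<a href="/name/nm1">John Smith</a>,' & @CRLF & _
        '<a href="/name/nm2">Jim </a>,' & @CRLF & _
        '<a href="/name/nm3">Jack</a>' & @CRLF & _
        '<a href="/name/nm4">Kirsten</a>,' & @CRLF & _
        '<a href="/name/nm5">Maya</a>'

; Negative lookahead: \G succeeds at the start of the subject, so matching begins there
Local $sNegative = '(?is)(?:^.*?Writers:|\G(?!</a>\s+)).*?/nm\d+">([^<]*)'
; Positive version: \G must be followed by a literal </a>, which is absent at the start
Local $sPositive = '(?is)(?:^.*?Writers:|\G</a>(?=,)).*?/nm\d+">([^<]*)'

ShowResult("negative", StringRegExp($sTxt, $sNegative, 3)) ; John Smith|Jim |Jack
ShowResult("positive", StringRegExp($sTxt, $sPositive, 3)) ; no match

Func ShowResult($sLabel, $vResult)
    ; StringRegExp returns an array only when something matched
    If IsArray($vResult) Then
        ConsoleWrite($sLabel & ": " & _ArrayToString($vResult, "|") & @CRLF)
    Else
        ConsoleWrite($sLabel & ": no match" & @CRLF)
    EndIf
EndFunc
```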

BTW, I wonder if the very 1st ^ in pattern is mandatory, it seems to work same with or without it (?)

mikell · August 5, 2023

14 hours ago, pixelsearch said:

I wonder if the very 1st ^ in pattern is mandatory

It is not mandatory but it is recommended
Rex says here : "the regex style guide recommends using anchors whenever possible—even when your regex would match without them"
