StringRegExp and repeating a Capturing Group {x} times

Hi

I have the following sample line of text from a TV episode guide, which is also contained in the attached file:

<tr style="background: url(/_layout_v3/misc/tabletopicbg.jpg) repeat;" ><td class='b2'><a class='wlink' href='/24/episodes/613/01x24'>24 :01x24 - 11:00 P.M.-12:00 A.M.</a> (May/21/2002)<tr style='background: #EEEEEE; color: #000000;' ><td class='b2'><font color='black'>Sherry Palmer makes a move in her discontent over David's faked death. Jack offers himself in exchange for his daughter's life and squares off against the Drazens. Nina, the second mole, makes a hasty escape from CTU headquarters that is delayed when Teri intrudes. David Palmer makes a final decision concerning Sherry.</font></td></tr><tr bgcolor='#FFFFFF' id="brow"><td class='b2'><u><b>Guest Stars: </b></u><a href='/person/id-1046/Xander+Berkeley'>Xander Berkeley</a> as George Mason, <a href='/person/id-10013/Zeljko+Ivanek'>Zeljko Ivanek</a> as Andre Drazen, <a href='/person/id-1154/Terrell+Tilford'>Terrell Tilford</a> as Paul Wilson, <a href='/person/id-49206/Karina+Arroyave'>Karina Arroyave</a> as Jamey Farrell, <a href='/person/id-1044/Penny+Johnson+Jerald'>Penny Johnson Jerald</a> as Sherry Palmer, <a href='/person/id-1139/Dennis+Hopper'>Dennis Hopper</a> as Victor Drazen, <a href='/person/id-51174/Carlos+Bernard'>Carlos Bernard (2)</a> as Tony Almeida, <a href='/person/id-23977/Kevin+Chapman'>Kevin Chapman</a> as Coast Guard Officer, <a href='/person/id-28007/Jude+Ciccolella'>Jude Ciccolella</a> as Mike Novick, | <u><b>Co-Guest Stars: </b></u><a href='/person/id-1153/Tico+Wells'>Tico Wells</a> as Karris, <a href='/person/id-1152/Jane+Yamamoto'>Jane Yamamoto</a> as Field Reporter, <a href='/person/id-1144/Endre+Hules'>Endre Hules</a> as Serge, <a href='/person/id-52211/Reynaldo+Gallegos'>Reynaldo Gallegos (1)</a> as Sergeant Devlin</td></Tr><tr bgcolor='#E7E7CB' id="brow"><td class='b2'><u><b>Director: </b></u><a href='/person/id-1048/Stephen+Hopkins'>Stephen Hopkins</a><br><u><b>Writer: </b></u><a href='/person/id-1049/Joel+Surnow'>Joel Surnow</a>, <a 
href='/person/id-1058/Michael+Loceff'>Michael Loceff</a></td></Tr></table></td></tr><tr><td height='5'>&nbsp;</td></tr></table></td>

Within this text, I need to extract the following fields: ID, Series, Episode, Title, Airdate, Plot, Cast and Crew.

The following RegEx successfully extracts the fields ID-Plot with no problems:

(?i).+?s/(\d{1,7}).+?>(?:\d+ :|)(\d{2}|)(?:x|)(\d{2}|)(?: - |)(.+?)</a> .(\D{3}/\d{2}/\d{4})..+?k.>(.+?)</font>

But then comes the tricky bit.

Cast members are formatted as, e.g.

<a href='/person/id-1046/Xander+Berkeley'>Xander Berkeley</a> as George Mason

Adding
.+?(<a href=.+?</td>)
to the end of the RegEx simply extracts one long string, but is there a more elegant solution whereby I can extract <name> as <name>, <name> as <name>, ...?

Could someone give me a pointer as to where to go with this (if it is feasible)?

Regards

DG

TVRage.txt


Hi,

Did you try deleting all the HTML tags and then splitting the text, or just giving _IEBodyReadText a try?
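
A minimal sketch of the "delete the tags, then split" idea, assuming the guest-star cell has already been isolated. The variable names and the shortened two-entry input are made up for illustration; only StringRegExpReplace and StringSplit are real AutoIt functions here.

```autoit
; Sketch only: $sRow is a shortened, hypothetical stand-in for the guest-star cell.
Local $sRow = "<a href='/person/id-1046/Xander+Berkeley'>Xander Berkeley</a> as George Mason, " & _
        "<a href='/person/id-10013/Zeljko+Ivanek'>Zeljko Ivanek</a> as Andre Drazen"

; Remove every HTML tag, leaving only the visible text.
Local $sPlain = StringRegExpReplace($sRow, "<[^>]*>", "")

; Split on ", "; flag 1 treats the whole delimiter string as a single separator.
Local $aCast = StringSplit($sPlain, ", ", 1)

; With the default return format, $aCast[0] holds the number of elements.
For $i = 1 To $aCast[0]
    ConsoleWrite($aCast[$i] & @CRLF)
Next
```

This only works cleanly if no name contains ", " itself, and the real page would still need the "Guest Stars:" and "Co-Guest Stars:" labels handled separately.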

Mega


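; Read the saved page source (the attached TVRage.txt) from the desktop
Local $s_text = FileRead(@DesktopDir & "\TVRage.txt")
; Groups: (1) text up to and including the person link's opening tag, (2) actor name, (3) optional "(n)" suffix plus </a>, (4) " as <character>"
Local $s_pattern = "(?i)(.*?href='/person/id-\d+/.+?'>)([\w\s]+)((?:\s*\(\d+\))*</a>)([\w\s]+)(?:\z|,|</td>.+?\z)"
; Keep only groups 2 and 4, append ", " after each entry, then trim the final ", "
$s_text = StringTrimRight(StringRegExpReplace($s_text,  $s_pattern, "\2\4, "), 2)
MsgBox(0, 0, $s_text)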
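
If the pairs are wanted as separate array elements rather than one joined string, a possible alternative (not the approach above) is StringRegExp with flag 3, which returns the captured groups of every match in one flat array. The pattern and the shortened input below are my own assumptions written against the sample line, so treat this as a sketch to adapt rather than a tested solution for the full page.

```autoit
; Sketch only: $sRow is a shortened, hypothetical stand-in for the cast cell.
Local $sRow = "<a href='/person/id-1046/Xander+Berkeley'>Xander Berkeley</a> as George Mason, " & _
        "<a href='/person/id-51174/Carlos+Bernard'>Carlos Bernard (2)</a> as Tony Almeida"

; Group 1 = actor name (an optional "(n)" suffix is skipped), group 2 = character name.
Local $sPattern = "(?i)<a href='/person/id-\d+/[^']+'>([^<(]+?)\s*(?:\(\d+\))?</a> as ([^,<|]+)"

; Flag 3 = return an array of global matches: actor, character, actor, character, ...
Local $aPairs = StringRegExp($sRow, $sPattern, 3)

If @error Then
    ConsoleWrite("No cast entries found" & @CRLF)
Else
    ; Walk the flat array two elements at a time.
    For $i = 0 To UBound($aPairs) - 2 Step 2
        ConsoleWrite($aPairs[$i] & " as " & $aPairs[$i + 1] & @CRLF)
    Next
EndIf
```

Because the pattern requires the literal " as " after </a>, links that are not followed by a character name (such as the Director and Writer links in the sample) should not be picked up.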

Edited by SmOke_N

@Xenobiologist

I didn't mention that I'm getting the data using _INetGetSource - sorry about that.

Wow! That certainly does the trick!

Would you be so kind as to confirm what I *think* is happening?

  • (?i) Case-insensitivity flag
  • [0] (.*?href='/person/id-\d+/.+?'>) Match the 'href.../nnn/...>' string
  • [1] ([\w\s]+) Match any "word" OR whitespace character AND repeat the previous set 1 or more times (Actor Name) until...
  • [2] ((?:\s*\(\d+\))*</a>) Non-capturing group: whitespace characters but match any digit (0-9) repeated 1 or more times terminated by </a>
  • [3] ([\w\s]+) Match any "word" OR whitespace character AND repeat the previous set 1 or more times (Character Name) until...
  • [4] (?:\z|,|</td>.+?\z) Non-capturing group: match only at end of string OR Comma OR </td> & match only at end of string
  • Finally, replace groups 2 & 4 with " "
I'm struggling a bit with number 4: it appears to be matching on the digits contained in the 'href' string?

Kind regards

DG

[4] is basically saying to end each match at the end of the string, at a comma, or at </td> followed by the rest of the string.

[*] is close. What I'm telling it is to replace everything else that was matched with nothing, and to keep groups 2 and 4 followed by a comma and a space (group 4 is the " as First Last" part).
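
To see which text each numbered group captures, something like the sketch below can help. It runs the same pattern against a single, shortened entry (my own test input, not the full file) and prints each captured group in brackets. Note that the replacement string counts capturing groups from 1, so \2 is the actor name and \4 is the " as ..." part, while the final (?:...) group is non-capturing and has no number of its own.

```autoit
; Sketch only: $sOne is a hypothetical one-entry input for inspecting the groups.
Local $sOne = "<a href='/person/id-51174/Carlos+Bernard'>Carlos Bernard (2)</a> as Tony Almeida"
Local $sPattern = "(?i)(.*?href='/person/id-\d+/.+?'>)([\w\s]+)((?:\s*\(\d+\))*</a>)([\w\s]+)(?:\z|,|</td>.+?\z)"

; Flag 1 returns the captured groups of the first match as a 0-based array.
Local $aGroups = StringRegExp($sOne, $sPattern, 1)
If Not @error Then
    For $i = 0 To UBound($aGroups) - 1
        ; Element $i of the array corresponds to back-reference \($i + 1).
        ConsoleWrite("Group " & ($i + 1) & ": [" & $aGroups[$i] & "]" & @CRLF)
    Next
EndIf

; The same replacement as in the script above, applied to the single entry.
ConsoleWrite(StringTrimRight(StringRegExpReplace($sOne, $sPattern, "\2\4, "), 2) & @CRLF)
```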
