DobraGolonka Posted August 20, 2008

Hi

I have the following sample line of text from a TV episode guide, which is also contained in the attached file:

<tr style="background: url(/_layout_v3/misc/tabletopicbg.jpg) repeat;" ><td class='b2'><a class='wlink' href='/24/episodes/613/01x24'>24 :01x24 - 11:00 P.M.-12:00 A.M.</a> (May/21/2002)<tr style='background: #EEEEEE; color: #000000;' ><td class='b2'><font color='black'>Sherry Palmer makes a move in her discontent over David's faked death. Jack offers himself in exchange for his daughter's life and squares off against the Drazens. Nina, the second mole, makes a hasty escape from CTU headquarters that is delayed when Teri intrudes. David Palmer makes a final decision concerning Sherry.</font></td></tr><tr bgcolor='#FFFFFF' id="brow"><td class='b2'><u><b>Guest Stars: </b></u><a href='/person/id-1046/Xander+Berkeley'>Xander Berkeley</a> as George Mason, <a href='/person/id-10013/Zeljko+Ivanek'>Zeljko Ivanek</a> as Andre Drazen, <a href='/person/id-1154/Terrell+Tilford'>Terrell Tilford</a> as Paul Wilson, <a href='/person/id-49206/Karina+Arroyave'>Karina Arroyave</a> as Jamey Farrell, <a href='/person/id-1044/Penny+Johnson+Jerald'>Penny Johnson Jerald</a> as Sherry Palmer, <a href='/person/id-1139/Dennis+Hopper'>Dennis Hopper</a> as Victor Drazen, <a href='/person/id-51174/Carlos+Bernard'>Carlos Bernard (2)</a> as Tony Almeida, <a href='/person/id-23977/Kevin+Chapman'>Kevin Chapman</a> as Coast Guard Officer, <a href='/person/id-28007/Jude+Ciccolella'>Jude Ciccolella</a> as Mike Novick, | <u><b>Co-Guest Stars: </b></u><a href='/person/id-1153/Tico+Wells'>Tico Wells</a> as Karris, <a href='/person/id-1152/Jane+Yamamoto'>Jane Yamamoto</a> as Field Reporter, <a href='/person/id-1144/Endre+Hules'>Endre Hules</a> as Serge, <a href='/person/id-52211/Reynaldo+Gallegos'>Reynaldo Gallegos (1)</a> as Sergeant Devlin</td></Tr><tr bgcolor='#E7E7CB' id="brow"><td class='b2'><u><b>Director: </b></u><a href='/person/id-1048/Stephen+Hopkins'>Stephen Hopkins</a><br><u><b>Writer: </b></u><a href='/person/id-1049/Joel+Surnow'>Joel Surnow</a>, <a href='/person/id-1058/Michael+Loceff'>Michael Loceff</a></td></Tr></table></td></tr><tr><td height='5'> </td></tr></table></td>

Within this text, I need to extract the following fields: ID, Series, Episode, Title, Airdate, Plot, Cast and Crew. The following RegEx successfully extracts the fields ID through Plot with no problems:

(?i).+?s/(\d{1,7}).+?>(?:\d+ :|)(\d{2}|)(?:x|)(\d{2}|)(?: - |)(.+?)</a> .(\D{3}/\d{2}/\d{4})..+?k.>(.+?)</font>

But then comes the tricky bit. Cast members are formatted as, e.g.

<a href='/person/id-1046/Xander+Berkeley'>Xander Berkeley</a> as George Mason

Adding .+?(<a href=.+?</td>) to the end of the RegEx simply extracts one long string, but is there a more elegant solution whereby I can extract <name> as <name>, <name> as <name>, ...? Could someone give me a pointer as to where to go on this (if it is feasible)?

Regards
DG

TVRage.txt
Xenobiologist Posted August 20, 2008

Hi,

did you try deleting all the HTML tags and then splitting the text? Or just give _IEBodyReadText a try.

Mega
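The tag-stripping suggestion above could be sketched roughly as follows. This is a minimal illustration, not code from the thread: the shortened `$sRow` sample, the variable names, and the assumption that entries are separated by ", " are all mine, and a character name that itself contains a comma would be split incorrectly.

```autoit
; Hypothetical, shortened sample of the "Guest Stars" cell from the question.
Local $sRow = "<u><b>Guest Stars: </b></u><a href='/person/id-1046/Xander+Berkeley'>Xander Berkeley</a> as George Mason, " & _
        "<a href='/person/id-10013/Zeljko+Ivanek'>Zeljko Ivanek</a> as Andre Drazen</td>"

; Step 1: delete every HTML tag, i.e. anything between "<" and the next ">".
Local $sPlain = StringRegExpReplace($sRow, "<[^>]*>", "")

; Step 2: drop the leading "Guest Stars:" label so only the cast list remains.
$sPlain = StringRegExpReplace($sPlain, "^\s*Guest Stars:\s*", "")

; Step 3: split on the two-character delimiter ", ".
; Flag 1 makes StringSplit treat the whole delimiter string as one separator.
Local $aCast = StringSplit($sPlain, ", ", 1)

; $aCast[0] holds the number of pieces; the pieces start at index 1.
For $i = 1 To $aCast[0]
    ConsoleWrite($aCast[$i] & @CRLF)
Next
```

Each array element then holds one "<actor> as <character>" entry. Since the page source is fetched as raw text, this works without the IE functions, but it relies on the cast cell having been isolated first.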
Moderators SmOke_N Posted August 20, 2008 (edited)

Local $s_text = FileRead(@DesktopDir & "\TVRage.txt")
Local $s_pattern = "(?i)(.*?href='/person/id-\d+/.+?'>)([\w\s]+)((?:\s*\(\d+\))*</a>)([\w\s]+)(?:\z|,|</td>.+?\z)"
$s_text = StringTrimRight(StringRegExpReplace($s_text, $s_pattern, "\2\4, "), 2)
MsgBox(0, 0, $s_text)

Edited August 20, 2008 by SmOke_N

Common sense plays a role in the basics of understanding AutoIt... If you're lacking in that, do us all a favor, and step away from the computer.
DobraGolonka Posted August 20, 2008 Author

@Xenobiologist
I didn't mention that I'm getting the data using _INetGetSource - sorry about that.

Local $s_text = FileRead(@DesktopDir & "\TVRage.txt")
Local $s_pattern = "(?i)(.*?href='/person/id-\d+/.+?'>)([\w\s]+)((?:\s*\(\d+\))*</a>)([\w\s]+)(?:\z|,|</td>.+?\z)"
$s_text = StringTrimRight(StringRegExpReplace($s_text, $s_pattern, "\2\4, "), 2)
MsgBox(0, 0, $s_text)

Wow! That certainly does the trick! Would you be so kind as to confirm what I *think* is happening?

(?i) Case-insensitivity flag
[0] (.*?href='/person/id-\d+/.+?'>) Match the 'href.../nnn/...>' string
[1] ([\w\s]+) Match any "word" OR whitespace character AND repeat the previous set 1 or more times (Actor Name) until...
[2] ((?:\s*\(\d+\))*</a>) Non-capturing group: whitespace characters but match any digit (0-9) repeated 1 or more times terminated by </a>
[3] ([\w\s]+) Match any "word" OR whitespace character AND repeat the previous set 1 or more times (Character Name) until...
[4] (?:\z|,|</td>.+?\z) Non-capturing group: match only at end of string OR Comma OR </td> & match only at end of string
Finally, replace groups 2 & 4 with " "

I'm struggling a bit with number 4 - it appears to be matching on the digits contained in the 'href' string??

Kind regards
DG
Moderators SmOke_N Posted August 20, 2008

[4] is basically saying to end the match at the end of the string, at a comma, or at </td> followed by the rest of the string.

Your "Finally" step is close. What I'm telling it to do is replace everything else that matched with nothing, and keep groups 2 and 4, followed by ", " (group 4 being the " as Name Last Name" part).
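As an alternative to the replace-based approach discussed in this thread, the actor and character names can also be captured directly as pairs. The sketch below is an illustration under assumptions, not SmOke_N's code: the pattern, the shortened sample string, and the variable names are hypothetical, and it assumes a character name never contains a comma, "<" or "|".

```autoit
; Hypothetical, shortened sample of two cast entries from the question.
Local $sHtml = "<a href='/person/id-1046/Xander+Berkeley'>Xander Berkeley</a> as George Mason, " & _
        "<a href='/person/id-51174/Carlos+Bernard'>Carlos Bernard (2)</a> as Tony Almeida</td>"

; Group 1: actor name, without an optional " (n)" suffix before </a>.
; Group 2: character name, up to the next comma, tag or "|" separator.
Local $sPattern = "(?i)<a href='/person/id-\d+/[^']+'>([^<(]+?)(?:\s*\(\d+\))?</a> as ([^,<|]+)"

; Flag 3 returns a flat array of every captured group from every match:
; [actor1, character1, actor2, character2, ...]
Local $aPairs = StringRegExp($sHtml, $sPattern, 3)

If @error Then
    ConsoleWrite("No cast entries found" & @CRLF)
Else
    ; Walk the flat array two elements at a time.
    For $i = 0 To UBound($aPairs) - 2 Step 2
        ConsoleWrite($aPairs[$i] & " as " & $aPairs[$i + 1] & @CRLF)
    Next
EndIf
```

Because each link must be followed by " as ", the Director and Writer links in the sample row are skipped. Using character classes such as `[^<(]` instead of `[\w\s]` is a design choice that also lets names containing hyphens, apostrophes or periods through.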