werd, posted April 17, 2012

Been stumped on this for a while and would appreciate any guidance. I need to parse some HTML using a regex. Here's the test HTML:

<TD id=AB44-itmz><A style="TEXT-ALIGN: center" id=AB45 class=urBtnStd tabIndex=0 href="javascript:void(0);" ct="B" ti="0"><IMG style="MARGIN-TOP: 0px" id=WD27-img border=0 align=absMiddle src="http://asdf.com/search_button.gif"> <SPAN class=urBtnCntTxt>Search</SPAN></A></SPAN><SPAN id=WD28-rtbi class=urTbarItmBtn canHide="false"><A style="TEXT-ALIGN: center" id=DD17 class=urBtnStd tabIndex=0 href="javascript:void(0);" ct="B" ti="0" lsevents="{Press:[{ResponseData:'delta',ClientAction:'submit'},{}]}" lsdata="{0:'STANDARD',1:'',2:true,3:false,4:false,5:true,6:'',7:';aads/request_button.gif',8:'',9:false,10:'',11:false,12:'',13:'NONE',14:false,15:false,16:'',17:'NONE'}"><IMG style="MARGIN-TOP: 0px" id=WD28-img border=0 align=absMiddle src="http://asdf.com/request_button.gif"> <SPAN class=urBtnCntTxt>Request</SPAN></A></SPAN><SPAN id=WD29-rtbi class=urTbarItmBtn canHide="false"><A style="TEXT-ALIGN: center" id=WD29 class=urBtnStd tabIndex=0 href="javascript:void(0);" ct="B" ti="0" lsevents="{Press:[{ResponseData:'delta',ClientAction:'submit'},{}]}" lsdata="{0:'STANDARD',1:'',2:true,3:false,4:false,5:false,6:'',7:'',8:'',9:false,10:'',11:false,12:'',13:'NONE',14:false,15:false,16:'',17:'NONE'}"><SPAN class=urBtnCntTxt>Merchant info.</SPAN></A></SPAN></TD></TR></TBODY></TABLE></DIV>

I'm trying to grab the "id=...." value (exactly four alphanumeric characters after the equals sign, followed by a space) that most closely precedes the text "Request". Here's the pattern I'm using:

\bid=([A-Z]{2}[0-9]{2,6})\s\b(?=class=urBtnStd .*Request)

I know this is incorrect because it returns both "id=AB45" and "id=DD17". How can I grab just "id=DD17", which corresponds to the "Request" text?

Thanks in advance,
wd
GEOSoft, posted April 18, 2012

This should always get the last one before the Request text:

"(?i)<td.+\bid=([[:alpha:]]{2}[0-9]{2,6})\s\b(?=class=urBtnStd .*Request)"

George
Signature: The PCRE (Regular Expression) ToolKit for AutoIT (updated Oct 20, 2011, ver. 3.0.1.13). Please update your current version before filing any bug reports. The installer now includes both 32 and 64 bit versions.
werd (author), posted April 18, 2012

Thanks, that works, but I'm frustrated that I don't understand why it works. Can you shed any light on how the "<td.+" makes this work?
JohnQSmith, posted April 19, 2012

Here's how RegexBuddy explains it.

Match the remainder of the regex with the options: case insensitive (i) «(?i)»
Match the characters “<td” literally «<td»
Match any single character that is not a line break character «.+»
    Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
Assert position at a word boundary «\b»
Match the characters “id=” literally «id=»
Match the regular expression below and capture its match into backreference number 1 «([[:alpha:]]{2}[0-9]{2,6})»
    Match a single character present in the list below «[[:alpha:]]{2}»
        Exactly 2 times «{2}»
        A character in the POSIX character class “alpha” «[:alpha:]»
    Match a single character in the range between “0” and “9” «[0-9]{2,6}»
        Between 2 and 6 times, as many times as possible, giving back as needed (greedy) «{2,6}»
Match a single character that is a “whitespace character” (spaces, tabs, and line breaks) «\s»
Assert position at a word boundary «\b»
Assert that the regex below can be matched, starting at this position (positive lookahead) «(?=class=urBtnStd .*Request)»
    Match the characters “class=urBtnStd ” literally «class=urBtnStd »
    Match any single character that is not a line break character «.*»
        Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
    Match the characters “Request” literally «Request»
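Editor's sketch: the same breakdown can be written as a commented pattern. This is a Python illustration, not the AutoIt code from the thread, and it rests on a few assumptions: Python's re module has no POSIX classes, so [a-z] with the IGNORECASE flag stands in for [[:alpha:]], and the sample string is a made-up, much shorter stand-in for the poster's HTML.

```python
import re

# Annotated version of the pattern from the thread (Python flavour).
# Assumption: [a-z] + IGNORECASE replaces [[:alpha:]], which Python lacks.
PATTERN = re.compile(
    r"""
    <td                              # the literal text <td
    .+                               # greedy: as many non-newline characters as possible
    \bid=                            # word boundary, then the literal text id=
    ([a-z]{2}[0-9]{2,6})             # group 1: two letters, then two to six digits
    \s\b                             # one whitespace character, then a word boundary
    (?=class=urBtnStd[ ].*Request)   # lookahead: checked but not consumed
    """,
    re.IGNORECASE | re.VERBOSE,
)

# Hypothetical one-button sample, far shorter than the real page.
sample = '<TD id=AB44-itmz><A id=DD17 class=urBtnStd href="#">Request</A></TD>'

match = PATTERN.search(sample)
captured = match.group(1) if match else None
whole = match.group(0) if match else None

print(captured)  # the value of capture group 1
print(whole)     # the overall match stops before the lookahead text
```

Because the lookahead only asserts, the overall match ends right after the space that follows the id; the class=urBtnStd text is checked but never becomes part of the match.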
GEOSoft, posted April 19, 2012

The '<td' wasn't really important. It's only there because the match could also be found elsewhere on the page, and we didn't get to see the whole page.

The '.+' is what does the work. It is greedy, so it first swallows everything to the end of the line and then gives characters back one at a time until the rest of the pattern can match. That means it skips everything up to the last place where the rest of the pattern matches.

The PCRE toolkit in my signature will let you load a web page and test expressions against it, so you can see the results before you commit the code to your script. When you have it working the way you want, you can export it directly to SciTE by clicking the blue button at the upper right, provided SciTE is set as the export mode in Options.

George
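Editor's sketch: the effect of the greedy '.+' can be checked with a short Python script. Everything here is illustrative and hedged: the HTML is a trimmed stand-in for the test markup in the first post (most attributes dropped), [a-z] replaces [[:alpha:]] because Python's re module lacks POSIX classes, and the lazy variant is added only for contrast; it is not from the thread.

```python
import re

# Trimmed stand-in for the test HTML in the first post (assumption:
# dropping the extra attributes does not change which ids can match).
HTML = (
    '<TD id=AB44-itmz><A style="TEXT-ALIGN: center" id=AB45 class=urBtnStd tabIndex=0>'
    '<IMG id=WD27-img src="http://asdf.com/search_button.gif"> '
    '<SPAN class=urBtnCntTxt>Search</SPAN></A></SPAN>'
    '<SPAN id=WD28-rtbi class=urTbarItmBtn><A style="TEXT-ALIGN: center" id=DD17 class=urBtnStd tabIndex=0>'
    '<IMG id=WD28-img src="http://asdf.com/request_button.gif"> '
    '<SPAN class=urBtnCntTxt>Request</SPAN></A></SPAN>'
    '<SPAN id=WD29-rtbi class=urTbarItmBtn><A style="TEXT-ALIGN: center" id=WD29 class=urBtnStd tabIndex=0>'
    '<SPAN class=urBtnCntTxt>Merchant info.</SPAN></A></SPAN></TD>'
)

# The pattern from the question: every id that has "Request" anywhere after it.
ORIGINAL = r'\bid=([A-Z]{2}[0-9]{2,6})\s\b(?=class=urBtnStd .*Request)'
# The pattern from the answer: the greedy .+ backtracks from the end of the line.
GREEDY = r'(?i)<td.+\bid=([a-z]{2}[0-9]{2,6})\s\b(?=class=urBtnStd .*Request)'
# Contrast only: a lazy .+? expands from the start and stops at the first candidate.
LAZY = r'(?i)<td.+?\bid=([a-z]{2}[0-9]{2,6})\s\b(?=class=urBtnStd .*Request)'

original_ids = re.findall(ORIGINAL, HTML)
greedy_id = re.search(GREEDY, HTML).group(1)
lazy_id = re.search(LAZY, HTML).group(1)

print(original_ids)  # both candidates: the lookahead is true for each of them
print(greedy_id)     # last candidate that still has "Request" after it
print(lazy_id)       # first candidate
```

WD29 is never returned because no "Request" text follows it, so the lookahead fails there and the greedy version settles on DD17. Note that this relies on the markup being on one line, since '.' does not cross line breaks by default.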
werd (author), posted April 19, 2012

Thanks much, I get it now.