werd

specific regex help


Been stumped on this for a while and would appreciate any guidance...

I need to parse some HTML using REGEX, here's the test html:

<TD id=AB44-itmz><A style="TEXT-ALIGN: center" id=AB45 class=urBtnStd tabIndex=0 href="javascript:void(0);" ct="B" ti="0"><IMG style="MARGIN-TOP: 0px" id=WD27-img border=0 align=absMiddle src="http://asdf.com/search_button.gif">&nbsp;<SPAN class=urBtnCntTxt>Search</SPAN></A></SPAN><SPAN id=WD28-rtbi class=urTbarItmBtn canHide="false"><A style="TEXT-ALIGN: center" id=DD17 class=urBtnStd tabIndex=0 href="javascript:void(0);" ct="B" ti="0" lsevents="{Press:[{ResponseData:'delta',ClientAction:'submit'},{}]}" lsdata="{0:'STANDARD',1:'',2:true,3:false,4:false,5:true,6:'',7:';aads/request_button.gif',8:'',9:false,10:'',11:false,12:'',13:'NONE',14:false,15:false,16:'',17:'NONE'}"><IMG style="MARGIN-TOP: 0px" id=WD28-img border=0 align=absMiddle src="http://asdf.com/request_button.gif">&nbsp;<SPAN class=urBtnCntTxt>Request</SPAN></A></SPAN><SPAN id=WD29-rtbi class=urTbarItmBtn canHide="false"><A style="TEXT-ALIGN: center" id=WD29 class=urBtnStd tabIndex=0 href="javascript:void(0);" ct="B" ti="0" lsevents="{Press:[{ResponseData:'delta',ClientAction:'submit'},{}]}" lsdata="{0:'STANDARD',1:'',2:true,3:false,4:false,5:false,6:'',7:'',8:'',9:false,10:'',11:false,12:'',13:'NONE',14:false,15:false,16:'',17:'NONE'}"><SPAN class=urBtnCntTxt>Merchant info.</SPAN></A></SPAN></TD></TR></TBODY></TABLE></DIV>

And I'm trying to grab the "id=.... " value (only 4 alphanumeric characters after the equals sign, followed by a space) that most closely precedes the text "Request". Here's the code I'm using:

\bid=([A-Z]{2}[0-9]{2,6})\s\b(?=class=urBtnStd .*Request)

Right now, I know this is incorrect because it's returning both "id=AB45" and "id=DD17". How can I grab just the "id=DD17" which will correspond to the "Request" text?

thanks in advance,

wd
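
The behaviour described in the question can be reproduced with a minimal sketch. It uses Python's `re` module rather than AutoIt, purely as an assumption for illustration. The `html` string is a trimmed copy of the test HTML (the long `lsevents`/`lsdata` attributes are dropped), and the variable names are hypothetical. The lookahead only asserts that "Request" appears somewhere later on the line, so every qualifying id that has "Request" anywhere after it is returned.

```python
import re

# Trimmed copy of the test HTML from the question (long attributes dropped).
html = (
    '<TD id=AB44-itmz><A style="TEXT-ALIGN: center" id=AB45 class=urBtnStd tabIndex=0>'
    '<IMG id=WD27-img src="http://asdf.com/search_button.gif">'
    '<SPAN class=urBtnCntTxt>Search</SPAN></A></SPAN>'
    '<SPAN id=WD28-rtbi class=urTbarItmBtn>'
    '<A style="TEXT-ALIGN: center" id=DD17 class=urBtnStd tabIndex=0>'
    '<IMG id=WD28-img src="http://asdf.com/request_button.gif">'
    '<SPAN class=urBtnCntTxt>Request</SPAN></A></SPAN>'
    '<SPAN id=WD29-rtbi class=urTbarItmBtn>'
    '<A style="TEXT-ALIGN: center" id=WD29 class=urBtnStd tabIndex=0>'
    '<SPAN class=urBtnCntTxt>Merchant info.</SPAN></A></SPAN></TD>'
)

# The pattern from the question, unchanged.
original = re.compile(r'\bid=([A-Z]{2}[0-9]{2,6})\s\b(?=class=urBtnStd .*Request)')

# findall returns the captured group for every match on the line.
ids = original.findall(html)
print(ids)  # ['AB45', 'DD17']
```

"WD29" is not returned because no "Request" text follows it, while both "AB45" and "DD17" still have "Request" somewhere to their right.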

This should always get the last one before the "Request" text.

"(?i)<td.+\bid=([[:alpha:]]{2}[0-9]{2,6})\s\b(?=class=urBtnStd .*Request)"


George

Question about decompiling code? Read the decompiling FAQ and don't bother posting the question in the forums.

Be sure to read and follow the forum rules. -AKA the AutoIt Reading and Comprehension Skills test.***

The PCRE (Regular Expression) ToolKit for AutoIT - (Updated Oct 20, 2011 ver:3.0.1.13) - Please update your current version before filing any bug reports. The installer now includes both 32 and 64 bit versions. No change in version number.

Visit my Blog .. currently not active but it will soon be resplendent with news and views. Also please remove any links you may have to my website. it is soon to be closed and replaced with something else.

"Old age and treachery will always overcome youth and skill!"


Thanks, that works... but I'm frustrated that I don't understand why it works. Can you shed any light on how the "<td.+" makes this work?


Here's how RegexBuddy explains it.

Match the remainder of the regex with the options: case insensitive (i) «(?i)»
Match the characters “<td” literally «<td»
Match any single character that is not a line break character «.+»
   Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
Assert position at a word boundary «\b»
Match the characters “id=” literally «id=»
Match the regular expression below and capture its match into backreference number 1 «([[:alpha:]]{2}[0-9]{2,6})»
   Match a single character present in the list below «[[:alpha:]]{2}»
      Exactly 2 times «{2}»
      A character in the POSIX character class “alpha” «[:alpha:]»
   Match a single character in the range between “0” and “9” «[0-9]{2,6}»
      Between 2 and 6 times, as many times as possible, giving back as needed (greedy) «{2,6}»
Match a single character that is a “whitespace character” (spaces, tabs, and line breaks) «\s»
Assert position at a word boundary «\b»
Assert that the regex below can be matched, starting at this position (positive lookahead) «(?=class=urBtnStd .*Request)»
   Match the characters “class=urBtnStd ” literally «class=urBtnStd »
   Match any single character that is not a line break character «.*»
      Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
   Match the characters “Request” literally «Request»

Whenever someone says "pls" because it's shorter than "please", I say "no" because it's shorter than "yes".


The '<td' wasn't really important. It's only there because the match could also be found elsewhere on the page, and we didn't get to see the whole page.

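The '.+' just told it to skip everything up to the last match. It is greedy: it first consumes everything to the end of the line, then backtracks one character at a time until the rest of the pattern matches, so the id that gets captured is the last one that is still followed by "Request".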
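
Here is a minimal sketch of that greedy-versus-lazy difference, again in Python as an assumption for illustration rather than AutoIt. Python's `re` module has no POSIX classes, so `[A-Za-z]` is swapped in for `[[:alpha:]]`. The `html` string is a trimmed copy of the test HTML, and the names `greedy` and `lazy` are hypothetical.

```python
import re

# Trimmed copy of the test HTML from the question (long attributes dropped).
html = (
    '<TD id=AB44-itmz><A style="TEXT-ALIGN: center" id=AB45 class=urBtnStd tabIndex=0>'
    '<IMG id=WD27-img src="http://asdf.com/search_button.gif">'
    '<SPAN class=urBtnCntTxt>Search</SPAN></A></SPAN>'
    '<SPAN id=WD28-rtbi class=urTbarItmBtn>'
    '<A style="TEXT-ALIGN: center" id=DD17 class=urBtnStd tabIndex=0>'
    '<IMG id=WD28-img src="http://asdf.com/request_button.gif">'
    '<SPAN class=urBtnCntTxt>Request</SPAN></A></SPAN>'
    '<SPAN id=WD29-rtbi class=urTbarItmBtn>'
    '<A style="TEXT-ALIGN: center" id=WD29 class=urBtnStd tabIndex=0>'
    '<SPAN class=urBtnCntTxt>Merchant info.</SPAN></A></SPAN></TD>'
)

# Shared tail of the pattern; [A-Za-z] stands in for [[:alpha:]] (assumption:
# Python's re has no POSIX character classes).
tail = r'\bid=([A-Za-z]{2}[0-9]{2,6})\s\b(?=class=urBtnStd .*Request)'

# Greedy prefix: '.+' runs to the end of the line and backtracks, so the
# first success is the LAST id that still has "Request" after it.
greedy = re.search(r'(?i)<td.+' + tail, html)

# Lazy prefix: '.+?' expands one character at a time, so the first success
# is the FIRST id that has "Request" anywhere after it.
lazy = re.search(r'(?i)<td.+?' + tail, html)

greedy_id = greedy.group(1)
lazy_id = lazy.group(1)
print(greedy_id)  # DD17
print(lazy_id)    # AB45
```

One caveat on this approach: it picks the last qualifying id on the line that still has "Request" after it, so a second "Request" button further along would change the result.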

The PCRE toolkit in my signature lets you load a web page and test your expressions against it before you commit the code to your script. When it works the way you want, you can export it directly to SciTE by clicking the blue button at the upper right, provided SciTE is set as the export mode in the options.


George



Thanks much, I get it now.
