footswitch

Regular Expression syntax question

15 posts in this topic

Hello there,

I need to get all the "function('random_string')" INSIDE the <script></script> tags and WITHOUT the "{" character.

So I'm trying to accomplish something in between these two:

#include <array.au3>

$html="{function('00')--<script--{function('11')--function('22')--function('33')--</script>--<script--{function('44')--function('55')--function('66')--</script>--function('77')"
$array1=StringRegExp ($html, "(?s)(?i)<script.+function\('(.*?)'\).+</SCRIPT>",3)
If @error = 1 Then ConsoleWrite("-> No matches for first RegExp!"&@CRLF)
; RegExp explained:
    ; sets flag "(?s)", which means: "." matches any character including newline
    ; sets flag "(?i)", which means: case insensitive
    ; looks for "<script", then a bunch of text must be present - ".+" - until it finds the last occurrence of "function('random string here')" - why only the last occurrence?
    ; then another bunch of text must be present and finally "</SCRIPT>" needs to appear after it all
_ArrayDisplay($array1)

$array2=StringRegExp ($html, "(?s)(?i)[^\{]function\('(.*?)'\)",3)
If @error = 1 Then ConsoleWrite("-> No matches for second RegExp!"&@CRLF)
; RegExp explained:
    ; find every occurrence of "function('random string here')" that doesn't have a preceding "{"
_ArrayDisplay($array2)

I'm not quite the RegExp Guru ;)

Any thoughts?

Thanks,

footswitch





I'm still not entirely sure what you're after here. In this case, are you just after functions 11 through 66 (omitting functions 00 and 77) or are you after all functions?

In your StringRegExp, you'll want a "?" after "[^\{]", as your bracket does not always appear, at least in your $html string as you have it now.

Also, you'll probably want to check out the String Regular Expression Tester.
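To see what the optional "?" actually changes, here is a small sketch that runs both variants against the test string from the first post. The variable names ($sHtml, $aStrict, $aOptional) are mine, not from the thread.

```autoit
#include <Array.au3>

; Test string copied from the first post.
Local $sHtml = "{function('00')--<script--{function('11')--function('22')--function('33')--</script>--<script--{function('44')--function('55')--function('66')--</script>--function('77')"

; Original pattern: exactly one character that is not "{" must precede "function".
Local $aStrict = StringRegExp($sHtml, "(?si)[^\{]function\('(.*?)'\)", 3)

; Variant with "?": the preceding character becomes optional, so "{function" matches too.
Local $aOptional = StringRegExp($sHtml, "(?si)[^\{]?function\('(.*?)'\)", 3)

ConsoleWrite("strict:   " & _ArrayToString($aStrict, ",") & @CRLF)   ; 22,33,55,66,77
ConsoleWrite("optional: " & _ArrayToString($aOptional, ",") & @CRLF) ; 00,11,22,33,44,55,66,77
```

So the "?" is only useful if the brace-prefixed calls are wanted as well. Neither variant looks at the script tags, which is why 77 (and, with "?", 00) still shows up.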


You could accomplish this in two steps - the first would be to isolate the portions that have surrounding <script </script> tags. (btw, why isn't there a '>' before the </script> ?) Then you would work through each array element.

Another way is to identify the text that should come before the '--function' part of a statement. In the above scenario, you could do something like this (assuming 'script' or ')' comes before the dashes):

$aMatches=StringRegExp($html,"(?:\)|script)--\{?function.'(\d+)'",3)
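For the first step of the two-step idea (isolating the script blocks), the quantifier matters. The sketch below, with variable names of my own choosing, compares a greedy ".+" with a lazy ".+?" on the test string from the first post. It also suggests why the first pattern in the original question only returns the last function: the greedy ".+" runs to the end of the string and backtracks from there.

```autoit
#include <Array.au3>

; Test string copied from the first post.
Local $sHtml = "{function('00')--<script--{function('11')--function('22')--function('33')--</script>--<script--{function('44')--function('55')--function('66')--</script>--function('77')"

; Greedy: ".+" grabs as much as possible, so one match spans from the
; first "<script" to the LAST "</script>".
Local $aGreedy = StringRegExp($sHtml, "(?si)<script(.+)</script>", 3)

; Lazy: ".+?" stops at the FIRST "</script>" after each "<script".
Local $aLazy = StringRegExp($sHtml, "(?si)<script(.+?)</script>", 3)

; The first pattern from the original question: the greedy ".+" before
; "function" backtracks from the end, so only the last call is captured.
Local $aLast = StringRegExp($sHtml, "(?si)<script.+function\('(.*?)'\).+</script>", 3)

ConsoleWrite("greedy blocks: " & UBound($aGreedy) & @CRLF) ; 1
ConsoleWrite("lazy blocks:   " & UBound($aLazy) & @CRLF)   ; 2
ConsoleWrite("first pattern: " & _ArrayToString($aLast, ",") & @CRLF) ; 66
```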


"\w+\('\d+'\)"


Hi ;)


I'm pretty sure that you are not giving us an actual string to work with here and that makes it somewhat difficult to give you a proper answer.

Post some of the actual html code and tell us what you want for a result.


George

Question about decompiling code? Read the decompiling FAQ and don't bother posting the question in the forums.

Be sure to read and follow the forum rules. -AKA the AutoIt Reading and Comprehension Skills test.***

The PCRE (Regular Expression) ToolKit for AutoIT - (Updated Oct 20, 2011 ver:3.0.1.13) - Please update your current version before filing any bug reports. The installer now includes both 32 and 64 bit versions. No change in version number.

Visit my Blog .. currently not active but it will soon be resplendent with news and views. Also please remove any links you may have to my website. it is soon to be closed and replaced with something else.

"Old age and treachery will always overcome youth and skill!"


#6 ·  Posted (edited)

Thank you for your feedback.

I eventually ended up doing this with two steps last night. Not that it came to me at first. Sorry for not posting on time.

Anyway, it would be interesting to see a way of doing this in just one step, if such a thing is possible.

The test string provided is all we need.

I need all the functions() inside <script ... </script>, as long as they don't have a preceding "{".

So, in this case, my output should be:

function('22')
function('33')
function('55')
function('66')

@Ascend4nt, the lack of the ">" is just my natural laziness, because of the many possible scenarios like <script>, <script javascript>, <script something>...

@GMK, I believe I do NOT need a "?" after "[^\{]", because I actually want to exclude these scenarios.

These are the lines I'm currently using:

$array1=StringRegExp ($html, "(?s)(?i)<script(.+?)</SCRIPT>",3)
; next we combine the matches all together:
$string=""
For $i=0 To UBound($array1)-1
    $string&=$array1[$i]
Next
$array2=StringRegExp ($string, "(?s)(?i)[^\{]function\('(.*?)'\)",3)
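On the one-step question: a single pattern seems possible with lookarounds, though it is harder to read than the two-step version. This is only a sketch, and it assumes every "<script" has a matching "</script>" and that script blocks are never nested. The negative lookbehind rejects a preceding "{", and the lookahead requires that a "</script>" follows before any new "<script", which is how "inside a script block" is approximated. Note that it captures with [^']* instead of .*? so the match cannot stretch across a closing quote into a later call.

```autoit
#include <Array.au3>

; Test string copied from the first post.
Local $sHtml = "{function('00')--<script--{function('11')--function('22')--function('33')--</script>--<script--{function('44')--function('55')--function('66')--</script>--function('77')"

; (?<!\{)                          not preceded by "{"
; function\('([^']*)'\)            capture the text between the quotes
; (?=(?:(?!<script).)*?</script>)  a "</script>" follows with no "<script" in between
Local $sPattern = "(?si)(?<!\{)function\('([^']*)'\)(?=(?:(?!<script).)*?</script>)"

Local $aMatches = StringRegExp($sHtml, $sPattern, 3)
If @error Then
    ConsoleWrite("-> No matches!" & @CRLF)
Else
    ConsoleWrite(_ArrayToString($aMatches, ",") & @CRLF) ; 22,33,55,66
EndIf
```

The lookahead rescans the text up to the next "</script>" for every candidate, so on large pages the two-step approach may well be the faster one.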

Edited by footswitch


If the line always contains 3 function('num') in between the <script> -- </script> then this will work...

<script--{(\w+\('\d+'\))--(\w+\('\d+'\))--(\w+\('\d+'\))--.*?>


Hi ;)


#8 ·  Posted (edited)

Okay, I get your point.

I never know what I can find inside a <script> tag.

In my case, it contains several lines of code.

Among that code, function('random_text') might appear, often more than once.

I believe this would be a good test string:

$html="<html>(...)"&@CRLF& _
"   this is a function('that i dont want in the output because its outside of the script tags');"&@CRLF& _
"<script javascript>"&@CRLF& _
"few lines of code here"&@CRLF& _
"few lines of code here"&@CRLF& _
"more code and then function('FIRST TAG testing one enter"&@CRLF& _
"two and thr33.'); and etc."&@CRLF& _
"this will continue through the ages"&@CRLF& _
"and possibly there's a second function('FIRST TAG with mor3 alphanumeric chars here')"&@CRLF& _
"; and now here it is a {function('i dont want to catch this one because it starts with {');}"&@CRLF& _
"few lines of code here"&@CRLF& _
"few lines of code here"&@CRLF& _
"and finally</script>(...)"&@CRLF& _
"remember that this {function('also cant be present in the output because its outside of the script tags');(...)</html>"&@CRLF& _
"<html>(...)"&@CRLF& _
"   this is a function('that i dont want in the output because its outside of the script tags');"&@CRLF& _
"<script javascript>"&@CRLF& _
"few lines of code here"&@CRLF& _
"few lines of code here"&@CRLF& _
"more code and then function('SECOND TAG testing one enter"&@CRLF& _
"two and thr33.'); and etc."&@CRLF& _
"this will continue through the ages"&@CRLF& _
"and possibly there's a second function('SECOND TAG with mor3 alphanumeric chars here')"&@CRLF& _
"; and now here it is a {function('i dont want to catch this one because it starts with {');}"&@CRLF& _
"few lines of code here"&@CRLF& _
"few lines of code here"&@CRLF& _
"and finally</script>(...)"&@CRLF& _
"remember that this {function('also cant be present in the output because its outside of the script tags');(...)</html>"

From this, I only want this output:

(array)

0|FIRST TAG testing one entertwo and thr33.
1|FIRST TAG with mor3 alphanumeric chars here
2|SECOND TAG testing one entertwo and thr33.
3|SECOND TAG with mor3 alphanumeric chars here

The script that I posted earlier today (one RegExp over another) performs this operation successfully:

1. Get everything inside <script> tags

2. Get everything inside function(''), as long as function doesn't have a preceding {

;)

Edited by footswitch


I prefer your way too. Perhaps, somebody could write a proper UDF for nested pattern matching.


Hi ;)


#10 ·  Posted (edited)

Yeah, like

_StringRegExpNested ( ByRef $aPatterns ) ; with a virtually unlimited number of nested RegExps

Just to think about the combinations of Flags, Return values and Error values... what a mess it would be ;)
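For what it's worth, a stripped-down version of such a function is not too bad if the messy parts are simply fixed in advance. The sketch below is hypothetical and not part of any standard UDF: it takes the input string as an extra first parameter, always uses flag 3, and only reports "some level had no matches" through @error. It also relies on zero-element arrays, which I assume requires a reasonably recent AutoIt version.

```autoit
#include <Array.au3>

; Test string copied from the first post.
Local $sHtml = "{function('00')--<script--{function('11')--function('22')--function('33')--</script>--<script--{function('44')--function('55')--function('66')--</script>--function('77')"

; Outer pattern isolates the script blocks, inner pattern runs inside each block.
Local $aPatterns[2] = ["(?si)<script(.+?)</script>", "(?si)(?<!\{)function\('(.*?)'\)"]

Local $aResult = _StringRegExpNested($sHtml, $aPatterns)
If Not @error Then ConsoleWrite(_ArrayToString($aResult, ",") & @CRLF) ; 22,33,55,66

; Hypothetical helper: applies $aPatterns[0] to $sText, then every following
; pattern to each match of the previous one (always with flag 3).
; Success: returns the matches of the last pattern.
; Failure: sets @error = 1 and @extended = index of the pattern that found nothing.
Func _StringRegExpNested($sText, Const ByRef $aPatterns)
    Local $aEmpty[0]
    Local $aCurrent[1] = [$sText]
    Local $aNext, $aFound, $iOld
    For $i = 0 To UBound($aPatterns) - 1
        $aNext = $aEmpty
        For $j = 0 To UBound($aCurrent) - 1
            $aFound = StringRegExp($aCurrent[$j], $aPatterns[$i], 3)
            If @error Then ContinueLoop ; no match (or bad pattern) for this piece
            ; Append this piece's matches to the collected results.
            $iOld = UBound($aNext)
            ReDim $aNext[$iOld + UBound($aFound)]
            For $k = 0 To UBound($aFound) - 1
                $aNext[$iOld + $k] = $aFound[$k]
            Next
        Next
        If UBound($aNext) = 0 Then Return SetError(1, $i, $aEmpty)
        $aCurrent = $aNext
    Next
    Return $aCurrent
EndFunc
```

Supporting per-pattern flags and distinguishing a bad pattern from "no match" is where the mess would start.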

EDIT: typo

Edited by footswitch


The whole concept makes one shudder.


George


Hang on a sec. Why don't you do this: http://www.autoitscript.com/forum/index.php?showtopic=119220&view=findpost&p=828731

Similar theory, you have opening brackets and closing brackets. Doing it on html would be more complicated, but if you keep going through to the next tag it's possible. And what you should remember is that regex is NOT magic. It's loops in strings.
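To illustrate the "loops in strings" point on this particular problem, here is a rough sketch that does the same job with StringInStr and StringMid only. It is not the code from the linked post; the function name and variables are made up for illustration, and it assumes well-formed, non-nested script tags and arguments that contain no "')" sequence.

```autoit
#include <Array.au3>

; Test string copied from the first post.
Local $sHtml = "{function('00')--<script--{function('11')--function('22')--function('33')--</script>--<script--{function('44')--function('55')--function('66')--</script>--function('77')"

Local $aResult = _ScanScriptFunctions($sHtml)
ConsoleWrite(_ArrayToString($aResult, ",") & @CRLF) ; 22,33,55,66

; Hypothetical helper: walks through the string one script block at a time.
Func _ScanScriptFunctions($sText)
    Local $aOut[0]
    Local $iPos = 1, $iOpen, $iClose, $iFunc, $iEnd
    While True
        ; Find the next script block (case-insensitive search).
        $iOpen = StringInStr($sText, "<script", 0, 1, $iPos)
        If $iOpen = 0 Then ExitLoop
        $iClose = StringInStr($sText, "</script>", 0, 1, $iOpen)
        If $iClose = 0 Then ExitLoop

        ; Walk through every function(' inside this block.
        $iFunc = $iOpen
        While True
            $iFunc = StringInStr($sText, "function('", 0, 1, $iFunc)
            If $iFunc = 0 Or $iFunc > $iClose Then ExitLoop
            $iEnd = StringInStr($sText, "')", 0, 1, $iFunc + 10) ; 10 = length of function('
            If $iEnd = 0 Or $iEnd > $iClose Then ExitLoop
            ; Keep the argument only if the call is not preceded by "{".
            If StringMid($sText, $iFunc - 1, 1) <> "{" Then
                ReDim $aOut[UBound($aOut) + 1]
                $aOut[UBound($aOut) - 1] = StringMid($sText, $iFunc + 10, $iEnd - $iFunc - 10)
            EndIf
            $iFunc = $iEnd + 2
        WEnd
        $iPos = $iClose + 9 ; 9 = length of </script>
    WEnd
    Return $aOut
EndFunc
```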


#13 ·  Posted (edited)

@Mat,

I like what you did. But then there's the whole HTML enchilada: lots of conditional arguments which I believe would really mess up the code.

Nested StringRegExps are precise, easy to understand, easy to tune-up and acceptably efficient.

Fighting for the best way of reinventing the wheel, are we? ;)

EDIT: typo

Edited by footswitch


No matter how many times you reinvent it, there is still going to be a flat spot someplace.


George


All I'm trying to point out is that although regex looks neat in AutoIt, it's just loops in strings, so if you were a computer and could think, you would think it was a mess. But computers aren't prone to expressing their opinions unless they are told to, so you get away with it.

You are right though, my method would get messy when you start to deal with strings to open and close rather than single characters, not to mention a host of other factors.
