b47chguru

Regex problem, need immediate help

25 posts in this topic

Hi,

I am working on a script that reads a web page's source and extracts its table elements, but my regex isn't working properly.

I want to extract the content between <table and </table> from the source.

code:

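#include <Array.au3> ; needed for _ArrayDisplay

$file = FileOpen("tyu.txt") ; opens for reading by default
$file_content = FileRead($file)
FileClose($file)
; (?s) lets . match line breaks; flag 3 returns an array of all matches
$table = StringRegExp($file_content, "(?s)<table((?s).*?)</table>", 3)
_ArrayDisplay($table)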

This is the text file: http://www.comfaca.com/aiyo.txt

However, the same regex works perfectly with the StringRegExpGui UDF.

Thanks in Advance.

#2 ·  Posted (edited)

You want to extract proxies from hidemyass using regular expressions? I've tried that, and you're not going to get very far.

If that is what you want to do, it gets incredibly difficult, and if you can't manage to extract a table from HTML, I don't see you succeeding with this anytime soon. If it's not the proxies you're after, I can't imagine why you would want to extract the proxy table from their website.

Also, there are far simpler ways to make a successful "scraper". Hidemyass looks pretty, but it doesn't have all the proxies.

P.S. They use about 7 or 8 different obfuscation methods to display their proxies, obviously because they don't like scrapers taking their data. On the user side the proxies show up looking nice, but the source is a gigantic, confusing maze of HTML and CSS trickery that only renders to the right numbers, and it's 10^100x simpler to just do it another way. This question is more suitable for a place like hackforums.net. They have working backdoored scrapers everywhere.

Never mind. I guess when I tried this a few months ago I knew less AutoIt; this seems to work.

Edited by FlutterShy

Thanks for your reply.

I have already succeeded in deobfuscating it. I don't understand why you said I should post this question on some hack forum when my question is about AutoIt regex.

It would be a great help if anyone could provide a solution to this regex problem.

#4 ·  Posted (edited)

"I don't understand why you said I should post this question on some hack forum when my question is about AutoIt regex."

Any regular expression that can be used in AutoIt can be used in most other languages.

"I have already succeeded in deobfuscating it."

I think you misunderstood me. But anyway, what the heck...

"It would be a great help if anyone could provide a solution to this regex problem."

$table = StringRegExp("<table HURRR DURRRF x3>YAY :D</table>", "(?s)(?i)<table(?i:[^>].*|)>(.*?)</table>",3)
ConsoleWrite($table[0] & @CR)
Edited by FlutterShy

Whatever regular expression that can be used in AutoIt can be used in most other languages.

I think you misunderstood me. But anyway, what the heck...

$table = StringRegExp("<table HURRR DURRRF x3>YAY :D</table>", "(?s)(?i)<table(?i:[^>].+?|)>(.*?)</table>",3)
ConsoleWrite($table[0] & @CR)

Your regex doesn't work.

Try it with the linked file.

#6 ·  Posted (edited)

b47chguru,

Here is what you are asking for, although it is unlikely that it is what you want.

(?m)<table[.\s\S]*</table

note: rookie with regexp

kylomas

Edit: Oops, forgot to add grouping. Try this pattern:

(?m)(?:<table)([.\s\S]*)(?:</table)

Edited by kylomas

Forum Rules         Procedure for posting code

"I like pigs.  Dogs look up to us.  Cats look down on us.  Pigs treat us as equals."

- Sir Winston Churchill

Hmm, I guess I should have tried it on the file first.

I'm wondering, though:

<td class="
jbjnbjnbjnbj

Mine only returns that, but why? It "looks" like it should have worked... :sweating:
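
One likely explanation, offered as a sketch rather than a confirmed diagnosis: with (?s) on, the greedy .* in the attribute part can run past the first ">" and swallow most of the input, so the capturing group is left with only whatever sits between the last usable ">" and </table>. The sample string and variable names below are made up for illustration; bounding the attribute part with [^>]* is one possible way around it.

```autoit
; Hypothetical one-table sample, not the real file from the thread.
Local $sHtml = '<table border="1"><tr><td>10.0.0.1</td></tr></table>'

; Greedy .* in the attribute part runs to the end of the input and only
; backtracks as far as it must, so group 1 keeps just the tail before </table>.
Local $aGreedy = StringRegExp($sHtml, "(?si)<table(?:[^>].*|)>(.*?)</table>", 3)
ConsoleWrite("greedy : [" & $aGreedy[0] & "]" & @CRLF)

; [^>]* cannot cross the first ">", so group 1 holds the whole table body.
Local $aBounded = StringRegExp($sHtml, "(?si)<table[^>]*>(.*?)</table>", 3)
ConsoleWrite("bounded: [" & $aBounded[0] & "]" & @CRLF)
```

Comparing the two bracketed outputs side by side should show how much the attribute part consumed in each case.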

#8 ·  Posted (edited)

@FlutterShy - maybe because you are not in multiline mode; I don't know for sure. I got a headache from thinking about regexp for 2 minutes.

kylomas

@b47chguru - don't assume 1 match, iterate through the results array.

Edit: not true, \s includes the EOL chars.

Edited by kylomas
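
To make the point about line breaks concrete, here is a minimal sketch, assuming a made-up three-line string: a plain . stops at a line break, while either (?s) or a [\s\S] class lets the match span lines. Multiline mode (?m) only changes what ^ and $ match, so it is not what decides this.

```autoit
; Hypothetical sample with CRLF line breaks inside the table.
Local $sText = "<table>" & @CRLF & "row" & @CRLF & "</table>"

; Without (?s), . does not match across the line breaks, so no match is found.
Local $aPlain = StringRegExp($sText, "<table>(.*?)</table>", 3)
ConsoleWrite("plain .  -> @error = " & @error & @CRLF)

; (?s) makes . match line breaks too.
Local $aDotAll = StringRegExp($sText, "(?s)<table>(.*?)</table>", 3)
ConsoleWrite("(?s) .   -> @error = " & @error & @CRLF)

; [\s\S] matches any character, line breaks included, with no option needed.
Local $aClass = StringRegExp($sText, "<table>([\s\S]*?)</table>", 3)
ConsoleWrite("[\s\S]   -> @error = " & @error & @CRLF)
```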

Wut, I thought you always had to specify a capturing group within the regexp. I guess you learn something every day...

$table = StringRegExp(FileRead(".\file.txt"), "(?s)(?i)<table(?i:[^>].*|)>.*</table>",3)
ConsoleWrite($table[0] & @CR)

@Fluttershy - see my addendum, the first pattern was wrong!!!


@kylomas The regex still doesn't work with the file.

The funny thing is that the regex works with the StringRegExpGui UDF.

b47chguru,

Define "still doesn't work". I copied your file to a regexp tester and it worked fine. Are you sure that you saw my update? The 1st pattern was incorrect.

kylomas


@kylomas

Sorry, my bad... my first regex was working all along; the problem is that _ArrayDisplay doesn't display it.

Anyway, thanks for helping out! :)

#14 ·  Posted (edited)

Yeah, you're right, but you should be using "(?s)<table(.*?)</table>": since you already set (?s) once, you don't have to do it again, FYI. Lol, wow, I just did some testing and no, this is not correct.

Also, you're trying to extract the proxies, am I correct? Why not use tbody instead of table?

Edited by FlutterShy

b47chguru,

Sure about that? Why would _ArrayDisplay not display an array, given that the regexp returned one? You can check @error after a regexp call to make sure of the result.

Flag = 3 or 4:

@error  Meaning
0       Array is valid.
1       Array is invalid. No matches.
2       Bad pattern, array is invalid. @extended = offset of error in pattern.

kylomas
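
As a rough sketch of that check (the sample HTML and variable names are assumptions for illustration, not taken from the linked file), @error and @extended can be saved right after the call and then branched on:

```autoit
; Hypothetical two-table sample.
Local $sHtml = "<table id='a'><tr><td>1</td></tr></table>" & @CRLF & _
        "<table id='b'><tr><td>2</td></tr></table>"

Local $aTables = StringRegExp($sHtml, "(?s)<table(.*?)</table>", 3)
; Save both macros immediately; the next function call resets them.
Local $iError = @error, $iExtended = @extended

Switch $iError
    Case 0
        ConsoleWrite("Matches found: " & UBound($aTables) & @CRLF)
    Case 1
        ConsoleWrite("No matches." & @CRLF)
    Case 2
        ConsoleWrite("Bad pattern, error at offset " & $iExtended & @CRLF)
EndSwitch
```

If this reports a valid array while _ArrayDisplay still shows nothing useful, the regex itself is probably not the problem.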


#16 ·  Posted (edited)

Or grab everything that is not HTML, like this:

StringRegExp($str, '>([^<].*?)<', 3)

Edited by kylomas

Yes, the _ArrayDisplay function doesn't display the array element, in this case $table[0].

That is my point, do NOT assume just one match, iterate through the array.
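
A minimal sketch of what iterating could look like, assuming a made-up input string and writing to the console instead of relying on _ArrayDisplay:

```autoit
; Hypothetical sample with two tables.
Local $sHtml = "<table>first</table><p>gap</p><table>second</table>"
Local $aTables = StringRegExp($sHtml, "(?s)<table[^>]*>(.*?)</table>", 3)

If @error Then
    ConsoleWrite("No tables found (or bad pattern)." & @CRLF)
Else
    ; Visit every returned element instead of reading only $aTables[0].
    For $i = 0 To UBound($aTables) - 1
        ; StringLeft keeps the console readable when a table is very long.
        ConsoleWrite("Table " & $i & ": " & StringLeft($aTables[$i], 80) & @CRLF)
    Next
EndIf
```

Printing a truncated preview per element is just one design choice here; it avoids depending on how a viewer copes with very long strings.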


@FlutterShy

Yes, I am trying to extract proxies. I have made a partial CSS interpreter to decode it into the IP, and it works. The only problem I had was with this regex.

@kylomas

Yes, :) thank you very much for helping to solve this problem.

#20 ·  Posted (edited)

OK, I wanted to test myself for some reason and felt an urge to do it, especially after I asked how you deobfuscated it and you replied that you'd post it in the Example Scripts sometime. People who say that rarely actually do.

So here it is. This recovers every single proxy by parsing the HTML and CSS.

#AutoIt3Wrapper_AU3Check_Parameters=-d -w 1 -w 2 -w 3 -w- 4 -w 5 -w 6 -w- 7

; #FUNCTION# ====================================================================================================================
; Name ..........: _UnHideMyAss
; Description ...: Recovers proxies from hidemyass.com
; Syntax ........: _UnHideMyAss($HTML)
; Parameters ....: $HTML - HTML web source.
; Return values .: A whitespace-delimited string list
; Author ........: FlutterShy
; Modified ......:
; Remarks .......: will most likely stop working after a month from this post
; Example .......: No
; ===============================================================================================================================
Func _UnHideMyAss($HTML)
    Local $tables = StringRegExp($HTML, "(?s)(?i)<tbody.*>.*</tbody>", 3) ; extract entire table
    If @error Then Return SetError(1, 0, 0)

    Local $aBody = StringRegExp($tables[0], "(?s)<tr(?i:[^>].*)>.*</tr>", 3) ; get the smaller table
    If @error Then Return SetError(2, 0, 0)

    Local $Fields = StringRegExp($aBody[0], "<tr(?i:[^>].*)>((?s).*?)</tr>", 3) ; separate the groups of entries
    If @error Then Return SetError(3, 0, 0)

    Local $Step
    Local $Out
    Local $Styles
    Local $TempStyles

    For $o = 0 To UBound($Fields) - 1
        $HTML = StringReplace(StringReplace($Fields[$o], @CR, ""), @LF, "") ; remove all line breaks

        $aBody = StringRegExp($HTML, "<style>(.*?)</style>", 3) ; extract CSS styles
        If @error Then Return SetError(4, $o, 0)

        $Styles = StringRegExp($aBody[0], "\.(.*?){display:(.+?)}", 3) ; get CSS values
        If @error Then ContinueLoop

        $TempStyles = $Styles

        ReDim $Styles[200][2]
        $Step = 0

        For $I = 0 To (UBound($TempStyles) - 1) Step 2 ; load styles to array
            $Styles[$Step][0] = $TempStyles[$I]
            $Styles[$Step][1] = $TempStyles[$I + 1]
            $Step += 1
        Next

        ReDim $Styles[$Step][2]

        $aBody = StringRegExp($HTML, '(?s)</style>.*?<span class="country">', 3) ; get the actual obfuscated proxy content
        If @error Then ContinueLoop

        $aBody[0] = StringRegExpReplace($aBody[0], '<span\sstyle="display:\s?none">.*?</span>', "") ; remove the ones that will not show up
        $aBody[0] = StringRegExpReplace($aBody[0], '<div\sstyle="display:\s?none">.*?</div>', "") ; separate regexp to avoid confusion

        For $I = 0 To UBound($Styles) - 1
            If $Styles[$I][1] == "none" Then _ ; remove elements whose CSS class is set to display:none
                    $aBody[0] = StringRegExpReplace($aBody[0], '<(?:span|div)\sclass="' & $Styles[$I][0] & '">.*?</span>', "")
        Next

        $aBody[0] = StringRegExpReplace($aBody[0], '<(?:span|div).*?>([.\d]*?)</(?:span|div)>', "$1") ; remove dummy tags
        $aBody[0] = StringRegExpReplace(StringStripWS($aBody[0], 8), '<td>([^>]\d+)</td>', ":$1") ; set port
        $aBody[0] = StringRegExpReplace($aBody[0], '<[^<]*>', "") ; remove everything else now

        $Out &= $aBody[0] & " "
    Next

    Return SetError(0, 0, $Out)
EndFunc

Global $Result = _UnHideMyAss('$Source') ; '$Source' is a placeholder: pass the page's HTML source here
Global $Discriminate = "312[8-9]|28134|54321|45612|443|1\d{2,3}|9\d{3}|8\d{1,3}"
Global $aResult = StringRegExp($Result, "((?:25[0-5]|2[0-4]\d|1?[1-9]\d?|1\d\d)\.(?:(?:25[0-5]|2[0-4]\d|1?[1-9]\d?|1\d\d|0)\.){2}(?:25[0-5]|2[0-4]\d|1?[1-9]\d?|1\d\d|0):(?:" & $Discriminate & "))", 3)
Global $Match = 0
Global $OutPut

$Result = StringSplit($Result, " ", 2)
For $A = 0 To UBound($Result) - 1
    For $I = 0 To UBound($aResult) - 1
        If ($Result[$A] == $aResult[$I]) Then $Match = 1
    Next
    Switch $Match
        Case 0
            $Output &= $Result[$A] & @CRLF

        Case 1
            $Output &= $Result[$A] & @TAB & " HURR DUURRFF :D" & @CRLF
    EndSwitch
    $Match = 0
Next

ConsoleWrite($Output)

OMG THIS SITE!

It stopped working already :P

Very easy to fix though; I'll let anyone who's interested figure it out.

Edited by FlutterShy
