Proxy Problems - RegExp

Chance

I have a huge list of proxies with ports in IP:PORT format.

I need to extract all the proxies whose port falls in the range 80-8081, which includes common ports like 3128 and 8080, and ignore all the rest.

The regular expression I'm using is "(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}:\d{1,4})". How can I do this using a single capturing regexp statement?

Chance

Ok, so I think I know what needs to be done.

((?:\d{1,3}\.){3}\d{1,3}:(?:8080|3128|80|8081))

Just read through the help file, apparently it's just that simple... :P

Edited by FlutterShy

Chance

Well, I read up some more, and this is what I've got.

(?i)((?:(?:25[0-5]|2[0-4][0-9]|[1]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?):(?:[some port|statements]))

^taken from net

Apparently, this regexp validates the IP address to make sure that it's not in those funny ranges. I had to do this because with these lists you just don't know what you're going to get thrown at you.

Still, I do have a problem: I have a huge list of IP:PORT addresses and I wanted to make a regexp that would only pick up valid addresses, but I'm not smart enough to do it..... ( ._.)

1.237.43.340:80
421.535.123.123:8080
53.12.2.2:8080
14.55.01.255:443
164.77.82.21:80
202.149.78.234:8080
60.2.227.123:3128

From the above examples, the regexp will only pick up 5 entries, I think, one of which should not be picked up: "14.55.01.255:443", specifically because of the 01 bit. If I remember correctly, IPs shouldn't begin with a 0.

I'm not smart enough to develop a regexp that filters out fake IP addresses like those ;_;

If anyone could be so kind as to help me out?
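
For anyone following along, here is a minimal sketch of how the pattern above behaves on the sample list. The variable names are made up for illustration, the dots are escaped as `\.`, both octet groups use `[01]?`, and `\d{1,5}` is only a stand-in for the unspecified port alternatives.

```autoit
; Sketch only: $sList, $sOctet and $sPattern are illustrative names, and
; \d{1,5} is an assumed placeholder for the port alternatives.
Local $sList = "1.237.43.340:80" & @CRLF & _
        "421.535.123.123:8080" & @CRLF & _
        "53.12.2.2:8080" & @CRLF & _
        "14.55.01.255:443" & @CRLF & _
        "164.77.82.21:80" & @CRLF & _
        "202.149.78.234:8080" & @CRLF & _
        "60.2.227.123:3128"

; One octet in the range 0-255; leading zeros such as "01" are accepted.
Local $sOctet = "(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)"
Local $sPattern = "(?:" & $sOctet & "\.){3}" & $sOctet & ":\d{1,5}"

; Flag 3 returns an array holding every match found in the string.
Local $aMatches = StringRegExp($sList, $sPattern, 3)
If @error Then
    ConsoleWrite("No matches found." & @CRLF)
Else
    For $i = 0 To UBound($aMatches) - 1
        ConsoleWrite($aMatches[$i] & @CRLF)
    Next
EndIf
```

Run against the sample list, this should skip the first two lines but still print 14.55.01.255:443, because `[01]?[0-9][0-9]?` accepts "01" as an octet.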

Beege

That's kind of tricky. Zero can be valid, and so can one; that's why the function passes it. (25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?) just verifies it's a value from 0 to 255.

If you have a bunch like that I would add a second check for that. So something like:

(?i)((?:(?:25[0-5]|2[0-4][0-9]|[1]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?):(?:[some port|statements])) and not StringRegExp($ip, '\.0\d?')
Edited by Beege
Chance

If you have a bunch like that I would add a second check for that. So something like:

(?i)((?:(?:25[0-5]|2[0-4][0-9]|[1]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?):(?:[some port|statements])) and not StringRegExp($ip, '\.0\d?')

I'm sorry, I think that's exactly what I just posted............... ( ._.)

Unless I'm missing something. It's still picking up exactly what the other RegExp was picking up.

I'm sorry, I just don't know too much regexp.....

Edited by FlutterShy

Robjong

Hi,

Give this a try. It allows .0 or .1 but not .01, and the port can be between 80 and 8081.

'\b(?:(?:25[0-5]|2[0-4][0-9]|1?[1-9][0-9]?|0)\.){3}(?:25[0-5]|2[0-4][0-9]|1?[0-9][0-9]?|0):(?:[8-9]\d|[1-9]\d\d|[1-7]\d{3}|80(?:[0-7]\d|8[01]))\b'
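
To see how the port part covers 80 to 8081, here is a small sketch that tests the port alternation on its own. It is anchored with `^` and `$` only for this test, and the port list and variable names are made up for illustration.

```autoit
; Sketch only: the port sub-pattern from the expression above, anchored so
; that each test string must be a complete port number.
Local $sPortPattern = "^(?:[8-9]\d|[1-9]\d\d|[1-7]\d{3}|80(?:[0-7]\d|8[01]))$"

; Hypothetical test values around the edges of the 80-8081 range.
Local $aPorts[8] = ["79", "80", "443", "3128", "8080", "8081", "8082", "65535"]

For $i = 0 To UBound($aPorts) - 1
    ; With the default flag, StringRegExp returns 1 on a match and 0 otherwise.
    ConsoleWrite($aPorts[$i] & " -> " & StringRegExp($aPorts[$i], $sPortPattern) & @CRLF)
Next
```

The four alternatives cover 80-99, 100-999, 1000-7999 and 8000-8081 in turn, so 79, 8082 and 65535 should report 0 and the rest 1.
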
Edited by Robjong
BrewManNH

If you're looking for a way to validate an IP address, try this snippet that I came up with that will validate an IPv4 address as valid or not. Validating the port number after that is simple.


If I posted any code, assume that code was written using the latest release version unless stated otherwise. Also, if it doesn't work on XP I can't help with that because I don't have access to XP, and I'm not going to.
Give a programmer the correct code and he can do his work for a day. Teach a programmer to debug and he can do his work for a lifetime - by Chirag Gude
How to ask questions the smart way!

I hereby grant any person the right to use any code I post, that I am the original author of, on the autoitscript.com forums, unless I've specifically stated otherwise in the code or the thread post. If you do use my code all I ask, as a courtesy, is to make note of where you got it from.

Back up and restore Windows user files _Array.au3 - Modified array functions that include support for 2D arrays.  -  ColorChooser - An add-on for SciTE that pops up a color dialog so you can select and paste a color code into a script.  -  Customizable Splashscreen GUI w/Progress Bar - Create a custom "splash screen" GUI with a progress bar and custom label.  -  _FileGetProperty - Retrieve the properties of a file  -  SciTE Toolbar - A toolbar demo for use with the SciTE editor  -  GUIRegisterMsg demo - Demo script to show how to use the Windows messages to interact with controls and your GUI.  -   Latin Square password generator

Chance

If you're looking for a way to validate an IP address, try this snippet that I came up with that will validate an IPv4 address as valid or not. Validating the port number after that is simple.

Thanks, I'll find some use for this.

That's kind of tricky. Zero can be valid, and so can one; that's why the function passes it. (25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?) just verifies it's a value from 0 to 255.

If you have a bunch like that I would add a second check for that. So something like:

(?i)((?:(?:25[0-5]|2[0-4][0-9]|[1]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?):(?:[some port|statements])) and not StringRegExp($ip, '\.0\d?')

Sorry, I didn't see exactly what you were doing the first time. As it turns out, this does work; it's just that I need to keep it within the regular expression, because I can't throw a single IP at it at a time. This is meant to go through thousands at a time.

Hi,

Give this a try. It allows .0 or .1 but not .01, and the port can be between 80 and 8081.

'\b(?:(?:25[0-5]|2[0-4][0-9]|1?[1-9][0-9]?|0)\.){3}(?:25[0-5]|2[0-4][0-9]|1?[1-9][0-9]?|0):(?:[8-9]\d|[1-9]\d\d|[1-7]\d{3}|80(?:[0-7]\d|8[01]))\b'

YES!

This got me on track. It keeps those funny addresses out flawlessly, as far as I can tell so far. I've modified it a bit and threw a huge list of IPs I know are valid at it, and it picked them all up; then I mixed in some tricky ones with zeros and odd ranges and it filtered those out perfectly.

(?i)((?:(?:25[0-5]|2[0-4][0-9]|1?[1-9][0-9]?|1[0-9][0-9]|0)\.){3}(?:25[0-5]|2[0-4][0-9]|1?[1-9][0-9]?|1[0-9][0-9]|0):(?:8[0-9]{3}|8[0-9]|3128|28134|54321|45612|443))

The reason I add "(?i)" is because I'm assuming that it helps the RegExp operation go a little faster since it's not checking for case sensitivity.

Robjong

The reason I add "(?i)" is because I'm assuming that it helps the RegExp operation go a little faster since it's not checking for case sensitivity.

That assumption is wrong; the pattern contains no characters that have a case difference.

If you go with your pattern you should add a word boundary "\b" to the end of it, otherwise it might still match some ports you do not want.

For example, your pattern matches any 8xxx and 8x port, but it would also match the first two digits of 8xx numbers, resulting in non-existent or non-working ip:port combinations.
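
A quick sketch of that partial match, using a simplified address pattern and only the 8xxx and 8x port alternatives. The variable names and the sample line are made up for illustration.

```autoit
; Sketch only: port 808 is neither an 8x nor an 8xxx port, so it should be rejected.
Local $sLine = "164.77.82.21:808"

; Simplified pattern without a trailing word boundary, and the same with one.
Local $sLoose = "\d{1,3}(?:\.\d{1,3}){3}:(?:8[0-9]{3}|8[0-9])"
Local $sStrict = $sLoose & "\b"

; Without \b the engine stops after "80" and reports a truncated port.
Local $aLoose = StringRegExp($sLine, $sLoose, 3)
If Not @error Then ConsoleWrite("Without \b: " & $aLoose[0] & @CRLF)

; With \b there is no boundary between "80" and the trailing "8", so nothing matches.
Local $aStrict = StringRegExp($sLine, $sStrict, 3)
If @error Then ConsoleWrite("With \b: no match" & @CRLF)
```

The first call returns 164.77.82.21:80, an address and port that never appeared in the list; the second call finds nothing.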

Edit: removed unintentional smiley.

Edited by Robjong

Bowmore

The reason I add "(?i)" is because I'm assuming that it helps the RegExp operation go a little faster since it's not checking for case sensitivity.

I've not done any tests on this, but I would expect a case-insensitive RegEx to be slightly slower if there is any difference. Somewhere in the regex engine's code it will have to do a comparison something like this:

If char = A or char = a Then

rather than just

If char = A Then

Edited by Bowmore

"Programming today is a race between software engineers striving to build bigger and better idiot-proof programs, and the universe trying to build bigger and better idiots. So far, the universe is winning."- Rick Cook

Beege

Hi,

Give this a try. It allows .0 or .1 but not .01, and the port can be between 80 and 8081.

'\b(?:(?:25[0-5]|2[0-4][0-9]|1?[1-9][0-9]?|0)\.){3}(?:25[0-5]|2[0-4][0-9]|1?[1-9][0-9]?|0):(?:[8-9]\d|[1-9]\d\d|[1-7]\d{3}|80(?:[0-7]\d|8[01]))\b'

Ha! I knew you'd be able to nail this one! :graduated:

Chance

If you go with your pattern you should add a word boundary "\b" to the end of it, otherwise it might still match some ports you do not want.

For example, your pattern matches any 8xxx and 8x port, but it would also match the first two digits of 8xx numbers, resulting in non-existent or non-working ip:port combinations.

I've read over the help file and I still don't understand what this "word boundary" thing does, so I performed various tests and found that it reduces the number of legit IP:PORT addresses it picks up, while not using it allows the RegExp pattern to work flawlessly. The IPs were separated only by line break characters (CR+LF) in my tests, about two hundred of them.

I've not done any tests on this, but I would expect a case-insensitive RegEx to be slightly slower if there is any difference. Somewhere in the regex engine's code it will have to do a comparison something like this:

If char = A or char = a Then

rather than just

If char = A Then

That assumption is wrong, there are no characters that have a case difference.

OK, so I performed 4 tests using a file about 1,500,000 lines long, with an IP:PORT on each line, and the results weren't what I was expecting...

;Case Insensitive = 64181.3263976287, 68533.5281439401

;Case Sensitive = 63767.234662506, 63963.4948017136

Robjong

Ha! I knew you'd be able to nail this one! :graduated:

Haha, I was waiting for you to do it :P (Hi btw)

I've read over the help file and I still don't understand what this "word boundary" thing does, so I performed various tests and found that it reduces the number of legit IP:PORT addresses it picks up, while not using it allows the RegExp pattern to work flawlessly. The IPs were separated only by line break characters (CR+LF) in my tests, about two hundred of them.

It is really quite easy: as its name suggests, the word boundary is related to the word character class ( \w ).

It matches at the boundary of a word but does not consume an actual character (it is a zero-width assertion). In other words, it matches between a word character ( A-Z a-z 0-9 _ ) and a non-word character, or between a word character and the start or end of the string.

For example, if you wanted to match groups of 3 digits, you might write a pattern like this.

#include <Array.au3>
$aMatches = StringRegExp("123 456 7890", "\d{3}", 3) ; matches 0:123, 1:456, 2:789
_ArrayDisplay($aMatches)

This matches "123", "456" and "789". Now you can see the problem: the "789" was not originally a group of 3 digits. Now let's try it with boundaries.

#include <Array.au3>
$aMatches = StringRegExp("123 456 7890", "\b\d{3}\b", 3) ; matches 0:123, 1:456
_ArrayDisplay($aMatches)

I hope this clears it up a bit.

OK, so I performed 4 tests using a file about 1,500,000 lines long, with an IP:PORT on each line, and the results weren't what I was expecting...

I'm betting you did not start the test with an untimed SRE call to start up the engine? ( The first SRE call is significantly slower ;) )
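
A rough sketch of how such a comparison could be set up, with one untimed call first. The pattern, sample string and run count are placeholders, not the ones used in the tests above.

```autoit
; Sketch only: placeholder pattern, sample and run count for a timing comparison.
Local $sPattern = "\d{1,3}(?:\.\d{1,3}){3}:\d{1,5}"
Local $sPatternCI = "(?i)" & $sPattern
Local $sSample = "202.149.78.234:8080"
Local $iRuns = 100000

; Warm-up call that is deliberately left out of the timings.
StringRegExp($sSample, $sPattern)

; Time the case-sensitive pattern.
Local $hTimer = TimerInit()
For $i = 1 To $iRuns
    StringRegExp($sSample, $sPattern)
Next
Local $fSensitive = TimerDiff($hTimer)

; Time the same pattern with (?i) prepended.
$hTimer = TimerInit()
For $i = 1 To $iRuns
    StringRegExp($sSample, $sPatternCI)
Next
Local $fInsensitive = TimerDiff($hTimer)

; TimerDiff reports elapsed time in milliseconds.
ConsoleWrite("Case sensitive:   " & $fSensitive & " ms" & @CRLF)
ConsoleWrite("Case insensitive: " & $fInsensitive & " ms" & @CRLF)
```

Running each variant several times and swapping their order helps show whether any difference is larger than the run-to-run noise.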

Edit: tidy.

Edited by Robjong
Chance

It was still catching fake addresses due to the 0 alternative in the three repeated octets, so obviously it has to be even bigger and more monstrous to work correctly.

((?:25[0-5]|2[0-4][0-9]|1?[1-9][0-9]?|1[0-9][0-9])\.(?:(?:25[0-5]|2[0-4][0-9]|1?[1-9][0-9]?|1[0-9][0-9]|0)\.){2}(?:25[0-5]|2[0-4][0-9]|1?[1-9][0-9]?|1[0-9][0-9]|0):(?:8[0-9]{3}|3128|28134|54321|45612|443|8[0-9]))

Edit: OK, so after reading the comments below, it's become obvious that addresses with a leading 0 are not fake. The problem is that I'd rather ignore these, because some people like to keep lists with each octet padded to 3 characters. I've tested over 200,000 and found that these rarely ever work, so I find it better to just skip them entirely in order to avoid wasting any time.

Edited by FlutterShy

BrewManNH

I've never found a 100% reliable regex that will validate every possible IP address without errors. There was a thread on (I think) codeproject where someone tried to solicit the best regex to do it, and after about 50 tries they never came up with a bulletproof way to do it in one line. There are just far too many variables and exceptions and allowances from what I saw.

Which is why I created my IP address validator function. It may not be lightning fast, but at least it works.


Robjong

I've never found a 100% reliable regex that will validate every possible IP address without errors. There was a thread on (I think) codeproject where someone tried to solicit the best regex to do it, and after about 50 tries they never came up with a bulletproof way to do it in one line. There are just far too many variables and exceptions and allowances from what I saw.

Which is why I created my IP address validator function. It may not be lightning fast, but at least it works.

I wrote this SRE version based on the same rules your snippet enforces, as far as I can tell it works.

;===============================================================================
; Description.......: Check if a given IP address is a valid IPv4 address.
; Parameter(s)......: $sIP - The IPv4 address to validate.
; Requirement.......:
; Return Value(s)...: Success - 1
;                     Failure - 0, and sets @error to 1
; Author(s).........: Robjong (SRE version of _ValidIP by BrewManNH : http://www.autoitscript.com/wiki/Snippets_%28_Internet_%29#ValidIP.28.29_.7E_Author_-_BrewManNH)
; Remarks ..........: This will accept an IP address that is 4 octets long, and contains only numbers and falls within
;                     valid IP address values. Class A networks can't start with 0 or 127. 169.xx.xx.xx is reserved and is
;                     invalid and any address that starts above 239, ex. 240.xx.xx.xx is reserved. The address range
;                     224-239 is reserved as well for Multicast groups but can be a valid IP address range if you're using
;                     it as such. Any IP address ending in 0 or 255 is also invalid for an IP.
;===============================================================================
Func _IsValidIPv4($sIP)
    Local $fRes = StringRegExp($sIP, "\A(?!(127|169|0{1,3})\.)(2[0-3]\d|[01]?\d\d?)(\.(25[0-5]|2[0-4]\d|[01]?\d\d?)){2}\.(25[0-4]|2[0-4]\d|1\d\d|0?[1-9]\d?|0{0,2}[1-9])\z")
    Return SetError(Not $fRes, 0, $fRes)
EndFunc   ;==>_IsValidIPv4

I also noticed your version allows addresses like 01.02.03.04, that should not be allowed should it? (it should, see next post)

Another thing I was curious about was this line:

$dString &= StringRight(Hex($aArray[$I]), 2)

... is there a reason you are not using it like this..?

Hex($aArray[$I], 2)

To get back on topic, to use this to parse the proxy list this should help:

Func _ParseProxyList($sString)
    ; Flag 3 returns an array of all IP:PORT matches found in $sString
    Return StringRegExp($sString, "\b(?!(?:127|169|0{1,3})\.)(?:2[0-3]\d|1\d\d|0?\d\d?)(?:\.(?:25[0-5]|2[0-4]\d|[01]?\d\d?)){2}\.(?:25[0-4]|2[0-4]\d|[01]?[1-9]\d?|0{0,2}[1-9]):(?:[8-9]\d|[1-9]\d\d|[1-7]\d{3}|80(?:[0-7]\d|8[01]))\b", 3)
EndFunc ;==>_ParseProxyList
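
A possible way to call it, as a sketch. The function is repeated here so the snippet runs on its own, with the parameter passed through as `$sString`, and the short input list is made up from addresses earlier in the thread plus one loopback entry.

```autoit
; Sketch only: $sList is a hypothetical input; a real list would normally
; come from a file, for example via FileRead.
Local $sList = "53.12.2.2:8080" & @CRLF & _
        "127.0.0.1:8080" & @CRLF & _
        "202.149.78.234:8080" & @CRLF & _
        "60.2.227.123:3128"

Local $aProxies = _ParseProxyList($sList)
If @error Then
    ConsoleWrite("No usable proxies found." & @CRLF)
Else
    For $i = 0 To UBound($aProxies) - 1
        ConsoleWrite($aProxies[$i] & @CRLF)
    Next
EndIf

; Same pattern as the function above; flag 3 returns all matches as an array.
Func _ParseProxyList($sString)
    Return StringRegExp($sString, "\b(?!(?:127|169|0{1,3})\.)(?:2[0-3]\d|1\d\d|0?\d\d?)(?:\.(?:25[0-5]|2[0-4]\d|[01]?\d\d?)){2}\.(?:25[0-4]|2[0-4]\d|[01]?[1-9]\d?|0{0,2}[1-9]):(?:[8-9]\d|[1-9]\d\d|[1-7]\d{3}|80(?:[0-7]\d|8[01]))\b", 3)
EndFunc   ;==>_ParseProxyList
```

With this input the 127.0.0.1 line should be skipped by the negative lookahead, while the other three entries are returned.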

Edit 1: credits + cleaning

Edit 2: made groups non-capturing

Edit 3: updated source to allow for leading zero (see next posts)

Edit 4: cleaned patterns up a bit

Edited by Robjong
BrewManNH

A class A IP address range goes from 1 to 127, so 01.02.03.04 is a valid IP address.

As to the Hex statement, either is equally valid; yours is probably the better choice, with fewer commands to parse.

As to your RegEx, it fails if any of the octets start with a zero, yet that's a perfectly valid IP address, as the leading zero is ignored when setting an IP address or when you ping one.


Robjong

OK, thanks for the response. The pattern fails for leading zeros because, as the first question might have given away, I was under the impression that it was invalid, I should have known better.

I will update the script in my previous post after I have some dinner.

Robjong

I have updated the functions in my previous post, they now allow leading zeros.

Chance

OK, the point of my thread is that only the proxies most likely to work are accepted. A small minority of the other proxies might work, but most don't.

I've tested a lot, and I mean A LOT!

And this is the best I could come up with.

((?:25[0-5]|2[0-4]\d|1?[1-9]\d?|1\d\d)\.(?:(?:25[0-5]|2[0-4]\d|1?[1-9]\d?|1\d\d|0)\.){2}(?:25[0-5]|2[0-4]\d|1?[1-9]\d?|1\d\d|0):(?:312[8-9]|28134|54321|45612|443|1\d{2,3}|9\d{3}|8\d{1,3}))

The trick is the port. I don't know why, but these ports seem to work more often than others.

Now, I'm not saying that anyone above me is wrong, and to be honest, I'm not too good with this stuff, but this has yielded the best results so far, possibly because of the port filtering part.

Edited by FlutterShy
