AutID

StringRegEx get all URLs domain's names

5 posts in this topic

#1 ·  Posted (edited)

As the title says, I am trying to find a pattern that extracts the URLs from different strings.

I want to capture from protocol, which is http in most of the cases, to the suffix of the domain name.

The problem is that URLs use many different domain suffixes, which makes it a little tricky.

At first I wrote something like this:

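#include <Array.au3>
Local $sUrl = "http://www.google.com/(random expanded link)"
; Flag 2 returns the full match in [0] and the captured group in [1].
; The dot before "com" is escaped so it matches a literal dot only.
Local $aArray = StringRegExp($sUrl, '(?i)http://(.*?)\.com', 2)
If Not @error Then
    _ArrayDisplay($aArray)
EndIf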

However, if the suffix is something other than .com, for example .net or .org, this pattern fails.
Then I thought of creating an array of the most popular suffixes and looping over it to get all the domain names, but that would take a lot of code that could be avoided if I had better regex skills.

Finally I came up with this pattern, but I am not 100% sure it captures everything, and I sometimes get weird results.

$pattern = "(?<Protocol>\w+):\/\/(?<Domain>[\w.]+\/?)\S*"
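One possible source of the weird results, shown as a small sketch: the Domain class [\w.] does not include hyphens, and the optional trailing slash sits inside the Domain group. The test URLs below are hypothetical and chosen only to illustrate this.

```autoit
Local $sPattern = "(?<Protocol>\w+):\/\/(?<Domain>[\w.]+\/?)\S*"

; Hypothetical test URLs, for illustration only.
Local $aTest[2] = ["http://www.google.com/search", _
                   "http://my-site.example.com/page"]

For $i = 0 To UBound($aTest) - 1
    ; Flag 1 returns the captured groups by index: [0] = Protocol, [1] = Domain.
    Local $aMatch = StringRegExp($aTest[$i], $sPattern, 1)
    If Not @error Then
        ConsoleWrite($aMatch[0] & @TAB & $aMatch[1] & @TAB & $aTest[$i] & @CRLF)
    EndIf
Next
```

For the first URL the Domain group should come back with the trailing slash attached ("www.google.com/"), and for the second it should stop at the hyphen ("my"), with \S* swallowing the rest.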

Does anyone have any better ideas?

 

Edit: Hmm, I tried this simple pattern and it seems to work pretty well.

Local $aArray = StringRegExp($sUrl, '(?i)http://(.*?)/', 2)
Either I'm very tired or it was very simple. Any opinions?
  Edited by AutID


'(?i)http://([^/]+)'

:)
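A minimal sketch of why the negated class is the safer choice: the lazy pattern from the first post needs a slash after the host, so it finds nothing when the URL ends right after the domain. The helper _FirstGroup and the test URLs are hypothetical names made up for this comparison.

```autoit
; Hypothetical helper: returns the first captured group, or a marker on no match.
Func _FirstGroup($sText, $sPattern)
    Local $aMatch = StringRegExp($sText, $sPattern, 1)
    If @error Then Return "(no match)"
    Return $aMatch[0]
EndFunc

; Hypothetical test URLs, for illustration only.
Local $aTest[3] = ["http://www.google.com/some/path", _
                   "http://www.example.org", _
                   "http://sub.example.net/index.html"]

For $i = 0 To UBound($aTest) - 1
    ; Lazy pattern: requires a "/" after the host.
    Local $sLazy = _FirstGroup($aTest[$i], '(?i)http://(.*?)/')
    ; Negated class: stops at the first "/" or at the end of the string.
    Local $sNegated = _FirstGroup($aTest[$i], '(?i)http://([^/]+)')
    ConsoleWrite($aTest[$i] & @TAB & $sLazy & @TAB & $sNegated & @CRLF)
Next
```

Note that ([^/]+) still includes a port or user part if one is present, since it only stops at a slash.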


#3 ·  Posted (edited)

For more complex URLs, you can use something like this regex:

Local $aUrl[9] = ["http://server:12345/path/blabla", _ 
                  "http://server.com:1234/path?query_string#fragment_id", _
                  "ftp://user:password@server:1234/path", _
                  "ftp://user@server:1234/path", _
                  "http://www.server.com", _
                  "www.server.com/path", _
                  "server.com", _
                  "http://user@server.com:1234/path?query_string#fragment_id", _
                  "user@server.com:1234" ]
                  
Local $sPattern = "^(?i)(?:(?:[a-z]+):\/\/)?" & _ ; Protocol
                  "(?:(?:(?:[^@:]+))" & _         ; Username
                  "(?::(?:[^@]+))?@)?" & _        ; Password
                  "([^\/:]+)" & _                 ; Host
                  "(?::(?:\d+))?" & _             ; Port
                  "(?:\/(?:[^?]+)?)?" & _         ; Path
                  "(?:\?\N+)?"                    ; Query


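For $i = 0 To UBound($aUrl) - 1
    ; Flag 1 returns an array of captured groups; the host is the only group.
    Local $aHost = StringRegExp($aUrl[$i], $sPattern, 1)
    If @error Then ContinueLoop ; skip strings that do not match
    ConsoleWrite($aHost[0] & @TAB & $aUrl[$i] & @CRLF)
Next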

https://regex101.com/r/yB3dO1/1
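The pattern above is anchored with ^, so it handles one URL per string. If the goal is to pull the host out of every URL inside a longer text, a rough sketch could use an unanchored, simplified pattern with flag 3 (global match). The sample text and the simplified pattern are assumptions for illustration, not part of the regex above. This sketch requires a protocol and does not trim punctuation that follows a URL.

```autoit
#include <Array.au3>

; Hypothetical input text, for illustration only.
Local $sText = "See http://www.google.com/search and https://forum.example.org:8080/topic " & _
               "or ftp://user:password@files.example.net/pub"

; protocol :// optional user[:password]@ then the host up to "/", ":" or whitespace.
Local $sPattern = '(?i)\b[a-z]+://(?:[^@/\s]+@)?([^/:\s]+)'

; Flag 3 returns the captured group of every match in one flat array.
Local $aHosts = StringRegExp($sText, $sPattern, 3)
If Not @error Then
    _ArrayDisplay($aHosts)
EndIf
```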

Edited by jguinch

Nice jguinch, but you might want to make it case insensitive.


Common sense plays a role in the basics of understanding AutoIt... If you're lacking in that, do us all a favor, and step away from the computer.
