
StringRegExp: get the domain names of all URLs


AutID

As the title says, I am trying to find a pattern that gets all the URLs from different strings.

I want to capture from protocol, which is http in most of the cases, to the suffix of the domain name.

The problem is that URLs use many different domain name suffixes, which makes it a little tricky.

At first I wrote something like this:

#include <Array.au3>
Local $sUrl = "http://www.google.com/(random expanded link)"
; The dot before "com" is escaped so that it matches a literal dot only
Local $aArray = StringRegExp($sUrl, '(?i)http://(.*?)\.com', 2)
If Not @error Then
    _ArrayDisplay($aArray)
EndIf

However, if the suffix is something other than .com, for example .net or .org, this pattern fails.
Then I thought of creating an array of the most popular suffixes and looping over it to get all the domain names, but that would take a lot of code that I could avoid with better regex skills.
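
The suffix-list idea does not need much code if the list is joined into a single regex alternation instead of being looped over. This is only a minimal sketch: the suffix list, the sample URL, and the variable names are made up for illustration.

```autoit
#include <Array.au3>

; Hypothetical suffix list; extend it as needed.
Local $aSuffix[3] = ["com", "net", "org"]
; Join the suffixes into one alternation: com|net|org
Local $sAlt = _ArrayToString($aSuffix, "|")
; Made-up sample URL.
Local $sUrl = "http://www.example.org/some/path"
; The negative lookahead stops the match from ending in the middle of a longer name.
; Flag 1 returns only the captured group(s) of the first match.
Local $aMatch = StringRegExp($sUrl, '(?i)http://([\w.-]+?\.(?:' & $sAlt & '))(?![\w.-])', 1)
If Not @error Then ConsoleWrite($aMatch[0] & @CRLF)
```

Any suffix missing from the list is still not matched, so this stays a workaround rather than a general solution.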

Finally I came up with this pattern, but I am not 100% sure that it captures everything, and sometimes I get some weird results.

$pattern = "(?<Protocol>\w+):\/\/(?<Domain>[\w.]+\/?)\S*"

Does anyone have any better ideas?

 

Edit: hmmm, I tried this simple pattern and it seems to work pretty well.

Local $aArray = StringRegExp($sUrl, '(?i)http://(.*?)/', 2)
Either I'm very tired or it was very simple. Any opinions?
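
One caveat with the pattern above: it requires a slash after the host, so a bare URL such as http://www.server.com does not match. A possible variation, shown only as a sketch, uses a negated character class so that the host ends at the first slash, whitespace, colon, ? or #. The sample text is invented, and the https? part is an added assumption to cover https links as well.

```autoit
#include <Array.au3>

; Made-up sample text containing several URLs.
Local $sText = "See http://www.google.com/search?q=1 and https://example.net and http://forum.example.org:8080/topic"
; [^/\s:?#]+ stops at the first slash, whitespace, colon, ? or #,
; so a URL does not need a trailing slash to match.
; Flag 3 returns the captured group of every match in one flat array.
Local $aHosts = StringRegExp($sText, '(?i)https?://([^/\s:?#]+)', 3)
If Not @error Then
    _ArrayDisplay($aHosts)
EndIf
```

This sketch does not handle a user name or password in front of the host (user@server would be captured as a whole), so it only suits plain http/https links.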
  Edited by AutID

For more complex URLs, you can use something like this regex:

Local $aUrl[9] = ["http://server:12345/path/blabla", _ 
                  "http://server.com:1234/path?query_string#fragment_id", _
                  "ftp://user:password@server:1234/path", _
                  "ftp://user@server:1234/path", _
                  "http://www.server.com", _
                  "www.server.com/path", _
                  "server.com", _
                  "http://user@server.com:1234/path?query_string#fragment_id", _
                  "user@server.com:1234" ]
                  
Local $sPattern = "^(?i)(?:(?:[a-z]+):\/\/)?" & _ ; Protocol
                  "(?:(?:(?:[^@:]+))" & _         ; Username
                  "(?::(?:[^@]+))?@)?" & _        ; Password
                  "([^\/:]+)" & _                 ; Host
                  "(?::(?:\d+))?" & _             ; Port
                  "(?:\/(?:[^?]+)?)?" & _         ; Path
                  "(?:\?\N+)?"                    ; Query


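For $i = 0 To UBound($aUrl) - 1
    Local $aHost = StringRegExp($aUrl[$i], $sPattern, 1)
    ; StringRegExp sets @error when there is no match, so skip that entry
    If @error Then ContinueLoop
    ConsoleWrite($aHost[0] & @TAB & $aUrl[$i] & @CRLF)
Next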

https://regex101.com/r/yB3dO1/1

Edited by jguinch

  • Moderators

Nice jguinch, but you might want to make it case insensitive.

Common sense plays a role in the basics of understanding AutoIt... If you're lacking in that, do us all a favor, and step away from the computer.
