
StringRegExp to filter URLs


jcpetu

Hi people,

I'm trying to get some URLs from a website, and I do it in two phases.

First (thanks to TheXman), I extract all the URLs between href="...". Then I have to loop through each array element returned by StringRegExp in order to reject the URLs I don't need.

I wonder whether there is a way to speed this up with StringRegExp itself, so those unwanted URLs are never captured in the first step.

For instance, if it's possible in the first phase with StringRegExp, I would like to capture all URLs except those containing .png, .ico, .jpeg, .jpg and .css. I was trying to understand how to do it, but StringRegExp is a language by itself.

And, if possible, also filter with StringRegExp the URLs that reference the same domain, so I can reduce the If clause.

Thanks a lot in advance.

$Host = "messi.com"
$site = 'class="contingut_noticies"><div class="tags_noticies"></div><div class="titol_noticies"> <a href="https://jcpe.com/sea-suma-cuatro-goles-en-tres-partidos-ante-el-bayern/">SEA SUMA CUATRO GOLES EN TRES PARTIDOS ANTE EL BAYERN</a></div><div class="desc_noticies"><p>Sea Jcpe suma cuatro goles en tres enfrentamientos contra el Bayern de Múnich en la Liga de Campeones: dos en [&hellip;]</p></div></div></div><div class="post_grid_noticies jcpe_noti_4"><div class="contenidor-zoom-out"><a href="https://jcpe.com/sea-marca-en-la-eliminacion-del-napoli/"><img width="2560" height="2560" src="https://static.jcpe.com/wp-content/uploads/2020/08/Crónica-Napoli-scaled.jpg?v=1596923563" class="img_grid_notis wp-post-image" alt="" srcset="https://static.jcpe.com/wp-content/uploads/2020/08/Crónica-Napoli-scaled.jpg?v=1596923563 2560w, https://static.jcpe.com/wp-content/uploads/2020/08/Crónica-Napoli-300x300.jpg?v=1596923563 300w, https://static.jcpe.com/wp-content/uploads/2020/08/Crónica-Napoli-1024x1024.jpg?v=1596923563 1024w, https://static.jcpe.com/wp-content/uploads/2020/08/Crónica-Napoli-150x150.jpg?v=1596923563 150w, https://static.jcpe.com/wp-content/uploads/2020/08/Crónica-Napoli-768x768.jpg?v=1596923563 768w, https://static.jcpe.com/wp-content/uploads/2020/08/Crónica-Napoli-1536x1536.jpg?v=1596923563 1536w, https://static.jcpe.com/wp-content/uploads/2020/08/Crónica-Napoli-2048x2048.jpg?v=1596923563 2048w, https://static.jcpe.com/wp-content/uploads/2020/08/Crónica-Napoli-75x75.jpg?v=1596923563 75w" sizes="(max-width: 2560px) 100vw, 2560px" /></a></div><div class="contingut_noticies"><div class="tags_noticies"></div><div class="titol_noticies"> <a href="https://jcpe.com/sea-marca-en-la-eliminacion-del-napoli/">SEA JCPE MARCA EN LA CLASIFICACIÓN CONTRA EL NAPOLI</a></div><div class="desc_noticies"><p>Sea Jcpe ha marcado un gol en la victoria del Equipo ante el Napoli por 3-1, que supone la clasificación [&hellip;]</p></div></div></div><div class="post_grid_noticies jcpe_noti_5"><div class="contenidor-zoom-out"><a href="https://jcpe.com/el-equipo-a-por-los-cuartos-de-final-de-la-champions/"><img width="2560" height="2560" src="https://static.jcpe.com/wp-content/uploads/2020/08/Previa-Champions-scaled.jpg?v=1596709556" class="img_grid_notis wp-post-image" alt="" srcset="https://static.jcpe.com/wp-content/uploads/2020/08/Previa-Champions-scaled.jpg?v=1596709556 2560w, https://static.jcpe.com/wp-content/uploads/2020/08/Previa-Champions-300x300.jpg?v=1596709556 300w, https://static.jcpe.com/wp-content/uploads/2020/08/Previa-Champions-1024x1024.jpg?v=1596709556 1024w, https://static.jcpe.com/wp-content/uploads/2020/08/Previa-Champions-150x150.jpg?v=1596709556 150w, https://static.jcpe.com/wp-content/uploads/2020/08/Previa-Champions-768x768.jpg?v=1596709556 768w, https://static.jcpe.com/wp-content/uploads/2020/08/Previa-Champions-1536x1536.jpg?v=1596709556 1536w, https://static.jcpe.com/wp-content/uploads/2020/08/Previa-Champions-2048x2048.jpg?v=1596709556 2048w, https://static.jcpe.com/wp-content/uploads/2020/08/Previa-Champions-75x75.jpg?v=1596709556 75w" sizes="(max-width: 2560px) 100vw, 2560px" /></a></div><div class="contingut_noticies"><div class="tags_noticies"></div><div class="titol_noticies"> <a href="https://jcpe.com/el-equipo-a-por-los-cuartos-de-final-de-la-champions/">EL EQUIPO, A POR LOS CUARTOS DE FINAL DE LA CHAMPION...</a></div><div class="desc_noticies"><p>El Equipo buscará este sábado en el Camp Nou la clasificación para los cuartos de final de la Liga de 
[&hellip;]</p></div></div></div></div></div></div><div class="mas-noticias mes-noticies"> <a href="noticias">Más noticias'

$aUrl = StringRegExp($site, '(?i)href=["''](.*?)["'']', 3)

; with flag 3, element 0 is already the first match, so start at 0
For $i = 0 To UBound($aUrl) - 1
    If (StringInStr($aUrl[$i], ".com") Or StringInStr($aUrl[$i], "www.")) And _
            (Not StringInStr($aUrl[$i], $Host)) Then
        ;filter external domains different from $Host

    ElseIf StringInStr($aUrl[$i], "http") And Not StringInStr($aUrl[$i], $Host) Then
        ;filter external domains different from $Host

    ElseIf StringInStr($aUrl[$i], ".png") Or StringInStr($aUrl[$i], ".ico") Or _
            StringInStr($aUrl[$i], ".jpg") Or StringInStr($aUrl[$i], ".jpeg") Or _
            StringInStr($aUrl[$i], ".css") Then
        ;filter unwanted elements (images, icons, stylesheets)

    ElseIf $aUrl[$i] = "" Or _
            $aUrl[$i] = "/" Or _
            $aUrl[$i] = $Host Or _
            $aUrl[$i] = "http://" & $Host Or _
            $aUrl[$i] = "https://" & $Host Or _
            $aUrl[$i] = "http://www." & $Host Or _
            $aUrl[$i] = "https://www." & $Host Or _
            $aUrl[$i] = $Host & "/" Or _
            $aUrl[$i] = "http://" & $Host & "/" Or _
            $aUrl[$i] = "https://" & $Host & "/" Or _
            $aUrl[$i] = "http://www." & $Host & "/" Or _
            $aUrl[$i] = "https://www." & $Host & "/" Then
        ;filter bare links to the host itself
    EndIf
Next
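
For reference, one way to keep those extensions out of the first phase is a negative lookahead placed right after the opening quote, so a value is only captured when it does not contain .png, .ico, .jpg, .jpeg or .css. This is an untested sketch with a made-up sample string, not the real page:

#include <Array.au3>

; Sketch: the (?! ... ) lookahead rejects an href value before it is captured
; when the value contains one of the unwanted extensions.
Local $sSample = '<a href="https://jcpe.com/news/item-1/">news</a>' & _
        '<link href="https://static.jcpe.com/style.css">' & _
        '<a href="https://static.jcpe.com/wp-content/uploads/logo-192x192.png?v=1">logo</a>'
Local $aClean = StringRegExp($sSample, '(?i)href=["''](?![^"'']*\.(?:png|ico|jpe?g|css)\b)([^"'']+)["'']', 3)
If Not @error Then _ArrayDisplay($aClean, "Links without images/css") ; only the /news/item-1/ link survives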

 


Hi Danp2, yes, I'm sorry, that was a cut-and-paste error; the complete code is:

#include <array.au3>
#include <Debug.au3>
#include <String.au3>
#include "WinHttp.au3"

Local $hOpen = _WinHttpOpen()
If @error Then
    MsgBox(48, "Error", "Error initializing the usage of WinHTTP functions.")
    Exit
EndIf
Local $Host = "messi.com"
Local $hConnect = _WinHttpConnect($hOpen, $Host) ; <- yours here
If @error Then
    MsgBox(48, "Error", "Error specifying the initial target server of an HTTP request.")
    _WinHttpCloseHandle($hOpen)
    Exit
EndIf
Local  $req = _WinHttpOpenRequest($hConnect)
If @error Then
    MsgBox(48, "Error", "Error creating an HTTP request handle.")
    _WinHttpCloseHandle($hConnect)
    _WinHttpCloseHandle($hOpen)
    Exit
EndIf
_WinHttpSendRequest($req)
If @error Then
    MsgBox(48, "Error", "Error sending specified request.")
    _WinHttpCloseHandle($req)
    _WinHttpCloseHandle($hConnect)
    _WinHttpCloseHandle($hOpen)
    Exit
EndIf

_WinHttpReceiveResponse($req) ;------------------------ Wait for the response
If @error Then
    MsgBox(48, "Error", "Error waiting for the response from the server.")
    _WinHttpCloseHandle($req)
    _WinHttpCloseHandle($hConnect)
    _WinHttpCloseHandle($hOpen)
    Exit
EndIf

Local $sChunk, $gsHTML
If _WinHttpQueryDataAvailable($req) Then ;------------- See if there is data to read
    While 1
        $sChunk = _WinHttpReadData($req)
        If @error Then ExitLoop
        $gsHTML &= $sChunk
    WEnd
    ConsoleWrite($gsHTML & @CRLF) ; print to console
    $aUrl = _ArrayUnique(StringRegExp($gsHTML, 'href=(?:"|'')([^"'']+)', 3))
    
    ; with _ArrayUnique, element 0 holds the count, so start at 1
    For $i = 1 To UBound($aUrl) - 1
        If (StringInStr($aUrl[$i], ".com") Or StringInStr($aUrl[$i], "www.")) And _
                (Not StringInStr($aUrl[$i], $Host)) Then
            ;filter external domains different from $Host

        ElseIf StringInStr($aUrl[$i], "http") And Not StringInStr($aUrl[$i], $Host) Then
            ;filter external domains different from $Host

        ElseIf StringInStr($aUrl[$i], ".png") Or StringInStr($aUrl[$i], ".ico") Or _
                StringInStr($aUrl[$i], ".jpg") Or StringInStr($aUrl[$i], ".jpeg") Or _
                StringInStr($aUrl[$i], ".css") Then
            ;filter unwanted elements (images, icons, stylesheets)

        ElseIf $aUrl[$i] = "" Or _
                $aUrl[$i] = "/" Or _
                $aUrl[$i] = $Host Or _
                $aUrl[$i] = "http://" & $Host Or _
                $aUrl[$i] = "https://" & $Host Or _
                $aUrl[$i] = "http://www." & $Host Or _
                $aUrl[$i] = "https://www." & $Host Or _
                $aUrl[$i] = $Host & "/" Or _
                $aUrl[$i] = "http://" & $Host & "/" Or _
                $aUrl[$i] = "https://" & $Host & "/" Or _
                $aUrl[$i] = "http://www." & $Host & "/" Or _
                $aUrl[$i] = "https://www." & $Host & "/" Then
            ;filter bare links to the host itself
        EndIf
    Next
Else
    MsgBox(48, "Error", "Site is experiencing problems.")
EndIf

_WinHttpCloseHandle($req)
_WinHttpCloseHandle($hConnect)
_WinHttpCloseHandle($hOpen)
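
On the other half of the question (keeping only links that belong to your own domain), the host check could also move into the pattern itself. Below is an untested sketch; $Host and the sample HTML are just stand-ins for the values in the script above, \Q...\E keeps the dot in the host from acting as a regex wildcard, and relative links such as href="noticias" would still need separate handling:

#include <Array.au3>

; Sketch: build the host into the pattern so only same-domain links are kept,
; while still skipping image/style extensions. (?:[\w-]+\.)* also accepts
; subdomains such as static.messi.com.
Local $Host = "messi.com" ; stand-in for the $Host used in the script above
Local $sHtml = '<a href="https://www.messi.com/news/">a</a>' & _
        '<a href="https://twitter.com/something">b</a>' & _
        '<a href="https://static.messi.com/logo.png">c</a>'
Local $sPattern = '(?i)href=["''](?![^"'']*\.(?:png|ico|jpe?g|css)\b)' & _
        '((?:https?:)?//(?:[\w-]+\.)*\Q' & $Host & '\E[^"'']*)["'']'
Local $aRaw = StringRegExp($sHtml, $sPattern, 3)
If Not @error Then
    Local $aSame = _ArrayUnique($aRaw)
    _ArrayDisplay($aSame, "Same-domain links") ; only https://www.messi.com/news/ survives
EndIf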

 


I'm using WinHttp because my whole program runs with it. First I retrieve the site content (as I would with _INetGetSource), and then I use a RegExp to extract only the links.

_ArrayUnique(StringRegExp($sresp, 'href=(?:"|'')([^"'']+)', 3))

For instance:

https://static.messi.com/wp-content/uploads/2019/10/cropped-logo--192x192.png
https://static.messi.com/wp-content/uploads/2019/10/cropped-logo--180x180.png

and I would like to avoid capturing links like these, i.e. anything ending in .png, .ico, .jpeg, .jpg or .css.

And, if possible, to capture only the folder part. For example, instead of:

https://static.messi.com/wp-content/uploads/2019/10/cropped-logo--192x192.png

capture only:

https://static.messi.com/wp-content/uploads/2019/10/

that is, everything up to the last /.
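
If the trimming is done after extraction, a greedy group up to the final slash is enough. A small sketch using one of the URLs above:

; Sketch: keep everything up to the LAST "/" of an already extracted URL.
; (.*/) is greedy, so it swallows as much as possible before the final slash.
Local $sUrl = "https://static.messi.com/wp-content/uploads/2019/10/cropped-logo--192x192.png"
Local $sDir = StringRegExpReplace($sUrl, '(.*/).*', '$1')
ConsoleWrite($sDir & @CRLF) ; -> https://static.messi.com/wp-content/uploads/2019/10/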

 


Hi mikell, thanks a lot. For me RegExp is double Dutch; it seems your magic does the trick of skipping the lines that contain png, jpg, ico and css, right? That's part of the idea.

The thing is to capture every href= followed by " or ', taking the value up to the last /, for instance:

1) href="https://site.com/sub1/sub2/bla bla bla......." should bring https://site.com/sub1/sub2/

2) href='site.com/sub1/bla bla bla.......' should bring site.com/sub1/

3) href="https://site.com/sub1/sub2/bla bla bla.png" skip line

And ideally don't repeat lines with the same value; for instance, the following line should be skipped because it's the same as line 1) but with different text after the last / (text text text instead of bla bla bla):

4)  href="https://site.com/sub1/sub2/text text text......."

I hope it's clear and thanks again.
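
As an illustration of those points together (either quote style, skipping the unwanted extensions, capturing up to the last slash, and dropping duplicates), something along these lines might work; the sample is just the four example lines from this post, so it is a sketch rather than a tested solution:

#include <Array.au3>

; Sketch: ["'] accepts either quote style, the lookahead skips values that
; contain .png/.ico/.jpg/.jpeg/.css, the greedy ([^"']*/) group captures up to
; the last slash, and _ArrayUnique drops repeated folders afterwards.
Local $sHtml = 'href="https://site.com/sub1/sub2/bla bla bla......."' & @CRLF & _
        "href='site.com/sub1/bla bla bla.......'" & @CRLF & _
        'href="https://site.com/sub1/sub2/bla bla bla.png"' & @CRLF & _
        'href="https://site.com/sub1/sub2/text text text......."'
Local $aDirs = StringRegExp($sHtml, '(?i)href=["''](?![^"'']*\.(?:png|ico|jpe?g|css)\b)([^"'']*/)[^"'']*["'']', 3)
If Not @error Then
    $aDirs = _ArrayUnique($aDirs)
    _ArrayDisplay($aDirs, "Unique folders") ; https://site.com/sub1/sub2/ and site.com/sub1/
EndIf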

 

 


You can test with the same code I posted above, thanks a lot.


 


mikell, I realized that in some cases the value doesn't end with a slash but directly with the same quote symbol that opened it (I'm sorry), for instance:

1) href="https://site.com/sub1/sub2" --> should bring https://site.com/sub1/sub2. The opening symbol is ".

2) href='site.com/sub1' --> should bring site.com/sub1. The opening symbol is '.

3) href='site.com' --> should bring site.com. It doesn't have any folders, it's only the site, and the opening symbol is '.

For the cases beginning with href=" your solution is fine:

4) href="https://site.com/sub1/sub2/bla bla bla.png" --> it brings https://site.com/sub1/sub2/, which is OK.

But not for the cases beginning with href=':

5) href='https://site.com/sub1/sub2/bla bla bla.png' --> it should bring https://site.com/sub1/sub2/ but it brings nothing.

To summarize:

The expression should capture anything beginning with href=" or href='

and ending at the last slash before the closing symbol " or ' (if a slash exists), as in example 4), or, if there is no slash, the text up to the closing symbol " or ', as in examples 1, 2 and 3.

 

 

 

 


Hi Deye, this brings all http and https URLs regardless of whether they are part of an href= or not. The only references it doesn't bring are those that don't end with a slash. But I can use it either way, accepting that I'll lose some references until I get the silver bullet.

Just another question: some references begin with https:\/. What should I change in your RegExp to catch those as well?

Or perhaps use another RegExp?
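
Deye's pattern isn't quoted in this thread, so purely as a guess at the https:\/ case: those are usually JSON-escaped URLs embedded in the page source, and one option is to allow an optional backslash before each slash and then strip the escaping from whatever was captured. An untested sketch:

; Sketch: accept http(s):// as well as the JSON-escaped http(s):\/\/ form,
; then strip the "\/" escaping from whatever was captured.
Local $sText = 'plain https://site.com/a/b/ and escaped "https:\/\/static.site.com\/img\/logo.png"'
Local $aHits = StringRegExp($sText, '(?i)https?:(?:\\?/){2}[^\s"''<>]+', 3)
If Not @error Then
    For $i = 0 To UBound($aHits) - 1
        $aHits[$i] = StringReplace($aHits[$i], "\/", "/") ; undo the escaping
        ConsoleWrite($aHits[$i] & @CRLF)
    Next
EndIf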


jcpetu,

Regex is not magic, it's logic. So I fear that your multiple requirements are too demanding for this logic.
Just an example:

1) href="https://site.com/sub1/bla bla bla" --> should bring https://site.com/sub1/
2) href="https://site.com/sub1/sub2" --> should bring https://site.com/sub1/sub2

Here "sub2" and "bla bla bla" can be anything, so how do you expect a regex to be able to tell the difference?
This will need to be treated manually.
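
For what it's worth, the quote-handling part on its own is straightforward with a backreference (capture the opening quote and require the same character to close), and an alternation keeps either everything up to the last slash or, when there is no slash, the whole value. What it cannot do, as explained above, is tell a trailing folder name from trailing text, so a value like .../sub1/sub2 still gets trimmed at its last slash. A sketch:

; Sketch: (["']) captures the opening quote and \1 requires the matching quote
; to close the value, so href="..." and href='...' are both handled. The
; alternation keeps everything up to the last slash when there is one,
; otherwise the whole value.
Local $sHtml = "href='https://site.com/sub1/sub2/bla bla bla'" & @CRLF & _
        'href="https://site.com/sub1/sub2"' & @CRLF & _
        "href='site.com'"
Local $aAll = StringRegExp($sHtml, '(?i)href=(["''])([^"'']*/|[^"'']*)[^"'']*\1', 3)
If Not @error Then
    ; with flag 3 the groups come back flat: [quote, value, quote, value, ...]
    For $i = 1 To UBound($aAll) - 1 Step 2
        ConsoleWrite($aAll[$i] & @CRLF)
    Next
EndIf
; Output: https://site.com/sub1/sub2/  then  https://site.com/sub1/  then  site.com
; Note how "sub2" on the second line is trimmed away, which is exactly the
; ambiguity described above; the pattern alone cannot resolve it.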

 

