
StringRegExpReplace to deconcatenate


Hi guys,

I'm trying to extract all the URLs found inside href="..." from a concatenated string, but after several tests with StringRegExpReplace I haven't been able to do it. I'd appreciate any help.

This is what I have so far:

$concatenate = 'class="contingut_noticies"><div class="tags_noticies"></div><div class="titol_noticies"> <a href="https://jcpe.com/sea-suma-cuatro-goles-en-tres-partidos-ante-el-bayern/">SEA SUMA CUATRO GOLES EN TRES PARTIDOS ANTE EL BAYERN</a></div><div class="desc_noticies"><p>Sea Jcpe suma cuatro goles en tres enfrentamientos contra el Bayern de Múnich en la Liga de Campeones: dos en [&hellip;]</p></div></div></div><div class="post_grid_noticies jcpe_noti_4"><div class="contenidor-zoom-out"><a href="https://jcpe.com/sea-marca-en-la-eliminacion-del-napoli/"><img width="2560" height="2560" src="https://static.jcpe.com/wp-content/uploads/2020/08/Crónica-Napoli-scaled.jpg?v=1596923563" class="img_grid_notis wp-post-image" alt="" srcset="https://static.jcpe.com/wp-content/uploads/2020/08/Crónica-Napoli-scaled.jpg?v=1596923563 2560w, https://static.jcpe.com/wp-content/uploads/2020/08/Crónica-Napoli-300x300.jpg?v=1596923563 300w, https://static.jcpe.com/wp-content/uploads/2020/08/Crónica-Napoli-1024x1024.jpg?v=1596923563 1024w, https://static.jcpe.com/wp-content/uploads/2020/08/Crónica-Napoli-150x150.jpg?v=1596923563 150w, https://static.jcpe.com/wp-content/uploads/2020/08/Crónica-Napoli-768x768.jpg?v=1596923563 768w, https://static.jcpe.com/wp-content/uploads/2020/08/Crónica-Napoli-1536x1536.jpg?v=1596923563 1536w, https://static.jcpe.com/wp-content/uploads/2020/08/Crónica-Napoli-2048x2048.jpg?v=1596923563 2048w, https://static.jcpe.com/wp-content/uploads/2020/08/Crónica-Napoli-75x75.jpg?v=1596923563 75w" sizes="(max-width: 2560px) 100vw, 2560px" /></a></div><div class="contingut_noticies"><div class="tags_noticies"></div><div class="titol_noticies"> <a href="https://jcpe.com/sea-marca-en-la-eliminacion-del-napoli/">SEA JCPE MARCA EN LA CLASIFICACIÓN CONTRA EL NAPOLI</a></div><div class="desc_noticies"><p>Sea Jcpe ha marcado un gol en la victoria del Equipo ante el Napoli por 3-1, que supone la clasificación [&hellip;]</p></div></div></div><div 
class="post_grid_noticies jcpe_noti_5"><div class="contenidor-zoom-out"><a href="https://jcpe.com/el-equipo-a-por-los-cuartos-de-final-de-la-champions/"><img width="2560" height="2560" src="https://static.jcpe.com/wp-content/uploads/2020/08/Previa-Champions-scaled.jpg?v=1596709556" class="img_grid_notis wp-post-image" alt="" srcset="https://static.jcpe.com/wp-content/uploads/2020/08/Previa-Champions-scaled.jpg?v=1596709556 2560w, https://static.jcpe.com/wp-content/uploads/2020/08/Previa-Champions-300x300.jpg?v=1596709556 300w, https://static.jcpe.com/wp-content/uploads/2020/08/Previa-Champions-1024x1024.jpg?v=1596709556 1024w, https://static.jcpe.com/wp-content/uploads/2020/08/Previa-Champions-150x150.jpg?v=1596709556 150w, https://static.jcpe.com/wp-content/uploads/2020/08/Previa-Champions-768x768.jpg?v=1596709556 768w, https://static.jcpe.com/wp-content/uploads/2020/08/Previa-Champions-1536x1536.jpg?v=1596709556 1536w, https://static.jcpe.com/wp-content/uploads/2020/08/Previa-Champions-2048x2048.jpg?v=1596709556 2048w, https://static.jcpe.com/wp-content/uploads/2020/08/Previa-Champions-75x75.jpg?v=1596709556 75w" sizes="(max-width: 2560px) 100vw, 2560px" /></a></div><div class="contingut_noticies"><div class="tags_noticies"></div><div class="titol_noticies"> <a href="https://jcpe.com/el-equipo-a-por-los-cuartos-de-final-de-la-champions/">EL EQUIPO, A POR LOS CUARTOS DE FINAL DE LA CHAMPION...</a></div><div class="desc_noticies"><p>El Equipo buscará este sábado en el Camp Nou la clasificación para los cuartos de final de la Liga de [&hellip;]</p></div></div></div></div></div></div><div class="mas-noticias mes-noticies"> <a href="noticias">Más noticias'

$result = StringRegExpReplace($concatenate, "(?i)href=[""'](.*?)[""']|\z;", 3)

_ArrayDisplay($result)

 


1 hour ago, jcpetu said:

$result = StringRegExpReplace($concatenate, "(?i)href=[""'](.*?)[""']|\z;", 3)

Your title and example refer to StringRegExpReplace. Why would you use StringRegExpReplace to extract the hrefs in this particular case? That's not even the correct syntax for StringRegExpReplace; it's the syntax for StringRegExp. So how did you come up with the idea that you needed to use StringRegExpReplace?

Here are a couple of ways that it could be done:

#include <Constants.au3>
#include <String.au3>
#include <Debug.au3>

$gsHTML = 'class="contingut_noticies"><div class="tags_noticies"></div><div class="titol_noticies"> <a href="https://jcpe.com/sea-suma-cuatro-goles-en-tres-partidos-ante-el-bayern/">SEA SUMA CUATRO GOLES EN TRES PARTIDOS ANTE EL BAYERN</a></div><div class="desc_noticies"><p>Sea Jcpe suma cuatro goles en tres enfrentamientos contra el Bayern de Múnich en la Liga de Campeones: dos en […]</p></div></div></div><div class="post_grid_noticies jcpe_noti_4"><div class="contenidor-zoom-out"><a href="https://jcpe.com/sea-marca-en-la-eliminacion-del-napoli/"><img width="2560" height="2560" src="https://static.jcpe.com/wp-content/uploads/2020/08/Crónica-Napoli-scaled.jpg?v=1596923563" class="img_grid_notis wp-post-image" alt="" srcset="https://static.jcpe.com/wp-content/uploads/2020/08/Crónica-Napoli-scaled.jpg?v=1596923563 2560w, https://static.jcpe.com/wp-content/uploads/2020/08/Crónica-Napoli-300x300.jpg?v=1596923563 300w, https://static.jcpe.com/wp-content/uploads/2020/08/Crónica-Napoli-1024x1024.jpg?v=1596923563 1024w, https://static.jcpe.com/wp-content/uploads/2020/08/Crónica-Napoli-150x150.jpg?v=1596923563 150w, https://static.jcpe.com/wp-content/uploads/2020/08/Crónica-Napoli-768x768.jpg?v=1596923563 768w, https://static.jcpe.com/wp-content/uploads/2020/08/Crónica-Napoli-1536x1536.jpg?v=1596923563 1536w, https://static.jcpe.com/wp-content/uploads/2020/08/Crónica-Napoli-2048x2048.jpg?v=1596923563 2048w, https://static.jcpe.com/wp-content/uploads/2020/08/Crónica-Napoli-75x75.jpg?v=1596923563 75w" sizes="(max-width: 2560px) 100vw, 2560px" /></a></div><div class="contingut_noticies"><div class="tags_noticies"></div><div class="titol_noticies"> <a href="https://jcpe.com/sea-marca-en-la-eliminacion-del-napoli/">SEA JCPE MARCA EN LA CLASIFICACIÓN CONTRA EL NAPOLI</a></div><div class="desc_noticies"><p>Sea Jcpe ha marcado un gol en la victoria del Equipo ante el Napoli por 3-1, que supone la clasificación […]</p></div></div></div><div class="post_grid_noticies 
jcpe_noti_5"><div class="contenidor-zoom-out"><a href="https://jcpe.com/el-equipo-a-por-los-cuartos-de-final-de-la-champions/"><img width="2560" height="2560" src="https://static.jcpe.com/wp-content/uploads/2020/08/Previa-Champions-scaled.jpg?v=1596709556" class="img_grid_notis wp-post-image" alt="" srcset="https://static.jcpe.com/wp-content/uploads/2020/08/Previa-Champions-scaled.jpg?v=1596709556 2560w, https://static.jcpe.com/wp-content/uploads/2020/08/Previa-Champions-300x300.jpg?v=1596709556 300w, https://static.jcpe.com/wp-content/uploads/2020/08/Previa-Champions-1024x1024.jpg?v=1596709556 1024w, https://static.jcpe.com/wp-content/uploads/2020/08/Previa-Champions-150x150.jpg?v=1596709556 150w, https://static.jcpe.com/wp-content/uploads/2020/08/Previa-Champions-768x768.jpg?v=1596709556 768w, https://static.jcpe.com/wp-content/uploads/2020/08/Previa-Champions-1536x1536.jpg?v=1596709556 1536w, https://static.jcpe.com/wp-content/uploads/2020/08/Previa-Champions-2048x2048.jpg?v=1596709556 2048w, https://static.jcpe.com/wp-content/uploads/2020/08/Previa-Champions-75x75.jpg?v=1596709556 75w" sizes="(max-width: 2560px) 100vw, 2560px" /></a></div><div class="contingut_noticies"><div class="tags_noticies"></div><div class="titol_noticies"> <a href="https://jcpe.com/el-equipo-a-por-los-cuartos-de-final-de-la-champions/">EL EQUIPO, A POR LOS CUARTOS DE FINAL DE LA CHAMPION...</a></div><div class="desc_noticies"><p>El Equipo buscará este sábado en el Camp Nou la clasificación para los cuartos de final de la Liga de […]</p></div></div></div></div></div></div><div class="mas-noticias mes-noticies"> <a href="noticias">Más noticias'

$gaResult  = StringRegExp($gsHTML, 'href="([^"]+)', $STR_REGEXPARRAYGLOBALMATCH)
If IsArray($gaResult) Then _DebugArrayDisplay($gaResult)

$gaResult  = _StringBetween($gsHTML, 'href="', '"')
If Not @error Then _DebugArrayDisplay($gaResult)
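
To make the difference between the two functions concrete, here is a minimal sketch on a short made-up string ($sSample and the replacement text are only for illustration). StringRegExp with the global-match flag returns an array of captures, while StringRegExpReplace returns a modified string and treats its third parameter as replacement text, not as a flag, which is why passing 3 there does not produce an array:

```autoit
#include <StringConstants.au3>

; Hypothetical sample input, not taken from the real page.
Local $sSample = '<a href="/a">x</a> <a href="/b">y</a>'

; StringRegExp + global match: returns an array holding capture group 1 of every match.
Local $aLinks = StringRegExp($sSample, 'href="([^"]+)"', $STR_REGEXPARRAYGLOBALMATCH)
If IsArray($aLinks) Then
    For $i = 0 To UBound($aLinks) - 1
        ConsoleWrite("Extracted: " & $aLinks[$i] & @CRLF)
    Next
EndIf

; StringRegExpReplace: returns the whole string with each match replaced.
; The third parameter is the replacement text.
Local $sChanged = StringRegExpReplace($sSample, 'href="([^"]+)"', 'href="#"')
ConsoleWrite("Replaced:  " & $sChanged & @CRLF)
```

So for pulling values out of text, StringRegExp is the natural fit; StringRegExpReplace is for rewriting the text itself.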

 

Edited by TheXman
Added _StringBetween() example

TheXman, thanks a lot for your quick response. Sorry for the function mix-up: I was trying both functions, and after a lot of trial and error I confused them.

It works only partially, because it doesn't return all the possible results.

If you search for href= in the view-source of the site you get 85 matches, while your approach returns 72.

Thanks again.

#include <array.au3>
#include <Debug.au3>
#include <String.au3>
#include "WinHttp.au3"

Local $hOpen = _WinHttpOpen()
If @error Then
    MsgBox(48, "Error", "Error initializing the usage of WinHTTP functions.")
    Exit
EndIf
Local $Host = "messi.com"
Local $hConnect = _WinHttpConnect($hOpen, $Host) ; <- yours here
If @error Then
    MsgBox(48, "Error", "Error specifying the initial target server of an HTTP request.")
    _WinHttpCloseHandle($hOpen)
    Exit
EndIf
Local  $req = _WinHttpOpenRequest($hConnect)
If @error Then
    MsgBox(48, "Error", "Error creating an HTTP request handle.")
    _WinHttpCloseHandle($hConnect)
    _WinHttpCloseHandle($hOpen)
    Exit
EndIf
_WinHttpSendRequest($req)
If @error Then
    MsgBox(48, "Error", "Error sending specified request.")
    _WinHttpCloseHandle($req)
    _WinHttpCloseHandle($hConnect)
    _WinHttpCloseHandle($hOpen)
    Exit
EndIf

_WinHttpReceiveResponse($req) ;------------------------ Wait for the response
If @error Then
    MsgBox(48, "Error", "Error waiting for the response from the server.")
    _WinHttpCloseHandle($req)
    _WinHttpCloseHandle($hConnect)
    _WinHttpCloseHandle($hOpen)
    Exit
EndIf

Local $sChunk, $gsHTML
If _WinHttpQueryDataAvailable($req) Then ;------------- See if there is data to read
    While 1
        $sChunk = _WinHttpReadData($req)
        If @error Then ExitLoop
        $gsHTML &= $sChunk
    WEnd
    ConsoleWrite($gsHTML & @CRLF) ; print to console
    $gaResult = StringRegExp($gsHTML, 'href="([^"]+)', $STR_REGEXPARRAYGLOBALMATCH)
    If IsArray($gaResult) Then _DebugArrayDisplay($gaResult)
Else
    MsgBox(48, "Error", "Site is experiencing problems.")
EndIf

_WinHttpCloseHandle($req)
_WinHttpCloseHandle($hConnect)
_WinHttpCloseHandle($hOpen)
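
A quick way to reproduce this kind of comparison without a browser is to count the raw href= occurrences and compare that with the number of regex matches. The sketch below uses a made-up sample string ($sSample is hypothetical) and relies on StringReplace reporting the number of replacements in @extended:

```autoit
#include <StringConstants.au3>

; Hypothetical sample: two double-quoted hrefs and one single-quoted href.
Local $sSample = '<a href="/a">x</a><a href=''/b''>y</a><link href="/c">'

; Replacing "href=" with itself leaves the text unchanged,
; but @extended then holds the number of occurrences found.
StringReplace($sSample, 'href=', 'href=')
Local $iTotal = @extended

; Count only the double-quoted hrefs, as the pattern above does.
Local $aDouble = StringRegExp($sSample, 'href="([^"]+)', $STR_REGEXPARRAYGLOBALMATCH)
Local $iDouble = IsArray($aDouble) ? UBound($aDouble) : 0

ConsoleWrite("href= occurrences:     " & $iTotal & @CRLF)
ConsoleWrite("double-quoted matches: " & $iDouble & @CRLF)
```

A gap between the two numbers suggests that some hrefs use a form the pattern does not cover.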

 

 

 


My example was as accurate as the data you provided. The discrepancy arises because the website you referenced (messi.com) has some hrefs enclosed in double quotes and others in single quotes. I only looked for double quotes because that is what was in the data you provided.

Also, my example was given to point you in the right direction, not to give you a fully working solution.

Since this is kind of a weird one, here's an example that will get both:

#include <Constants.au3>
#include <InetConstants.au3>
#include <Debug.au3>

$gsHTML = InetRead("https://messi.com", $INET_FORCEBYPASS)
If @error Then Exit MsgBox($MB_ICONERROR, "ERROR", "Unable to retrieve website")

$gsHTML = BinaryToString($gsHTML)

$gaResult  = StringRegExp($gsHTML, 'href=(?:"|'')([^"'']+)', $STR_REGEXPARRAYGLOBALMATCH)
If IsArray($gaResult) Then _DebugArrayDisplay($gaResult)
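
One caveat with the character class [^"']: a double-quoted value that itself contains an apostrophe would be cut short at the apostrophe. A possible variant, sketched here on a made-up string ($sSample is hypothetical), uses a PCRE branch-reset group (?|...) so that each alternative requires the matching closing quote while still capturing into group 1:

```autoit
#include <StringConstants.au3>

; Hypothetical sample: one double-quoted, one single-quoted,
; and one double-quoted value that contains an apostrophe.
Local $sSample = '<a href="https://example.com/a">A</a>' & _
        '<a href=''https://example.com/b''>B</a>' & _
        '<a href="/it''s">C</a>'

; (?| ... ) resets group numbering per alternative, so both branches capture into group 1.
; Inside an AutoIt single-quoted string, '' stands for one literal apostrophe.
Local $sPattern = 'href=(?|"([^"]*)"|''([^'']*)'')'

Local $aLinks = StringRegExp($sSample, $sPattern, $STR_REGEXPARRAYGLOBALMATCH)
If IsArray($aLinks) Then
    For $i = 0 To UBound($aLinks) - 1
        ConsoleWrite($aLinks[$i] & @CRLF)
    Next
EndIf
```

With this sample, the third link comes back as /it's rather than /it. Unquoted href values are still not covered by either pattern.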

 

Edited by TheXman

I updated my previous post with a more accurate example based on your actual data.

Edited by TheXman

Hi people, in some cases I need unique records so I apply _ArrayUnique to the resulting array. Is there any option to StringRegExp to get unique records and avoid using _ArrayUnique?

$gaResult = StringRegExp($gsHTML, 'href=(?:"|'')([^"'']+)', $STR_REGEXPARRAYGLOBALMATCH)
If IsArray($gaResult) Then
    $gaResult = _ArrayUnique($gaResult)
    _DebugArrayDisplay($gaResult)
EndIf

 


3 hours ago, jcpetu said:

Is there any option to StringRegExp to get unique records and avoid using _ArrayUnique?

A regular expression that returns only unique hrefs is certainly possible. One way would be to use a negative lookahead. But compared to using _ArrayUnique(), it would be much slower and less efficient because of all the backtracking the regular expression engine would need to do. _ArrayUnique() uses a scripting dictionary to remove duplicates, which is lightning fast compared to most other AutoIt methods, assuming you are dealing with a 1D or 2D array and only need to remove duplicates based on a single column.

Why do you want to avoid using _ArrayUnique()?
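
For illustration, here is a minimal sketch of both approaches on a made-up string ($sSample is hypothetical, and only double-quoted hrefs are handled to keep the pattern short). The lookahead version keeps a match only when the same href does not occur again later in the text, so every match has to scan the rest of the input:

```autoit
#include <Array.au3>
#include <StringConstants.au3>

; Hypothetical sample with one duplicate href.
Local $sSample = '<a href="/a">1</a><a href="/b">2</a><a href="/a">3</a>'

; Regex-only: the negative lookahead rejects an href if the identical
; href="..." appears anywhere later, so only the last occurrence survives.
; (?s) lets . match newlines so the lookahead can scan the whole text.
Local $aRegexOnly = StringRegExp($sSample, '(?s)href="([^"]+)"(?!.*href="\1")', $STR_REGEXPARRAYGLOBALMATCH)

; Usual approach: collect everything, then remove duplicates.
Local $aAll = StringRegExp($sSample, 'href="([^"]+)"', $STR_REGEXPARRAYGLOBALMATCH)
Local $aUnique = _ArrayUnique($aAll) ; element [0] holds the count by default

ConsoleWrite("Regex only:   " & _ArrayToString($aRegexOnly, ", ") & @CRLF)
ConsoleWrite("_ArrayUnique: " & _ArrayToString($aUnique, ", ", 1) & @CRLF)
```

Note that the two results differ in order: the lookahead keeps last occurrences (/b, /a), while _ArrayUnique keeps first occurrences (/a, /b) and, by default, stores the element count at index 0.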

Edited by TheXman
