clean up both right and left of url?


youtuber
I want to strip the characters to the left and right of each URL. I've tried a few patterns, but none of them worked.

Local $aURLs[6] = ["-_ http://autoitscript.com---", _
        " https://www.autoitscript.com => ", _
        "1-http://www.autoitscript.com", _
        "- www.autoitscript.com -", _
        "- www.autoitscript.org - _", _
        "-#$%& www.autoitscript.net -"]

For $i = 0 To 5
    $RegExp = StringRegExpReplace($aURLs[$i], 'How should the pattern be?','')
    $RegExp = StringRegExpReplace($aURLs[$i], '','How should the pattern be?')
    ConsoleWrite($RegExp & @CRLF)
Next

 

I did a sample test, but I failed

$pattern = '(.com|\.net|\.org)(.*)'
$pattern2 = '(*.)(http://|https://|www.)'

$aURLs = "-_ http://autoitscript.com---" & @CRLF & _
        " https://www.autoitscript.com => " & @CRLF & _
        "1-http://www.autoitscript.com" & @CRLF & _
        "- www.autoitscript.com -" & @CRLF & _
        "- www.autoitscript.org - _" & @CRLF & _
        "-#$%& www.autoitscript.net -"

$RegExp = StringRegExpReplace($aURLs, $pattern,'$1')
$RegExp = StringRegExpReplace($aURLs, $pattern2,'$1')
    ConsoleWrite($RegExp & @CRLF)

 

Edited by youtuber

Ordinarily I would ask to see some of your attempts to help you understand where you were having an issue.  That's because I prefer to help you learn rather than to just give you solutions.  But I'm bored at the moment.  :muttley:

Here are just a couple of ways to do it.

example1()
example2()

Func example1()
    Local $aURLs[6] = [ _
                      "-_ http://autoitscript.com---", _
                      " https://www.autoitscript.com => ", _
                      "1-http://www.autoitscript.com", _
                      "- www.autoitscript.com -", _
                      "- www.autoitscript.org - _", _
                      "-#$%& www.autoitscript.net -" _
                      ]

    ConsoleWrite("Example1" & @CRLF)
    For $i = 0 To 5
        $RegExp = StringRegExpReplace($aURLs[$i], ".*?((?:https?://|www).*?[.](?:com|net|org)).*","\1")
        ConsoleWrite($RegExp & @CRLF)
    Next
EndFunc

Func example2()
    Local $aResult[0]
    Local $aURLs[6] = [ _
                      "-_ http://autoitscript.com---", _
                      " https://www.autoitscript.com => ", _
                      "1-http://www.autoitscript.com", _
                      "- www.autoitscript.com -", _
                      "- www.autoitscript.org - _", _
                      "-#$%& www.autoitscript.net -" _
                      ]

    ConsoleWrite("Example2" & @CRLF)
    For $i = 0 To 5
        $aResult = StringRegExp($aURLs[$i], "(?:https?://|www).*?[.](?:com|net|org)", 1)
        If IsArray($aResult) Then ConsoleWrite($aResult[0] & @CRLF)
    Next
EndFunc

 

Edited by TheXman

Well, what happens if the URLs are more varied? :D

Local $aURLs[6] = [ _
                      "-_ http://www.international.in---", _
                      "- https://www.communications.com => ", _
                      "1-http://www.networksupport.net", _
                      "---     www.organizasion.org -", _
                      "- www.information.info - _", _
                      "-#$%& www.autoitscript.com -" _
                      ]

 


I'm not going to play that game with you.  :naughty:  

I pointed you in the right direction.  Now it is time for you to put in a little effort.

If you encounter an obstacle, then come back with your attempt(s), clearly state what your issue is, provide your code or a workable example, and someone will probably help you.


More fun with pipes, but I can make up roughly 400 ways to make it fail. I fear the problem is not yet clearly defined. Also, you should totally be showing us what you've tried.

Local $aURLs[6] = [ _
                      "-_ http://www.international.in---", _
                      "- https://www.communications.com => ", _
                      "1-http://www.networksupport.net", _
                      "---     www.organizasion.org -", _
                      "- www.information.info - _", _
                      "-#$%& www.autoitscript.com -" _
                      ]


for $i = 0 to ubound($aURLs) - 1
    msgbox(0, '' , stringregexp($aURLs[$i] ,"(h.*?www\..*?\.\w\w+|www\..*?\.\w\w+)" , 3)[0])
next

 

Edited by iamtheky


I'm not trying anything tricky; I'm looking for the best way to extract URLs from a complex HTML or text file.
I know the patterns so far won't meet my needs once the input gets more complicated :(
Do you think this is the best way?

(h.*?www\..*?\.\w\w+|www\..*?\.\w\w+)

I wonder if @mikell has an idea for us

@iamtheky it really is not my fault that url addresses can be similar to this

Local $aURLs[7] = [ _
                      "-_ http://www.international.in.us---", _
                      "- https://www.communications.com.fr => ", _
                      "1-http://www.networksupport.net.us", _
                      "---     www.organizasion.org -", _
                      "- www.information.info - _", _
                      "-#$%& www.autoitscript.com -", _
                      "- https://www.autoit-script.com.fr/ -" _
                      ]

 

Edited by youtuber

Aren't the previous answers correct? Personally, I think they are.
You can always make a regex fail, which is why I completely agree with what iamtheky said.
BTW, this one

$RegExp = StringRegExpReplace($aURLs[$i], '.*?((?:https?://|www)[.\w+]+).*', "$1")

is nothing but a mix of the previous ones. It works ... and obviously may fail, depending on the addresses, the context, etc
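
As a small illustration of "may fail, depending on the addresses", here is a sketch that runs the pattern above against the hyphenated host name from the earlier sample data. The variable names and the second pattern (the same expression with "-" added to the character class) are assumptions made for this example, not something proposed in the thread.

```autoit
; Sketch: the pattern above applied to a hyphenated host name.
Local $sInput = "- https://www.autoit-script.com.fr/ -"

; The class [.\w+] contains no "-", so the capture stops at the hyphen.
Local $sOriginal = StringRegExpReplace($sInput, '.*?((?:https?://|www)[.\w+]+).*', "$1")
ConsoleWrite("original: " & $sOriginal & @CRLF)

; Assumed variant: same pattern, with "-" added to the class.
Local $sVariant = StringRegExpReplace($sInput, '.*?((?:https?://|www)[-.\w+]+).*', "$1")
ConsoleWrite("variant:  " & $sVariant & @CRLF)
```

The variant keeps the full host name but still drops the trailing "/", so it is only one step further along, not a general fix.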


This looks like one clean way of doing it:

#include <Array.au3> ; needed for _ArrayDisplay

; $aURLs is the array from the earlier posts
For $i = 0 To UBound($aURLs) - 1
    $aURLs[$i] = StringTrimLeft($aURLs[$i], _by($aURLs[$i]))
    $aURLs[$i] = StringTrimRight($aURLs[$i], _by(StringReverse($aURLs[$i])))
Next

_ArrayDisplay($aURLs)

Func _by($sValue)
    Local $aRet = StringRegExp($sValue, '(^.?[\W_]+)\w()', 3)
    If Not @error Then Return StringLen($aRet[0])
EndFunc

Edit:
a small fix on an extra space introduced when running TheXman's (next post) array example

Edited by Deye

22 hours ago, youtuber said:

I'm not trying anything but I'm looking for the best way to extract url addresses in a complex html or txt file.

Here is one more example that uses an RFC 3986 compliant character set. This regular expression will not handle URLs that do not start with "http://", "https://", or "www.". For example, it will not find "autoitscript.com" but it will find "https://autoitscript.com". Like others have said, you have not adequately defined what, EXACTLY, you are looking for. Without a specific, all-encompassing definition, all of our suggestions may miss certain cases. All of the suggestions are based on the data that you provided. Maybe you can help us help you by providing one of your "complex" html or text files so that we can see what you are working with. This back-and-forth, what-if way of getting to your solution is a waste of time and effort.

 

#include <Constants.au3>
#include <Array.au3>

example()
Func example()
    Local $aResult[0]
    Local $aURLs = [ _
                   "-_ http://autoitscript.com---"       , " https://www.autoitscript.com => ", _
                   "1-http://www.autoitscript.com"       , "- https://autoitscript.com -", _
                   "- www.autoitscript.com -"            , "- www.autoitscript.org - _", _
                   "- www.autoitscript.org... - _"       , "-#$%& www.autoitscript.net -", _
                   "-_ http://www.international.in---"   , "- https://www.communications.com => ", _
                   "1-http://www.networksupport.net"     , "---     www.organizasion.org -", _
                   "- www.information.info - _"          , "-#$%& www.autoitscript.com -", _
                   "-_ http://www.international.in.us---", "- https://www.communications.com.fr => ", _
                   "1-http://www.networksupport.net.us"  , "---     www.organizasion.org -", _
                   "- www.information.info/test.html - _", _
                   "-#$%& www.autoitscript.com/this&20is%20a%20test.html -", _
                   "-#$%& https://www.autoitscript.com/this&20is%20a%20test.html -", _
                   "- https://www.autoit-script.com.fr/ -" _
                   ]

    ;Parse URLs using RFC 3986 Compliant Character Set
    $aResult = StringRegExp( _
                   _ArrayToString($aURLs, @CRLF), _
                   "(?i)\b(?:https?://|www\.)[-A-Z0-9+&@#/%=~_$?!:,.]*[A-Z0-9+&@#/%=~_$]", _
                   $STR_REGEXPARRAYGLOBALMATCH)
    If IsArray($aResult) Then _ArrayDisplay($aResult)
EndFunc
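
To make the stated limitation concrete, the following sketch applies the same pattern to three short inputs: a bare domain, the same domain with a scheme, and one entry with trailing dots taken from the array above. The variable names are assumptions for this example only.

```autoit
#include <StringConstants.au3>

; Sketch: same pattern as above, applied to individual strings.
Local $sPattern = "(?i)\b(?:https?://|www\.)[-A-Z0-9+&@#/%=~_$?!:,.]*[A-Z0-9+&@#/%=~_$]"
Local $aInputs = ["- autoitscript.com -", "- https://autoitscript.com -", "- www.autoitscript.org... - _"]
Local $aMatch

For $i = 0 To UBound($aInputs) - 1
    $aMatch = StringRegExp($aInputs[$i], $sPattern, $STR_REGEXPARRAYGLOBALMATCH)
    If @error Then
        ; No "http://", "https://" or "www." prefix, so nothing is found.
        ConsoleWrite($aInputs[$i] & "  ->  no match" & @CRLF)
    Else
        ; The final class has no ".", so trailing dots are left out of the match.
        ConsoleWrite($aInputs[$i] & "  ->  " & $aMatch[0] & @CRLF)
    EndIf
Next
```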

 

Edited by TheXman
Removed "|" from RFC 3986 character set in regular expression

29 minutes ago, youtuber said:

It's really great, thank you, this is a very good pattern.
But what is the difference between the two versions, and which one should I use?

You're welcome.

You should use the most current version. As stated in the edit note on my post, I removed the "|" symbol from the set of characters that are valid in a URL. It was added in error. I just corrected it to make it match the spec, or at least to make it match the character set as closely as possible.

If you look at my previous post, you will also see that I created a hyperlink to the RFC.

 


Edited by TheXman

As already suggested, there will always be something that makes any pattern fail :yawn:

Add an extra "&" to the end and it flunks
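
A minimal sketch of that failure, assuming "it" refers to the RFC 3986 pattern posted earlier in the thread. The input string is a made-up example.

```autoit
#include <StringConstants.au3>

; Sketch: "&" is allowed in both character classes of the pattern,
; so a stray "&" directly after the URL is kept as part of the match.
Local $sPattern = "(?i)\b(?:https?://|www\.)[-A-Z0-9+&@#/%=~_$?!:,.]*[A-Z0-9+&@#/%=~_$]"
Local $sInput = "- https://www.autoitscript.com& -" ; hypothetical input
Local $aMatch = StringRegExp($sInput, $sPattern, $STR_REGEXPARRAYGLOBALMATCH)
If Not @error Then ConsoleWrite($aMatch[0] & @CRLF)
```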

I believe the input cannot be handled so cleanly when the match runs in only one direction.
So the idea of treating the two ends separately might still be the better way.

Still, my example needed some extra hardening:

#include <Array.au3> ; _ArrayDisplay

Local $aURLs = [ _
        "&##$%&http://www.networksupport.net.us&##$%& - _", _
        " =https://www.autoitscript.com/forum/topic/195819-clean-up-both-right-and-left-of-url/?tab=comments#comment-1403743&##$%&", " https://www.autoitscript.com => ", _
        "- www.information.info/test.html - _", _
        "-#$%& www.autoitscript.com/this&20is%20a%20test.html -", _
        "- https://www.autoit-script.com.fr&##$%&##$%&##$##$%&" _
        ]
        
For $i = 0 To UBound($aURLs) - 1
    $aURLs[$i] = StringTrimLeft($aURLs[$i], _by($aURLs[$i]))
    $aURLs[$i] = StringTrimRight($aURLs[$i], _by(StringReverse($aURLs[$i])))
Next

_ArrayDisplay($aURLs)

Func _by($sValue)
    Local $aRet = StringRegExp($sValue, '(^.?\d?[\W_]+)\w()', 3)
    If Not @error Then Return StringLen($aRet[0])
EndFunc

 

Edited by Deye

41 minutes ago, Deye said:

Add an extra "&" to the end and it flunks

Yes, as previously stated, there are many edge cases that would break the regular expression that I provided.

 

On 9/22/2018 at 3:57 PM, youtuber said:

I'm looking for the best way to extract url addresses in a complex html or txt file.

The original poster said that he was trying to parse URLs from "complex" html or text files. Not sure what a "complex" html or text file is, but your solution appears to rely on the input being an array of pre-parsed data. That means that your solution is not viable at all, if run against a file, without additional parsing. My last example assumes that it would be run against a file, not an array. It also successfully parsed out all of the examples that had been supplied up to the point at which I posted it.
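
For what running the pattern against a file might look like, here is a minimal sketch. The file name is hypothetical and the error handling is an assumption added for this example; the pattern itself is the one from my earlier post.

```autoit
#include <Array.au3>
#include <StringConstants.au3>

; Hypothetical path; point this at the file to be scanned.
Local $sFile = @ScriptDir & "\input.html"

; Read the whole file into one string.
Local $sData = FileRead($sFile)
If @error Then
    ConsoleWrite("Could not read " & $sFile & " (or the file is empty)" & @CRLF)
    Exit 1
EndIf

; Global match: returns every URL-like substring found in the text.
Local $aResult = StringRegExp($sData, _
        "(?i)\b(?:https?://|www\.)[-A-Z0-9+&@#/%=~_$?!:,.]*[A-Z0-9+&@#/%=~_$]", _
        $STR_REGEXPARRAYGLOBALMATCH)
If IsArray($aResult) Then
    _ArrayDisplay($aResult, "URLs found")
Else
    ConsoleWrite("No URLs found" & @CRLF)
EndIf
```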


TheXman,

Your examples were reading from an array, but I see your full intention now, and it is in the OP as well. I guess some of us get distracted easily at times.

These are just examples, but to emphasize the original intention, the code could have been written like so:

#include <Array.au3>           ; _ArrayDisplay
#include <StringConstants.au3> ; $STR_REGEXPARRAYGLOBALMATCH

Local $sData = ' - www.autoitscript.org... - _-#$%&  _www.autoitscript.net -,' & @CRLF & _
        ' -_ http://www.international.in---- https://www.communications.com => , _' & @CRLF & _
        ' 1-http://www.networksupport.net&##$%--- www.organizasion.org -, _' & @CRLF & _
        ' - www.information.info - _-#$%& www.autoitscript.com -, _' & @CRLF & _
        ' -_ http://www.international.in.us&##-##$%&--- https://www.communications.com.fr&##$%$##$%& => , _' & @CRLF & _
        ' 1-http.networksupport.net.us- w-- _' & @CRLF & _
        ' - www.information.info/test.html, - _ _' & @CRLF & _
        ' -#$%& www.autoitscript.com/this&20is%20a%20test.html _-, ' & @CRLF & _
        ' -#$%& https://www.autoitscript.com/this&20is%20a%20test.html -&##$%&####$%&https://www.autoitscript.com/forum/topic/?tab=comments#comment-1403807&%$#3546737$aResult, _' & @CRLF & _
        ' $aResult- https://www.autoit-script.com.fr/ _ -$aResult' & @CRLF & _
        ' ]' & @CRLF

Local $aResult = StringRegExp($sData, "(?i)\b(?:https?://|www\.)[-A-Z0-9+&@#/%=~_$?!:,.]*[A-Z0-9+&@#/%=~_$]", _
        $STR_REGEXPARRAYGLOBALMATCH)
_ArrayDisplay($aResult)

; Yet Another Example
$aResult = StringRegExp($sData, '(?i)(?:https?://|www\.)+[\w.?+=&%@#!:\-/]+\w', 3) ; reuse the variable declared above
$aResult[1] &= "          <=   Previously missed "
_ArrayDisplay($aResult)

Deye
