
extract certain URLs from file using StringRegExp


hamohd70

Go to solution Solved by mikell,


here is the file..

I think you missed something :)


here is the file..

 

I think you forgot the file :ermm:


To extract the lines containing URLs:

$aLines = StringRegExp($sContent, "(?mi)(\N*(?:(?:http)|(?:https)|(?:rtmp)|(?:rtmps)):\N*)", 3)
_ArrayDisplay($aLines)

Sorry, I forgot to attach the file. :sweating:

Thanks, jguinch, for the snippet.

I did it this way:

$file = FileOpenDialog("Select your file", @DesktopDir, "Text document (*.txt)|All files (*.*)")
$data = FileRead($file)

; The counters must exist before they are incremented in the loop
$upscount = 0
$downscount = 0

$lines = StringSplit($data, @LF)
If IsArray($lines) Then
    $linecount = $lines[0]
Else
    $linecount = 0
EndIf

For $i = 1 To $linecount
    ; Strip leading whitespace and any double quotes from the current line
    $txt = StringStripWS(FileReadLine($file, $i), 1)
    If StringInStr($txt, '"') Then $txt = StringReplace($txt, '"', "")
    If StringRegExp($txt, "(?mi)(\N*(?:(http)|(https)|(rtmp)|(rtmps)):\N*)") Then
        ; "https" and "rtmps" begin with "http" and "rtmp", so two checks cover all four schemes
        If StringInStr($txt, "http") Then
            $txt = StringMid($txt, StringInStr($txt, "http"))
        ElseIf StringInStr($txt, "rtmp") Then
            $txt = StringMid($txt, StringInStr($txt, "rtmp"))
        EndIf
        $upscount = $upscount + 1
        ConsoleWrite($txt & @CRLF)
    EndIf
    If StringInStr($txt, '#EXT') Then
        $downscount = $downscount + 1
    EndIf
Next
Exit

It works just fine. Any comments on making it more efficient are welcome!

boc.txt


Thanks, mikell, for the simplified expression.

Maybe we can also exclude single quotes?

$aRes = StringRegExp($text, '(?:http|rtmp)s?[^"''\r\n]+', 3)
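
As a rough illustration of how the expression above might be used end to end, here is a minimal sketch. The sample text is hypothetical (it is not the content of boc.txt), and the variable names $sText and $aUrls are assumptions made for this example only.

```autoit
; Hypothetical sample text standing in for the attached file.
Local $sText = '#EXTINF:-1,Channel One' & @CRLF & _
        'http://example.com/one.m3u8' & @CRLF & _
        "logo='https://example.com/logo.png'" & @CRLF & _
        'rtmp://example.com/live/two' & @CRLF

; Flag 3 returns every match in a 0-based array.
; The doubled '' inside the single-quoted string is a literal single quote.
Local $aUrls = StringRegExp($sText, '(?:http|rtmp)s?[^"''\r\n]+', 3)
If @error Then
    ConsoleWrite('No URL found' & @CRLF)
Else
    For $i = 0 To UBound($aUrls) - 1
        ConsoleWrite($aUrls[$i] & @CRLF)
    Next
EndIf
```

With this sample, the single quote in the character class is what stops the second URL before the closing quote. Checking @error before using the result avoids indexing a non-array when nothing matches.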

  • Solution

That doesn't seem very useful, as there are no single quotes in the OP's file.

BTW, with this file, this one is fun too:

#Include <Array.au3>

$text = FileRead("boc.txt")

$aRes = StringRegExp($text, '(?m)^(?:.*?(?<=\A|\v{4}|,|el=)([^":,\r\n]+))|(?:http|rtmp|rtsp)s?[^"\r\n]+', 3)

Dim $a[UBound($aRes)/2][2]
For $i = 0 to UBound($aRes)-1
    If Mod($i, 2) = 0 Then
         $a[$i/2][0] = $aRes[$i]
    Else
         $a[($i-1)/2][1] = $aRes[$i]
    EndIf
Next
_ArrayDisplay($a)

It is quite fragile if the file changes, so it is intended for playing only  :)

Edit: simplified the expression

Interesting code.

Can you please explain the StringRegExp part?

$aRes = StringRegExp($text, '(?m)^(?:.*?(?<=\A|\v{4}|,|el=)([^":,\r\n]+))|(?:http|rtmp|rtsp)s?[^"\r\n]+', 3)

$aRes = StringRegExp($text, '(?mx) ^   (?:.*? (?<=\A|\v{4}|,|el=) ([^":,\v]+) )   |   (?:http|rtmp|rtsp)s?[^"\v]+'  , 3)

 

(?m) is the multiline option: it allows ^ to match at the start of each line.

The main | makes the regex match either a channel or a URL.

First part (channels):
(?:.*? lazily consumes everything up to the capturing group ([^":,\v]+), a run of characters defined by the character class. That run must be preceded (lookbehind (?<= ) by either \A (beginning of the text), \v{4} (i.e. @CRLF & @CRLF), a comma, or el=

This part returns the capturing group.

Second part (URLs):
It matches http, rtmp, or rtsp, then an optional s, then the sequence defined by the character class [^"\v]+
Because this part has no capturing group, the regex returns the whole match.
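
To make the "capturing group versus whole match" point concrete, here is a small sketch on a deliberately tiny input. The sample string and variable names are hypothetical and unrelated to boc.txt. It only covers the two simple cases (a pattern with a group, a pattern without one) plus the (?x) spacing used in the commented pattern above.

```autoit
Local $sSample = 'a=1, b=2' ; hypothetical input, not taken from boc.txt

; With a capturing group, flag 3 returns only what the group captured.
Local $aGroups = StringRegExp($sSample, '(\w)=\d', 3)
ConsoleWrite($aGroups[0] & ' | ' & $aGroups[1] & @CRLF)

; Without any capturing group, flag 3 returns each whole match.
Local $aWhole = StringRegExp($sSample, '\w=\d', 3)
ConsoleWrite($aWhole[0] & ' | ' & $aWhole[1] & @CRLF)

; (?x) ignores unescaped whitespace in the pattern, so it can be spaced out for reading.
Local $aSpaced = StringRegExp($sSample, '(?x) \w = \d', 3)
ConsoleWrite($aSpaced[0] & ' | ' & $aSpaced[1] & @CRLF)
```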
 

Edit

For details and more definitions, please have a look at the help file, where jchd wrote a very nice chapter explaining StringRegExp.

:)

 


For anyone interested, you can read a correct, step-by-step, plain-English breakdown of a PCRE pattern by pasting it there.

Applied to the above pattern, you get this:

/(?m)^(?:.*?(?<=\A|\v{4}|,|el=)([^":,\r\n]+))|(?:http|rtmp|rtsp)s?[^"\r\n]+/

    1st Alternative: (?m)^(?:.*?(?<=\A|\v{4}|,|el=)([^":,\r\n]+))
        (?m) Match the remainder of the pattern with the following options:
            m modifier: multi-line. Causes ^ and $ to match the begin/end of each line (not only begin/end of string)
        ^ assert position at start of a line
        (?:.*?(?<=\A|\v{4}|,|el=)([^":,\r\n]+)) Non-capturing group
            .*? matches any character (except newline)
                Quantifier: Between zero and unlimited times, as few times as possible, expanding as needed [lazy]
            (?<=\A|\v{4}|,|el=) Positive Lookbehind - Assert that the regex below can be matched
                1st Alternative: \A
                    \A assert position at start of the string
                2nd Alternative: \v{4}
                    \v{4} matches any vertical whitespace character
                        Quantifier: Exactly 4 times
                3rd Alternative: ,
                    , matches the character , literally
                4th Alternative: el=
                    el= matches the characters el= literally (case sensitive)
            1st Capturing group ([^":,\r\n]+)
                [^":,\r\n]+ match a single character not present in the list below
                    Quantifier: Between one and unlimited times, as many times as possible, giving back as needed [greedy]
                    ":, a single character in the list ":, literally (case sensitive)
                    \r matches a carriage return (ASCII 13)
                    \n matches a line-feed (newline) character (ASCII 10)
    2nd Alternative: (?:http|rtmp|rtsp)s?[^"\r\n]+
        (?:http|rtmp|rtsp) Non-capturing group
            1st Alternative: http
                http matches the characters http literally (case sensitive)
            2nd Alternative: rtmp
                rtmp matches the characters rtmp literally (case sensitive)
            3rd Alternative: rtsp
                rtsp matches the characters rtsp literally (case sensitive)
        s? matches the character s literally (case sensitive)
            Quantifier: Between zero and one time, as many times as possible, giving back as needed [greedy]
        [^"\r\n]+ match a single character not present in the list below
            Quantifier: Between one and unlimited times, as many times as possible, giving back as needed [greedy]
            " a single character in the list " literally (case sensitive)
            \r matches a carriage return (ASCII 13)
            \n matches a line-feed (newline) character (ASCII 10)

Actually you get even more if you leave colorizing ON.

Flag 3 of StringRegExp (return all matches) can be simulated by typing g in the modifier input.

You can use the very useful "regex debugger" tool to follow stepwise how the engine proceeds through subject and pattern.
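
The breakdown above notes several times that the literals are matched case-sensitively. The following sketch shows what that means in practice and how the (?i) option, used in the first pattern of this thread, changes it. The sample input and variable names are hypothetical.

```autoit
Local $sSample = 'HTTP://EXAMPLE.COM/A' & @CRLF & 'http://example.com/b' & @CRLF ; hypothetical input

; Case-sensitive, as in the breakdown: only the lowercase scheme matches.
Local $aStrict = StringRegExp($sSample, '(?:http|rtmp|rtsp)s?[^"\r\n]+', 3)
ConsoleWrite(UBound($aStrict) & @CRLF)

; (?i) makes the literals caseless, so both lines match.
Local $aLoose = StringRegExp($sSample, '(?i)(?:http|rtmp|rtsp)s?[^"\r\n]+', 3)
ConsoleWrite(UBound($aLoose) & @CRLF)
```

Whether (?i) is wanted depends on the file: if the schemes are always written in lowercase, the case-sensitive form is enough.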
