Dieuz

Regular Expression Help!

18 posts in this topic

#1 ·  Posted (edited)

Hey guys,

I want to delete URLs from a file that don't meet my criteria, but I don't know how to use StringRegExp properly to achieve it.

Wrong format: http://www.site1.com

Good format: http://www.site1.com/anything

$file = FileOpen("URL.txt", 2) ; How can I set it to Read/Write mode at the same time?
$count = _FileCountLines("URL.txt")

For $x = 1 to $count

$url = FileReadLine($file, $x)

; Remove URL from file if it doesn't meet the criteria (using StringRegExp?)
; Wrong format: http://www.site1.com
; Good format: http://www.site1.com/anything
    
Next

FileClose($file)

URL.txt

http://www.site1.com
http://www.site2.com/anything
http://www.site2.com/test
http://www.site3.com
http://www.site3.com/test

After running the code above, I would like to have this in URL.txt :

http://www.site2.com/anything
http://www.site2.com/test
http://www.site3.com/test

Thanks!

;)

Edited by Dieuz




I posted something similar here; see if it helps.


#include <Array.au3>

Local $aMatch, $sText = _
    "http://www.site1.com" & @CRLF & _
    "http://www.site2.com/anything" & @CRLF & _
    "http://www.site2.com/test" & @CRLF & _
    "http://www.site3.com" & @CRLF & _
    "http://www.site3.com/test"
    
$aMatch = StringRegExp($sText, "(?i)http://www\.[^.\r\n]+\.[^/\r\n]+/.+", 3)

If IsArray($aMatch) Then _ArrayDisplay($aMatch)

The pattern is simple (read: not very restrictive). Tweak as necessary.


#4 ·  Posted (edited)

Thanks, I can see the pattern!

How could I tweak it so it won't accept:

http://www.site1.com/

(a normal website address with a slash at the end but nothing after it)

Thanks!

EDIT: Found the FileRead function ;)

Edited by Dieuz


Thanks, the regular expression seems to be accurate.

How can I extract all lines from a file and transfer it to a string like you did?

Funny enough, his example uses line ends as breaks. So you only have to read the data from a file and you are done. xD

Function is FileRead.


#6 ·  Posted (edited)

"\w+://.+/.{2,}"

EDIT: Better

"(?i)(?m:^)(\w+://.+/\w.*)(?:\v|\z)"
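For anyone who wants to try the idea without a file, here is a minimal sketch against an in-memory string. It is not the pattern above but a more explicit variant that excludes whitespace from the host and path parts, so line-end characters cannot end up in a match. The sample data and the variable names are assumptions made for illustration.

```autoit
#include <Array.au3>

; Sample data modelled on the URL.txt contents in this thread,
; plus one entry that ends in a bare slash
Local $sText = _
    "http://www.site1.com" & @CRLF & _
    "http://www.site1.com/" & @CRLF & _
    "http://www.site2.com/anything" & @CRLF & _
    "http://www.site3.com/test"

; (?m)^     start of each line
; \w+://    scheme
; [^/\s]+   host: everything up to the first slash, no whitespace
; /\S+      a slash followed by at least one non-whitespace character
Local $aMatch = StringRegExp($sText, "(?m)^\w+://[^/\s]+/\S+", 3) ; flag 3: array of all matches

; @error is non-zero when nothing matched
If Not @error Then ConsoleWrite(_ArrayToString($aMatch, @CRLF) & @CRLF)
```

With this sample only the site2 and site3 entries should be written to the console, because the first entry has no slash after the host and the second has nothing after it.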

Edited by GEOSoft

George

Question about decompiling code? Read the decompiling FAQ and don't bother posting the question in the forums.

Be sure to read and follow the forum rules. -AKA the AutoIt Reading and Comprehension Skills test.***

The PCRE (Regular Expression) ToolKit for AutoIT - (Updated Oct 20, 2011 ver:3.0.1.13) - Please update your current version before filing any bug reports. The installer now includes both 32 and 64 bit versions. No change in version number.

Visit my Blog .. currently not active but it will soon be resplendent with news and views. Also please remove any links you may have to my website. it is soon to be closed and replaced with something else.

"Old age and treachery will always overcome youth and skill!"


#7 ·  Posted (edited)

Thanks a lot, guys!

Here's what I got so far:

#include <File.au3> ; needed for _FileCreate

$file = FileOpen("BACKLINK.txt", 0) ; open for reading
$readbacklink = FileRead($file)
$bl_array = StringRegExp($readbacklink, "\w+://.+/.{2,}", 3) ; flag 3: array of all matches
FileClose($file)

_FileCreate("BACKLINK.txt") ; recreate the file empty

$file2 = FileOpen("BACKLINK.txt", 1) ; open for appending

For $w = 0 To UBound($bl_array) - 1
    FileWriteLine($file2, $bl_array[$w])
Next

FileClose($file2)

There is still one thing that isn't working properly. When the RegExp extracts the links and adds them to the array, it adds "[]h" at the end of each link...

Edited by Dieuz


First off, change that SRE to the one I used in the edit.

Secondly, you don't need FileOpen() or FileClose() for the reading part.

Next: Are you saying that with a plain text file as given above you are getting the extra characters added?

Try This

$sStr = FileRead("backlink.txt")
$bl_array = StringRegExp($sStr, "(?i)(?m:^)(\w+://.+/\w.*)(?:\v|\z)", 3)
If NOT @Error Then
   Local $sOut = ""
   For $i = 0 To Ubound($bl_array) -1
        $sOut &= $bl_array[$i]
   Next
   $hFile = FileOpen("backlink.txt", 2)
   FileWrite($hFile, StringStripWS($sOut, 2))
   FileClose($hFile)
EndIf

George


#9 ·  Posted (edited)

First,

Next: Are you saying that with a plain text file as given above you are getting the extra characters added?

Yes, even with a plain text file.

Here is what I see if I do an _ArrayDisplay():

Posted Image

Second, with the above code there is no line break between the URLs in the file. That's why I thought it was useful to use FileWriteLine().

By the way, thanks for taking the time to help me! I appreciate it!

;)

Edited by Dieuz


Change

"(?i)(?m:^)(\w+://.+/\w.*)(?:\v|\z)"

to

"(?i)(?m:^)(\w+://.+/\w.*)(?:\v|\z)+"

and see what you get. Please report back.


George


#11 ·  Posted (edited)

I am still getting the character added to every link.

Here is a working example so you can try it without having any file.

#include <Array.au3>
#Include <File.au3>

Local $bl_array, $sText = _
    "http://www.site1.com/" & @CRLF & _
    "http://site2.com/anything" & @CRLF & _
    "http://www.site2.com/test" & @CRLF & _
    "http://www.site3.com/" & @CRLF & _
    "http://www.site3.com/test"

$bl_array = StringRegExp($sText, "(?i)(?m:^)(\w+://.+/\w.*)(?:\v|\z)+",3)

_ArrayDisplay($bl_array)

As you can see every url is on a different line in the file.

Edited by Dieuz


George (& Dieuz),

If it is of any assistance, I am not getting any additional characters when I run that script (on 3.3.1.7). I get what I expected.

M23


Any of my own code posted anywhere on the forum is available for use by others without any restriction of any kind._______My UDFs:

Spoiler

ArrayMultiColSort ---- Sort arrays on multiple columns
ChooseFileFolder ---- Single and multiple selections from specified path treeview listing
Date_Time_Convert -- Easily convert date/time formats, including the language used
ExtMsgBox --------- A highly customisable replacement for MsgBox
GUIExtender -------- Extend and retract multiple sections within a GUI
GUIFrame ---------- Subdivide GUIs into many adjustable frames
GUIListViewEx ------- Insert, delete, move, drag, sort, edit and colour ListView items
GUITreeViewEx ------ Check/clear parent and child checkboxes in a TreeView
Marquee ----------- Scrolling tickertape GUIs
NoFocusLines ------- Remove the dotted focus lines from buttons, sliders, radios and checkboxes
Notify ------------- Small notifications on the edge of the display
Scrollbars ----------Automatically sized scrollbars with a single command
StringSize ---------- Automatically size controls to fit text
Toast -------------- Small GUIs which pop out of the notification area

 


Strange... I do have Version 3.3.1.7 and I'm getting the additional characters like in the picture I posted.


George (& Dieuz),

If it is of any assistance, I am not getting any additional characters when I run that script (on 3.3.1.7). I get what I expected.

M23

Neither am I, and I suspect his problem is in the text file. I've often seen this happen with a database, spreadsheet or some HTML code.

George


Arg...even without using a file at all (running the simple script above), I am getting the additional characters... why so ;)


#16 ·  Posted (edited)

Strange... I do have Version 3.3.1.7 and I'm getting the additional characters like in the picture I posted.

Okay, I'm assuming that you still get them with the code you posted (I'm not).

Try it with my code written the way it should have been (there was an error in the version I posted earlier).

$sStr = FileRead("backlink.txt")
$bl_array = StringRegExp($sStr, "(?i)(?m:^)(\w+://.+/\w.*)(?:\v|\z)+",3)
If NOT @Error Then
   Local $sOut = ""
   For $i = 0 To Ubound($bl_array) -1
        $sOut &= $bl_array[$i] & @CRLF
   Next
   $hFile = FileOpen("backlink.txt", 2)
   FileWrite($hFile, StringStripWS($sOut, 2))
   FileClose($hFile)
EndIf

If that still fails, try something that sounds really stupid at first glance: reboot your system and try it again.

Also what text editor are you reading the file with?

Edited by GEOSoft

George


$sStr = FileRead("backlink.txt")
$bl_array = StringRegExp($sStr, "(?i)(?m:^)(\w+://.+/\w.*)(?:\v|\z)+",3)
If NOT @Error Then
   Local $sOut = ""
   For $i = 0 To Ubound($bl_array) -1
        $sOut &= $bl_array[$i] & @CRLF
   Next
   $hFile = FileOpen("backlink.txt", 2)
   FileWrite($hFile, StringStripWS($sOut, 2))
   FileClose($hFile)
EndIf

This code DOES work now. I really don't know why it wasn't working at first. I am not getting any additional characters! Thanks a lot!

Quick and last question: what would be the best way to make sure there are no duplicate elements (URLs) in the $bl_array?

Seriously, thanks everyone for your help! I can now continue working on my app!


$sStr = FileRead("backlink.txt")
$bl_array = StringRegExp($sStr, "(?i)(?m:^)(\w+://.+/\w.*)(?:\v|\z)+",3)
If NOT @Error Then
   Local $sOut = ""
   For $i = 0 To Ubound($bl_array) -1
        If NOT StringInStr($sOut, $bl_array[$i] & @CRLF) Then $sOut &= $bl_array[$i] & @CRLF
   Next
   $hFile = FileOpen("backlink.txt", 2)
   FileWrite($hFile, StringStripWS($sOut, 2))
   FileClose($hFile)
EndIf
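
One caveat with the StringInStr check: it is a substring test, so a URL that happens to be the tail of an already stored URL (for example "ftp://host/a" after "sftp://host/a") would be skipped as a duplicate. As an alternative, here is a minimal sketch using _ArrayUnique from Array.au3, assuming its default parameters, where element [0] of the result holds the count. The sample array and variable names are hypothetical.

```autoit
#include <Array.au3>

; Hypothetical sample input with one duplicate entry
Local $aLinks[4] = ["http://www.site2.com/test", "http://www.site3.com/test", _
        "http://www.site2.com/test", "http://www.site2.com/anything"]

; _ArrayUnique returns a new array; with the default parameters
; element [0] holds the number of unique values that follow
Local $aUnique = _ArrayUnique($aLinks)

; Rebuild the output string from the unique entries only
Local $sOut = ""
For $i = 1 To $aUnique[0]
    $sOut &= $aUnique[$i] & @CRLF
Next

ConsoleWrite($sOut)
```

The resulting $sOut can be written back with FileWrite in the same way as in the code above.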


George

