handofthrawn Posted July 24, 2014

I have a web site that gives me something like the text below. I want to find all the words that come after "EDT" (the first one is "BLAH") and then skip to the next "EDT" string. Occasionally there are multiple words I need, separated by a comma, like the second group of words (RandomBlah4 and Blah1). The word I need is always preceded by "EDT" and a space, and it is followed by a space unless there is a comma introducing a second word. I need a script to pull these words out of a large amount of text and put them into a text file, the clipboard, or whatever, so I can put them into individual Excel columns.

Sample text:

"12:39 EDT BLAH This blah is something I need with more random text. Random text that goes on and on 12:35 EDT RandomBlah4, Blah1 I also need RandomBlah4 and Blah1 Randtom text I don't need"

My question is: what is the best way to do this? I am not a newb to AutoIt, but I'm not a master of it either. Before spending a day throwing darts in the dark, I figured I would ask what the best way to go about this is. Should I use StringReplace, or will that be too difficult with something that can be 10+ pages long? Should I put the text in a Word doc and use _Word_DocFindReplace? How about using StringRegExpReplace to search for anything with a space and characters that end in EDT and remove them from the document?

Thanks for any help you can provide.

jchd Posted July 24, 2014

What is your definition of "word" in this precise context?
How are multiple words separated when there are several?
How do we determine the end of the sequence of words that you want extracted?

Unless you give definitive, precise answers to those three questions, I'm afraid your quest is going to be a moving target. Regular expressions are very powerful, but they are in fact programs that need a precise specification to provide the expected result reliably.

This wonderful site allows debugging and testing regular expressions (many flavors available). An absolute must have in your bookmarks.
Another excellent RegExp tutorial. Don't forget downloading your copy of up-to-date pcretest.exe and pcregrep.exe here
RegExp tutorial: enough to get started
PCRE v8.33 regexp documentation latest available release and currently implemented in AutoIt beta.
SQLitespeed is another feature-rich premier SQLite manager (includes import/export). Well worth a try.
SQLite Expert (freeware Personal Edition or payware Pro version) is a very useful SQLite database manager.
An excellent eBook covering almost every aspect of SQLite3: a must-read for anyone doing serious work.
SQL tutorial (covers "generic" SQL, but most of it applies to SQLite as well)
A work-in-progress SQLite3 tutorial. Don't miss other LxyzTHW pages!
SQLite official website with full documentation (may be newer than the SQLite library that comes standard with AutoIt)

handofthrawn (Author) Posted July 24, 2014

The word would be a group of letters between EDT and the random text, minus all the spaces. For example:

    EDT Hotdog randomtext
    EDT Football randomtext
    EDT Fall,Summer randomtext
    EDT Spring,Winter,Pool,Class randomtext
    EDT Ball randomtext

The words here are: Hotdog, Football, Fall, Summer, Spring, Winter, Pool, Class, and Ball.

The pattern is always EDT followed by a bunch of spaces, the word(s), then more spaces. Multiple words are always separated by a ",". The above has two examples of multiple words.

Essentially, my program will look at these two examples:

    "EDT Hotdog blahblahblahblbh"
    "EDT BALL,SUN,FALL blahblahblahblbh"

And it would give me back the text:

    Hotdog
    BALL
    SUN
    FALL

UEZ Posted July 24, 2014

Try this:

    $sText = "EDT Hotdog randomtext" & @CRLF & _
            "EDT Football randomtext" & @CRLF & _
            "EDT Fall,Summer randomtext" & @CRLF & _
            "EDT Spring,Winter,Pool,Class randomtext" & @CRLF & _
            "EDT Ball randomtext"
    MsgBox(0, "RegEx Test", StringRegExpReplace($sText, "EDT\h*(.+)\h+.+", "$1"))

Br,
UEZ

Edited July 24, 2014 by UEZ

Please don't send me any personal message and ask for support! I will not reply!
Selection of finest graphical examples at Codepen.io
The own fart smells best!
✌Her 'sikim hıyar' diyene bir avuç tuz alıp koşma!¯\_(ツ)_/¯ ٩(●̮̮̃•̃)۶ ٩(-̮̮̃-̃)۶ૐ

handofthrawn (Author) Posted July 24, 2014

Thanks so much, UEZ. I will give this a go!

handofthrawn (Author) Posted July 24, 2014

I'm struggling a little bit trying to modify your code to get it right. I put your code into AutoIt and it spat out the correct words (although it didn't separate "Fall,Summer", but no biggie). When I tried to put the example into a text file, I ran into problems: all it gave back was "hotdog". I made a new example that is a lot more realistic, because this feels like it's being a bit tricky. I put the example below into test.txt, and this is my code:

    $sText = FileReadLine("test.txt", 1)
    MsgBox(0, "RegEx Test", StringRegExpReplace($sText, "EDT\h*(.+)\h+.+", "$1"))

************************EXAMPLE*******************************
10:27 EDT BALL
shakespeare is awsome The Internets are dope. Teachers aren't cool I'm going to skip class
10:19 EDT FALL, SPRING cat videos are what i watch all day
pewdewpie is the best, off to watch the biebs
************************EXAMPLE*******************************

That example comes back with "10:27 BALL". I want it to come back with:

    BALL
    FALL
    SPRING

Maybe if I understand your code for StringRegExpReplace I will be able to tweak it more myself. I read the function's help page, but some of the things you used didn't seem to be listed there. I don't get the \h, and I assume the * means wildcard. The +.+ confuses me too. Is that looking for a period?

Thank you again for any help.

Exit Posted July 24, 2014

    #include <array.au3>
    $sText = "10:27 EDT BALL " & @crlf & _
            "shakespeare is awsome The Internets are dope. Teachers aren't cool I'm going to skip class " & @crlf & _
            "10:19 EDT FALL, SPRING cat videos are what i watch all day " & @crlf & _
            "pewdewpie is the best, off to watch the biebs "
    MsgBox(0, "Solution of UEZ", StringRegExpReplace($sText, "EDT\h*(.+)\h+.+", "$1"))

    $a1 = StringSplit(StringReplace(StringStripWS(StringReplace($sText, @TAB, " "), 4), ", ", ",,"), "EDT ", 3)
    _ArrayDelete($a1, 0) ; first entry is before EDT
    $words = ""
    For $i = 0 To UBound($a1) - 1
        $a2 = StringSplit($a1[$i], " ")
        $words &= StringReplace($a2[1], ",", " ") & " "
    Next
    $a2 = StringSplit(StringStripWS($words, 7), " ", 2)
    _ArrayDisplay($a2, "Solution of EXIT")

Edited August 6, 2014 by Exit

App: Au3toCmd
UDF: _SingleScript()

UEZ Posted July 24, 2014

The import[ant] thing in such cases is to provide the real text rather than an example, so that a solution can be found.

Let me try again...

Br,
UEZ

mikell Posted July 24, 2014

Precision in the requirements is really the main thing. So this will work with your last example... until you add different requirements.

    #Include <Array.au3>

    $sText = FileRead("test.txt")
    $a = StringRegExp($sText, 'EDT\h+(.+?)\h{2,}', 3)
    _ArrayDisplay($a)

    $res = ""
    For $i = 0 to UBound($a)-1
        $tmp = StringRegExp($a[$i], '([^\s,]+)', 3)
        For $j = 0 to UBound($tmp)-1
            $res &= $tmp[$j] & @crlf
        Next
    Next
    Msgbox(0,"", $res)

handofthrawn (Author) Posted July 24, 2014

I really apologize for the bad first examples. I was trying to be succinct and ended up screwing it up badly. I really appreciate all the help I've been given. I know how frustrating it can be to try to help someone who has a moving target, and this was definitely not my intention.

Exit, you nailed it! Thanks.

Last question (I pray!): is there a way to remove the row numbers and keep just the column where the words are listed? I'm asking because right now, when I select everything, it spits out:

    [0]|BALL
    [1]|FALL
    [2]|SPRING

Thanks again, everyone, for the help.

Edited July 24, 2014 by handofthrawn

jchd Posted July 24, 2014

Look at _ArrayDisplay in the help file. There is a flag where you can set display options.

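For readers who want the words themselves rather than the _ArrayDisplay view, here is a minimal sketch of one alternative: join the array into a string and send it to the clipboard or a file. The `$aWords` array and the `words.txt` file name are stand-ins invented for this example; in practice `$aWords` would be the array produced by one of the scripts in this thread.

```autoit
#include <Array.au3>

; Stand-in for the word array built by any of the scripts in this thread (assumed data)
Local $aWords[3] = ["BALL", "FALL", "SPRING"]

; Join the elements with line breaks: one word per line, without the "[0]|" row prefixes
Local $sWords = _ArrayToString($aWords, @CRLF)

; Copy to the clipboard; pasting into Excel should put one word in each row
ClipPut($sWords)

; Alternative: join with tabs instead, so that pasting puts one word in each column
; ClipPut(_ArrayToString($aWords, @TAB))

; Or write the list to a text file ("words.txt" is a hypothetical name)
FileWrite("words.txt", $sWords)
```

Joining with @CRLF versus @TAB is the only design choice here: it decides whether a paste into a spreadsheet fills rows or columns.
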
handofthrawn (Author) Posted July 24, 2014

Thanks, jchd. I should be good to go!

Malkey Posted July 25, 2014

And another approach.

    #include <Array.au3> ; For display purposes only.

    Local $sText = "EDT Hotdog randomtext" & @CRLF & _
            "EDT Football randomtext" & @CRLF & _
            "EDT Fall,Summer randomtext" & @CRLF & _
            "EDT Spring,Winter,Pool,Class randomtext" & @CRLF & _
            "10:27 EDT BALL " & @CRLF & _
            "shakespeare is awsome The Internets are dope. Teachers aren't cool I'm going to skip class " & @CRLF & _
            "10:19 EDT FALL, SPRING cat videos are what i watch all day " & @CRLF & _
            "pewdewpie is the best, off to watch the biebs "

    $sText = FileRead("test.txt") ; Replaces the inline sample above with the contents of the real file.

    Local $sResults = StringRegExpReplace($sText, "(?m)(^.*EDT\h\h[^A-Z]*\h.*$)|(^.*EDT\h{2,})|(^.*(?!EDT))|((\h{2,}|\t).*$)", "")
    ; The above RegExp pattern erases the following:-
    ; "(^.*EDT\h\h[^A-Z]*\h.*$)" - The entire line that has no upper case characters between "EDT@Tab@Tab" and "@Tab".
    ; "(^.*EDT\h{2,})"           - The beginning of all lines, containing all characters up to and including "EDT@Tab@Tab" or more white space.
    ; "(^.*(?!EDT))"             - The entire line that does not have the characters "EDT" present.
    ; "((\h{2,}|\t).*$)"         - The end of all lines, from and including either two horizontal white spaces or one tab character.

    $sResults = StringStripWS(StringRegExpReplace($sResults, "\h*,\h*", @CRLF), 6) ; $STR_STRIPTRAILING (2) + $STR_STRIPSPACES (4) = 6
    ; "\h*,\h*" - Replace all commas with @CRLF. Commas may or may not have any number of spaces on either side.

    MsgBox(0, "String - Malkey's Solution", $sResults)

    ;Or
    Local $a2 = StringRegExp($sResults, "\V+", 3)
    _ArrayDisplay($a2, "Array - Malkey's Solution ")

Edit: Having the test file of post #17, I was able to refine the main RegExp pattern.

Edited July 27, 2014 by Malkey

TheSaint Posted July 25, 2014

Personally I like the simple approach, and would just read line by line, StringSplit on EDT, and then strip the whitespace.

To get the lines, you can just do a full read into memory (a variable) and get the lines by splitting on the carriage returns.

Make sure brain is in gear before opening mouth! Remember, what is not said, can be just as important as what is said.
What is the Secret Key? Life is like a Donut
If I put effort into communication, I expect you to read properly & fully, or just not comment. Ignoring those who try to divert conversation with irrelevancies. If I'm intent on insulting you or being rude, I will be obvious, not ambiguous about it. I'm only big and bad, to those who have an over-active imagination. I may have the Artistic Liesense to disagree with you.
TheSaint's Toolbox (be advised many downloads are not working due to ISP screwup with my storage)

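The line-by-line idea above could be sketched roughly as follows. This is only an illustration, not TheSaint's code: it assumes the input is in test.txt, that "EDT" appears in upper case, and (as the original poster reports later in the thread) that a tab follows the word field. If the real data uses spaces there instead, the step that cuts at the first tab would need a different rule.

```autoit
#include <StringConstants.au3>

; Read the whole file into memory and split it into lines
Local $sText = FileRead("test.txt")
Local $aLines = StringSplit(StringStripCR($sText), @LF, $STR_NOCOUNT)

Local $sOut = "", $iPos, $sRest, $aField, $aWords, $sWord

For $sLine In $aLines
    ; Case-sensitive search, so "edt" inside ordinary words is ignored
    $iPos = StringInStr($sLine, "EDT", $STR_CASESENSE)
    If $iPos = 0 Then ContinueLoop

    ; Everything after "EDT", with leading and trailing whitespace removed
    $sRest = StringStripWS(StringMid($sLine, $iPos + 3), $STR_STRIPLEADING + $STR_STRIPTRAILING)
    If $sRest = "" Then ContinueLoop

    ; ASSUMPTION: the word field ends at the first tab
    $aField = StringSplit($sRest, @TAB, $STR_NOCOUNT)

    ; Split the field on commas and keep each non-empty word
    $aWords = StringSplit($aField[0], ",", $STR_NOCOUNT)
    For $sItem In $aWords
        $sWord = StringStripWS($sItem, $STR_STRIPLEADING + $STR_STRIPTRAILING)
        If $sWord <> "" Then $sOut &= $sWord & @CRLF
    Next
Next

MsgBox(0, "Words after EDT", $sOut)
```

Compared with the regular-expression answers, this trades compactness for steps that can each be inspected on their own, which may be easier to adjust when the input format changes.
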
handofthrawn (Author) Posted July 25, 2014

I see the error of my newb ways. I am trying to find this import button, but I can't find it. I just tried to upload a file, but I screwed that up too. I tried to upload it to Misc, and then I realized that the Misc download section is for cool scripts and not a place for people to upload their files for sharing.

Quoting UEZ: "The import thing is in such cases to provide a real text rather than an example to be able to find a solution."

I am messing around with StringRegExpReplace a bit more so I can understand what you guys are writing. For the longest time I was googling and getting nowhere. The settings were documented under StringRegExp all along.

jchd Posted July 25, 2014

Right. Duplicating all this would be a waste.

Just checking: you're right to imply that there should be a clearly visible reference to the full help discussion in StringRegExp. I wrote it but neglected to reference it boldly in the companion function.

handofthrawn (Author) Posted July 27, 2014

I attached my real-life example so I no longer screw this up. For the past 3 days I've tried to take all the responses and learn as much as I can, but I'm hitting some roadblocks. If anyone has a moment to correct my understanding of any of these, I would greatly appreciate it. I hate asking for help with this one problem, so I'm doing my best to learn as much as possible so I can fix future problems myself and even help others (got to give back the love).

UEZ's solution:

    StringRegExpReplace($sText, "EDT\h*(.+)\h+.+", "$1")

1.) "EDT" - Look for EDT
2.) "\h*" - Look for any amount of whitespace until step 3
3.) "(.+)" - Capture any amount of characters
4.) "\h" - Any amount of whitespace
Store this all in $1.

Exit's solution: same StringRegExpReplace as UEZ, but puts it in an array.

    $a1 = StringSplit(StringReplace(StringStripWS($stext,4),", ",",,"),"EDT ",3)
    _ArrayDelete($a1,0) ; first entry is before EDT
    $words=""
    for $i=0 to UBound($a1)-1
        $a2=StringSplit($a1[$i]," ")
        $words &= StringReplace($a2[1],","," ")& " "
    Next
    $a2=StringSplit(StringStripWS($words,7)," ",2)
    _ArrayDisplay($a2,"Solution of EXIT")

I'm having trouble reading how this works. Is it saying to strip $stext of whitespace and then change every ", " in the string to ",,"? After that I get lost; I don't understand the "EDT ", 3 portion at all. I half understand the For loop: it loops once for each entry in $a1. Is the $a2=StringSplit line adding a space between the words? And is $words taking everything with a comma, replacing the comma with a " ", and then adding another " " to it with the & " "? Finally, is the line after the Next stripping the whitespace out of $words? I don't get the ,7 part, because the function only has 1, 2, 4, and 8 for flags. I also don't get why StringSplit is using the 2 option to disable the return count.

Mikell used this:

    $a = StringRegExp($sText, 'EDT\h+(.+?)\h{2,}', 3)

1.) This looks for EDT and white space after it
2.) (.+?) Does this capture all characters until the \h?
3.) \h{2,} looks at all the whitespace?
4.) 3 flag says to do this 3 times?

Thanks again to everyone who has helped and also to anyone who can help.

Attachment: test.txt

mikell Posted July 27, 2014

So here is my contribution:

    $a = StringRegExp($sText, 'EDT\h+(.+?)\h{2,}', 3)

1.) EDT\h+ : "This looks for EDT and one white space after it" <= \h+ means "one or more white space(s)"
2.) (.+?) "Does this capture all characters until the \h?" <= yes it does (one or more characters, lazy)
3.) \h{2,} "looks at all the whitespace?" <= it looks for 2 or more white spaces
4.) "3 flag says to do this 3 times?" <= no, it says: return the matches as an array

    $tmp = StringRegExp($a[$i], '([^\s,]+)', 3)

This one means: match one or more characters which are not a \s (whitespace) or a comma, and return the results as an array.

Edited July 27, 2014 by mikell

handofthrawn (Author) Posted July 27, 2014

Thanks, Mikell. A follow-up question, if you are still around. I noticed in my test.txt example that the whitespace is not made of space characters: there are specifically two tabs between EDT and the WORD(s), and another tab after the WORD(s). Will \h+ treat tabs the same as spaces? No biggie if you aren't around or don't know; I'm sure some follow-up testing will give me the answer.

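On the tab question: in PCRE, \h matches any horizontal whitespace character, and that class includes both the space and the tab. A quick way to check this is a small test along the following lines. The sample string is made up to mimic the layout described above (two tabs after EDT, one tab after the words), so treat it as an assumption about the real file.

```autoit
; Made-up sample line: two tabs after EDT, one tab after the word field
Local $sSample = "10:19 EDT" & @TAB & @TAB & "FALL, SPRING" & @TAB & "cat videos are what i watch all day"

; \h+ consumes the tabs after EDT; [^\t]+ then captures everything up to the next tab
Local $aMatch = StringRegExp($sSample, 'EDT\h+([^\t]+)', 3)

If @error Then
    MsgBox(0, "Test", "No match")
Else
    MsgBox(0, "Field after EDT", $aMatch[0])
EndIf
```

If \h+ handles the tabs as expected, the message box should show the word field on its own, ready to be split on commas.
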
mikell Posted July 27, 2014

You're right. For the txt file you provided, here is the working script:

    #Include <Array.au3>

    $sText = FileRead("test.txt")
    $a = StringRegExp($sText, 'EDT\h+([$.A-Z,' & Chr(32) & ']+)', 3)
    _ArrayDisplay($a)

    $res = ""
    For $i = 0 to UBound($a)-1
        If StringStripWS($a[$i], 3) = "" Then ContinueLoop ; this excludes empty lines
        $tmp = StringRegExp($a[$i], '([^\s,]+)', 3)
        For $j = 0 to UBound($tmp)-1
            $res &= $tmp[$j] & @crlf
        Next
    Next
    Msgbox(0,"", $res)
    FileWrite("results.txt", $res)

Edit: Rectification after remarks from Malkey.

Edited July 28, 2014 by mikell