What functions can help me pull strings out of a paragraph?

handofthrawn · July 24, 2014

I have a web site that gives me something like the text below. I want to find all the words that come after "EDT" (the first one is "BLAH") and then skip to the next EDT string. On occasion, there are multiple words I need, separated by a comma, like the second group of words (RandomBlah4 and Blah1). The word I need is always preceded by EDT and a space, and followed by a space unless a comma introduces a second word. I need a script to pull these words out of a large amount of text and then put them into a text file, the clipboard, whatever, so I can put them into individual Excel columns.

Sample Text

"12:39 EDT BLAH This blah is something I need with more random text.

Random text that goes on and on

12:35 EDT RandomBlah4, Blah1 I also need RandomBlah4 and Blah1

Randtom text I don't need"

My question is: what is the best way to do this? I am not a newb to AutoIt, but I'm not a master of it either. Before I spend a day throwing darts in the dark, I figured I would ask what the best way to go about this is. Should I use StringReplace, or will that be too difficult with something that can be 10+ pages long? Should I put the text in a Word doc and use _Word_DocFindReplace? How about using StringRegExpReplace and searching for anything with a space and characters that end in EDT and removing them from the document?

Thanks for any help you can provide.

jchd · July 24, 2014

What is your definition of "word" in this precise context?

How are multiple words separated when there are several?

How to determine the end of the sequence of words that you want extracted?

Unless you give definitive, precise answers to those 3 questions, I'm afraid your quest is going to be a moving target. Regular expressions are very powerful, but they are in fact programs that need a precise specification to provide the expected result reliably.

handofthrawn · July 24, 2014

The word would be a group of letters in between EDT and the randomtext minus all the spaces. For example

EDT Hotdog randomtext

EDT Football randomtext

EDT Fall,Summer randomtext

EDT Spring,Winter,Pool,Class randomtext

EDT Ball randomtext

The words here are: Hotdog, Football, Fall, Summer, Spring, Winter, Pool, Class, and Ball

The pattern is always EDT with a bunch of spaces, the word(s), then more spaces.

Multiple words are always separated by a ",". The above has two examples of multiple words.

Essentially my program will look for these two examples:

"EDT Hotdog blahblahblahblbh"

"EDT BALL,SUN,FALL blahblahblahblbh"

And it would give me back the text

Hotdog

BALL

SUN

FALL

UEZ · July 24, 2014

Try this:

$sText = "EDT    Hotdog     randomtext" & @CRLF & _
"EDT    Football   randomtext" & @CRLF & _
"EDT    Fall,Summer   randomtext" & @CRLF & _
"EDT    Spring,Winter,Pool,Class   randomtext" & @CRLF & _
"EDT    Ball  randomtext"

MsgBox(0, "RegEx Test", StringRegExpReplace($sText, "EDT\h*(.+)\h+.+", "$1"))

Br,

UEZ

Edited July 24, 2014 by UEZ

handofthrawn · July 24, 2014

Thanks so much UEZ. I will give this a go!

handofthrawn · July 24, 2014

I'm struggling a little bit trying to modify your code to get it right. I put your code into AutoIt and it spit out the correct words (although it didn't separate the "Fall,Summer", but no biggie). When I tried to put the example into a text file I ran into problems. All it gave back was hotdog.

I made a new example that is a lot more realistic, because this feels like it's being a bit tricky. I put the example below into test.txt and this is my code.

$sText = FileReadLine("test.txt", 1)
MsgBox(0, "RegEx Test", StringRegExpReplace($sText, "EDT\h*(.+)\h+.+", "$1"))

************************EXAMPLE*******************************
10:27 EDT       BALL
shakespeare is awsome    The Internets are dope. Teachers aren't cool I'm going to skip class
10:19 EDT       FALL, SPRING   cat videos are what i watch all day
pewdewpie is the best, off to watch the biebs
************************EXAMPLE*******************************

That example comes back with "10:27 BALL". I want it to come back with:

Ball

FALL

SPRING

Maybe if I understand your code for StringRegExpReplace I will be able to tweak it more myself. I read the function's help page, but some of the stuff you did didn't seem to be listed there. I don't get the \h, and I assume the * means wild card. The +.+ confuses me too. Is that looking for a period?

Thank you again for any help.

Exit · July 24, 2014

#include <array.au3>

 $sText = "10:27 EDT        BALL    " & @crlf & _
 "shakespeare is awsome     The Internets are dope.  Teachers aren't cool I'm going to skip class " & @crlf & _
 "10:19 EDT        FALL, SPRING    cat videos are what i watch all day " & @crlf & _
 "pewdewpie is the best, off to watch the biebs "

MsgBox(0, "Solution of UEZ", StringRegExpReplace($sText, "EDT\h*(.+)\h+.+", "$1"))

$a1 = StringSplit(StringReplace(StringStripWS(StringReplace($sText,@TAB," "), 4), ", ", ",,"), "EDT ", 3)
_ArrayDelete($a1,0) ; first entry is before EDT
$words=""
for $i=0 to UBound($a1)-1
    $a2=StringSplit($a1[$i]," ")
    $words &= StringReplace($a2[1],","," ")& " "
Next
    $a2=StringSplit(StringStripWS($words,7)," ",2)
_ArrayDisplay($a2,"Solution of EXIT")

Edited August 6, 2014 by Exit

UEZ · July 24, 2014

The import thing is in such cases to provide a real text rather than an example to be able to find a solution.

Let me try again...

Br,

UEZ

mikell · July 24, 2014

The precision in requirements is really the main thing

So this will work with your last example.... until you add different requirements

#Include <Array.au3>

$sText = FileRead("test.txt")
$a = StringRegExp($sText, 'EDT\h+(.+?)\h{2,}', 3)
_ArrayDisplay($a)
$res = ""
For $i = 0 to UBound($a)-1
   $tmp = StringRegExp($a[$i], '([^\s,]+)', 3)
   For $j = 0 to UBound($tmp)-1
       $res &= $tmp[$j] & @crlf
   Next
Next
Msgbox(0,"", $res)

handofthrawn · July 24, 2014

I really apologize for the bad first examples. I was trying to be succinct and ended up screwing it up badly. I really appreciate all the help I've been given. I know how frustrating it can be to try to help someone who has a moving target, and this was definitely not my intention.

Exit, you nailed it! Thanks. Last question (I pray!): is there a way to remove the row numbers and just keep the column where the words are listed? I'm asking because right now when I select everything it spits out

[0]|BALL
[1]|FALL
[2]|SPRING

Thanks again everyone for the help.

Edited July 24, 2014 by handofthrawn

jchd · July 24, 2014

Look at _ArrayDisplay in the help file. There is a flag where you can set display options.
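
As an alternative to the display flag, the words can be joined into plain text before they ever reach _ArrayDisplay. This is only a minimal sketch, assuming the results sit in a 1D array; the $aWords sample data and the results.txt file name are placeholders, not part of the scripts posted above.

```autoit
#include <Array.au3>

; Hypothetical sample data standing in for the extracted words.
Local $aWords[3] = ["BALL", "FALL", "SPRING"]

; Join the elements with line breaks, so no "[0]|" row prefixes appear.
Local $sWords = _ArrayToString($aWords, @CRLF)

ClipPut($sWords) ; clipboard: pasting into Excel gives one word per row
FileWrite("results.txt", $sWords & @CRLF) ; or append the words to a text file
```

Joining with @TAB instead of @CRLF should place the words in separate columns of one row when pasted into Excel.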

handofthrawn · July 24, 2014

Thanks jchd. I should be good to go!

Malkey · July 25, 2014

And another approach.

#include <Array.au3> ; For display purposes only.

Local $sText = "EDT    Hotdog     randomtext" & @CRLF & _
        "EDT    Football   randomtext" & @CRLF & _
        "EDT    Fall,Summer   randomtext" & @CRLF & _
        "EDT    Spring,Winter,Pool,Class   randomtext" & @CRLF & _
        "10:27 EDT        BALL    " & @CRLF & _
        "shakespeare is awsome     The Internets are dope.  Teachers aren't cool I'm going to skip class " & @CRLF & _
        "10:19 EDT        FALL, SPRING    cat videos are what i watch all day " & @CRLF & _
        "pewdewpie is the best, off to watch the biebs "

Local $sText = FileRead("test.txt")

Local $sResults = StringRegExpReplace($sText, "(?m)(^.*EDT\h\h[^A-Z]*\h.*$)|(^.*EDT\h{2,})|(^.*(?!EDT))|((\h{2,}|\t).*$)", "")
; The above RegExp pattern erases the following:-
; "(^.*EDT\h\h[^A-Z]*\h.*$)" - The entire line that has no upper case characters between "EDT@Tab@Tab" and "@Tab"
; "(^.*EDT\h{2,})" -           The beginning of all lines contain all characters up to and including "EDT@Tab@Tab or more space"
; "(^.*(?!EDT))" -             The entire line that does not have the characters, "EDT" present.
; "((\h{2,}|\t).*$)" -         The end of all lines from and including either two horizontal white spaces or one tab character.

$sResults = StringStripWS(StringRegExpReplace($sResults, "\h*,\h*", @CRLF), 6) ; $STR_STRIPTRAILING (2) + $STR_STRIPSPACES (4) = 6
;  "\h*,\h*" - Replace all commas with @CRLF. Commas may or may not have any number of spaces on either side.

MsgBox(0, "String - Malkey's Solution", $sResults)

;Or

Local $a2 = StringRegExp($sResults, "\V+", 3)
_ArrayDisplay($a2, "Array - Malkey's Solution ")

Edit: Having the test file of post#17, I was able to refine the main RegExp pattern.

Edited July 27, 2014 by Malkey

TheSaint · July 25, 2014

Personally I like the simple approach, and would just read line by line and StringSplit on EDT and then strip the whitespace.

To get the lines, you can just do a full read into memory (a variable) and do lines as splits on the carriage returns.
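
A rough sketch of that line-by-line idea follows. It is not TheSaint's code. It assumes the word list ends at a tab character, and the sample text in $sText is made up for illustration.

```autoit
; Hypothetical sample: tabs after EDT and after the word list.
Local $sText = "10:27 EDT" & @TAB & @TAB & "BALL" & @TAB & @CRLF & _
        "shakespeare is awsome" & @CRLF & _
        "10:19 EDT" & @TAB & @TAB & "FALL, SPRING" & @TAB & "cat videos" & @CRLF

; Remove carriage returns, then split the text into lines on @LF.
Local $aLines = StringSplit(StringStripCR($sText), @LF)
Local $sWords = ""

For $i = 1 To $aLines[0]
    ; Flag 1 treats "EDT" as one whole delimiter instead of the characters E, D and T.
    Local $aParts = StringSplit($aLines[$i], "EDT", 1)
    If $aParts[0] < 2 Then ContinueLoop ; no "EDT" on this line

    Local $sRest = StringStripWS($aParts[2], 1) ; 1 = strip leading whitespace
    Local $aField = StringSplit($sRest, @TAB) ; assumption: a tab ends the word list
    Local $aNames = StringSplit($aField[1], ",")

    For $j = 1 To $aNames[0]
        $sWords &= StringStripWS($aNames[$j], 3) & @CRLF ; 3 = strip leading + trailing
    Next
Next

MsgBox(0, "Line-by-line sketch", $sWords)
```

If the real text uses runs of spaces rather than a tab after the words, the split on @TAB would need a different end-of-list rule.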

handofthrawn · July 25, 2014

I see the error of my newb ways. I am trying to find this import button but I can't find it. I just tried to upload a file but I screwed that up too. I tried to upload it to Misc and then I realized that Misc download section was for cool scripts and not a place for people to upload their files for sharing.

The import thing is in such cases to provide a real text rather than an example to be able to find a solution.

I am messing around with StringRegExpReplace a bit more so I can understand what you guys are writing. For the longest time I was googling and getting nowhere. The settings were documented under StringRegExp all along.

jchd · July 25, 2014

Right. Duplicating all this would be a waste. Just checking: you're right to imply that there should be a clearly visible reference to the full help discussion in StringRegExp. I wrote it but neglected to reference it boldly in the companion function.

handofthrawn · July 27, 2014

I attached my real life example so I no longer screw this up.

For the past 3 days I've tried to take all the responses and learn as much as I can but I'm hitting some roadblocks. If anyone has a moment to correct my responses on any of these I would greatly appreciate it. I hate asking for help for this one problem so I'm doing my best to learn as much as possible so I can fix future problems myself and even help others (got to give back the love).

UEZ's solution: StringRegExpReplace($sText, "EDT\h*(.+)\h+.+", "$1")
1.) "EDT" - Look for EDT
2.) "\h*" - Look for any amount of whitespace until step 3
3.) "(.+)" - Capture any amount of characters
4.) "\h" - Any amount of whitespace
Store this all in $1.

Exit solution: Same stringregexpreplace as UEZ but puts it in an array.
$a1= StringSplit(StringReplace(StringStripWS($stext,4),", ",",,"),"EDT ",3)
_ArrayDelete($a1,0) ; first entry is before EDT
$words=""
for $i=0 to UBound($a1)-1
    $a2=StringSplit($a1[$i]," ")
    $words &= StringReplace($a2[1],","," ")& " "
Next
    $a2=StringSplit(StringStripWS($words,7)," ",2)
_ArrayDisplay($a2,"Solution of EXIT")

I'm having trouble reading how this works. Is it saying to strip $stext of whitespace and then change every ", " in the string to ",,"? After that I get lost; I don't understand the ("EDT ", 3) portion at all. I half understand the for loop. It loops once for each entry in $a1. Is the $a2=StringSplit line adding a space between the words? And is $words taking everything with a comma, replacing the comma with a " ", and then adding another " " to it with the & " "?

Finally the Next part of the loop is stripping the whitespace out of $words? I don't get the ,7 part because the function only has 1, 2, 4, and 8 for flags. I also don't get why stringsplit is using a 2 option to disable the return count.

Mikell used this: $a = StringRegExp($sText, 'EDT\h+(.+?)\h{2,}', 3)
1.) This looks for EDT and white space after it
2.) (.+?) Does this capture all characters until the \h?
3.) \h{2,} looks at all the whitespace?
4.) 3 flag says to do this 3 times?

Thanks again to everyone who has helped and also to anyone who can help.

test.txt

mikell · July 27, 2014

So here is my contribution

$a = StringRegExp($sText, 'EDT\h+(.+?)\h{2,}', 3)

1.) EDT\h+ : This looks for EDT and one white space after it     <= \h+ means "one or more horizontal white space(s)"
2.) (.+?) Does this capture all characters until the \h?     <= yes it does (one or more characters, lazy)
3.) \h{2,} looks at all the whitespace?    <= it looks for 2 or more white spaces
4.) 3 flag says to do this 3 times?     <= no, it says : return the matches as an array

$tmp = StringRegExp($a[$i], '([^\s,]+)', 3)

This one means : match one or more characters which are not a \s (whitespace) or a comma, return the results as an array
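
To see what "lazy" changes in practice, the two quantifiers can be compared side by side. This is a small sketch on a made-up line (the $sLine text is an assumption, chosen so that a second run of two spaces appears later in the line).

```autoit
; Hypothetical line with two separate runs of 2+ spaces after the words.
Local $sLine = "EDT    Fall,Summer   random  text"

; Greedy: (.+) grabs as much as it can and backtracks to the LAST run of 2+ spaces.
Local $aGreedy = StringRegExp($sLine, 'EDT\h+(.+)\h{2,}', 3)

; Lazy: (.+?) grabs as little as it can and stops at the FIRST run of 2+ spaces.
Local $aLazy = StringRegExp($sLine, 'EDT\h+(.+?)\h{2,}', 3)

; Brackets make leading or trailing whitespace in the captures visible.
MsgBox(0, "Greedy vs lazy", _
        "Greedy: [" & $aGreedy[0] & "]" & @CRLF & _
        "Lazy:   [" & $aLazy[0] & "]")
```

On this line the greedy capture runs on to "Fall,Summer   random", while the lazy one stops at "Fall,Summer", which is why the lazy form suits text where more double spaces can follow the words.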

Edited July 27, 2014 by mikell

handofthrawn · July 27, 2014

Thanks Mikell. A follow-up question if you are still around. I noticed in my test.txt example that it's not spacebar whitespace but specifically two tabs of whitespace between EDT and the WORD(s). After the WORD(s), it's another tab of whitespace. Will \h+ treat tabs the same as spacebar whitespace? No biggie if you aren't around or don't know; I'm sure some follow-up testing will give me the answer.
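
One quick way to check this is a one-line test. The sketch below uses a made-up sample line with tabs in the positions described; \h is the PCRE class for horizontal whitespace, which covers both spaces and tabs.

```autoit
; Hypothetical sample: two tabs after EDT, one tab after the word.
Local $sLine = "10:27 EDT" & @TAB & @TAB & "BALL" & @TAB & "more text"

; \h+ has to consume the two tabs for the capture group to reach BALL.
Local $aMatch = StringRegExp($sLine, 'EDT\h+(\w+)', 3) ; 3 = return all matches as an array

If @error Then
    MsgBox(0, "\h test", "No match")
Else
    MsgBox(0, "\h test", $aMatch[0])
EndIf
```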

mikell · July 27, 2014

You're right, for the txt file you provided here is the working script

#Include <Array.au3>

$sText = FileRead("test.txt")
$a = StringRegExp($sText, 'EDT\h+([$.A-Z,' & Chr(32) & ']+)', 3)
_ArrayDisplay($a)
$res = ""
For $i = 0 to UBound($a)-1
   If StringStripWS($a[$i], 3) = "" Then ContinueLoop   ; this excludes empty lines
   $tmp = StringRegExp($a[$i], '([^\s,]+)', 3)
   For $j = 0 to UBound($tmp)-1
       $res &= $tmp[$j] & @crlf
   Next
Next
Msgbox(0,"", $res)
FileWrite("results.txt", $res)

Edit

Rectification after remarks from Malkey

Edited July 28, 2014 by mikell
