What functions can help me pull strings out of a paragraph?


I have a web site that gives me something like the text below.  I want to find all the words that come after "EDT" (the first one is "BLAH") and then skip to the next EDT string.  On occasion, there are multiple words I need, separated by a comma, like the second group of words (RandomBlah4 and Blah1).  Before the word I need there is always EDT and a space, and after the word there is a space unless there is a comma for a second word.  I need a script to pull these words out of a large amount of text and then put them into a text file, the clipboard, whatever, so I can put them into individual Excel columns.

Sample Text

"12:39 EDT        BLAH    This blah is something I need with more random text.

Random text that goes on and on

12:35 EDT        RandomBlah4, Blah1    I also need RandomBlah4 and Blah1

Randtom text I don't need"

My question is: what is the best way to do this?  I am not a newb to AutoIt but I'm not a master of it either.  Before I spend a day throwing darts in the dark, I figured I would ask what the best way to go about this is.  Should I use StringReplace, or will that be too difficult with something that can be 10+ pages long?  Should I put the text in a Word doc and use _Word_DocFindReplace?  How about using StringRegExpReplace, searching for anything with a space and characters that end in EDT, and removing them from the document?

Thanks for any help you can provide.

Link to comment
Share on other sites

What is your definition of "word" in this precise context?

How are multiple words separated when there are several?

How to determine the end of the sequence of words that you want extracted?

Unless you give definitive, precise answers to those 3 questions, I'm afraid your quest is going to be a moving target. Regular expressions are very powerful, but they are in fact programs, which need a precise specification to provide the expected result reliably.

This wonderful site allows debugging and testing regular expressions (many flavors available). An absolute must have in your bookmarks.
Another excellent RegExp tutorial. Don't forget downloading your copy of up-to-date pcretest.exe and pcregrep.exe here
RegExp tutorial: enough to get started
PCRE v8.33 regexp documentation latest available release and currently implemented in AutoIt beta.

SQLitespeed is another feature-rich premier SQLite manager (includes import/export). Well worth a try.
SQLite Expert (freeware Personal Edition or payware Pro version) is a very useful SQLite database manager.
An excellent eBook covering almost every aspect of SQLite3: a must-read for anyone doing serious work.
SQL tutorial (covers "generic" SQL, but most of it applies to SQLite as well)
A work-in-progress SQLite3 tutorial. Don't miss other LxyzTHW pages!
SQLite official website with full documentation (may be newer than the SQLite library that comes standard with AutoIt)

Link to comment
Share on other sites

The word would be a group of letters in between EDT and the randomtext, minus all the spaces.  For example:

EDT    Hotdog     randomtext

EDT    Football   randomtext

EDT    Fall,Summer   randomtext

EDT    Spring,Winter,Pool,Class   randomtext

EDT    Ball  randomtext

The words here are: Hotdog, Football, Fall, Summer, Spring, Winter, Pool, Class, and Ball

The pattern is always EDT with a bunch of spaces, the word(s), then more spaces.

Multiple words are always separated by a ",".  The above has two examples of multiple words.

Essentially my program will look for these two examples:

"EDT    Hotdog     blahblahblahblbh"

"EDT    BALL,SUN,FALL blahblahblahblbh"

And it would give me back the text

Hotdog

BALL

SUN

FALL

Link to comment
Share on other sites

Try this:

$sText = "EDT    Hotdog     randomtext" & @CRLF & _
"EDT    Football   randomtext" & @CRLF & _
"EDT    Fall,Summer   randomtext" & @CRLF & _
"EDT    Spring,Winter,Pool,Class   randomtext" & @CRLF & _
"EDT    Ball  randomtext"

MsgBox(0, "RegEx Test", StringRegExpReplace($sText, "EDT\h*(.+)\h+.+", "$1"))

Br,

UEZ

Edited by UEZ

Please don't send me any personal message and ask for support! I will not reply!

Selection of finest graphical examples at Codepen.io

The own fart smells best!
Her 'sikim hıyar' diyene bir avuç tuz alıp koşma!
¯\_(ツ)_/¯  ٩(●̮̮̃•̃)۶ ٩(-̮̮̃-̃)۶ૐ

Link to comment
Share on other sites

I'm struggling a little bit trying to modify your code to get it right.  I put your code into AutoIt and it spit out the correct words (although it didn't separate "Fall,Summer", but no biggie).  When I tried to put the example into a text file I ran into problems.  All it gave back was hotdog.

I made a new example that is a lot more realistic, because this feels like it's being a bit tricky.  I put the example below into test.txt and this is my code.

$sText = FileReadLine("test.txt", 1)
MsgBox(0, "RegEx Test", StringRegExpReplace($sText, "EDT\h*(.+)\h+.+", "$1"))

************************EXAMPLE*******************************
 10:27 EDT        BALL    
shakespeare is awsome     The Internets are dope.  Teachers aren't cool I'm going to skip class
10:19 EDT        FALL, SPRING    cat videos are what i watch all day
pewdewpie is the best, off to watch the biebs
************************EXAMPLE*******************************

That example comes back with "10:27 BALL".  I want it to come back with

Ball

FALL

SPRING

Maybe if I understand your code for StringRegExpReplace I will be able to tweak it more myself.  I read the function's help page but some of the stuff you did didn't seem to be listed there.  I don't get the \h, and I assume the * means wild card.  I'm confused by the +.+ too.  Is that looking for a period?

Thank you again for any help. 
 

Link to comment
Share on other sites

#include <array.au3>

 $sText = "10:27 EDT        BALL    " & @crlf & _
 "shakespeare is awsome     The Internets are dope.  Teachers aren't cool I'm going to skip class " & @crlf & _
 "10:19 EDT        FALL, SPRING    cat videos are what i watch all day " & @crlf & _
 "pewdewpie is the best, off to watch the biebs "

MsgBox(0, "Solution of UEZ", StringRegExpReplace($sText, "EDT\h*(.+)\h+.+", "$1"))

$a1 = StringSplit(StringReplace(StringStripWS(StringReplace($sText,@TAB," "), 4), ", ", ",,"), "EDT ", 3)
_ArrayDelete($a1,0) ; first entry is before EDT
$words=""
for $i=0 to UBound($a1)-1
    $a2=StringSplit($a1[$i]," ")
    $words &= StringReplace($a2[1],","," ")& " "
Next
    $a2=StringSplit(StringStripWS($words,7)," ",2)
_ArrayDisplay($a2,"Solution of EXIT")

Edited by Exit

App: Au3toCmd              UDF: _SingleScript()                             

Link to comment
Share on other sites

The import thing is in such cases to provide a real text rather than an example to be able to find a solution.

Let me try again...

 

Br,

UEZ


Link to comment
Share on other sites

The precision in requirements is really the main thing

So this will work with your last example.... until you add different requirements

#Include <Array.au3>

$sText = FileRead("test.txt")
$a = StringRegExp($sText, 'EDT\h+(.+?)\h{2,}', 3)
_ArrayDisplay($a)
$res = ""
For $i = 0 to UBound($a)-1
   $tmp = StringRegExp($a[$i], '([^\s,]+)', 3)
   For $j = 0 to UBound($tmp)-1
       $res &= $tmp[$j] & @crlf
   Next
Next
Msgbox(0,"", $res)
Link to comment
Share on other sites

I really apologize for the bad first examples.  I was trying to be succinct and ended up screwing it up badly.  I really appreciate all the help I've been given.  I know how frustrating it can be to try to help someone who has a moving target, and this was definitely not my intention.

Exit, you nailed it!  Thanks.  Last question (I pray!): is there a way to remove the row numbers and just keep the column where the words are listed?  I'm asking because right now when I select everything it spits out

[0]|BALL
[1]|FALL
[2]|SPRING

Thanks again everyone for the help. 
 

Edited by handofthrawn
Link to comment
Share on other sites

Look at _ArrayDisplay in the help file. There is a flag where you can set display options.
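
If the end goal is only to paste the words into Excel, another option is to skip _ArrayDisplay and join the array into plain text instead. A minimal sketch, assuming the extracted words are already in a 1D array; the variable names and the sample values below are made up for illustration:

```autoit
#include <Array.au3>

; Hypothetical sample data standing in for the extracted words.
Local $aWords[3] = ["BALL", "FALL", "SPRING"]

; Join the elements with line breaks. For a 1D array, the second
; parameter of _ArrayToString is the delimiter placed between elements.
Local $sWords = _ArrayToString($aWords, @CRLF)

; Put the plain list on the clipboard, one word per line and no row numbers.
ClipPut($sWords)
MsgBox(0, "Words only", $sWords)
```

Pasting that clipboard text into a cell should fill one column, one word per row.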

Link to comment
Share on other sites

And another approach.

#include <Array.au3> ; For display purposes only.

Local $sText = "EDT    Hotdog     randomtext" & @CRLF & _
        "EDT    Football   randomtext" & @CRLF & _
        "EDT    Fall,Summer   randomtext" & @CRLF & _
        "EDT    Spring,Winter,Pool,Class   randomtext" & @CRLF & _
        "10:27 EDT        BALL    " & @CRLF & _
        "shakespeare is awsome     The Internets are dope.  Teachers aren't cool I'm going to skip class " & @CRLF & _
        "10:19 EDT        FALL, SPRING    cat videos are what i watch all day " & @CRLF & _
        "pewdewpie is the best, off to watch the biebs "

$sText = FileRead("test.txt") ; Overrides the inline sample above; comment this line out to test with the sample text.

Local $sResults = StringRegExpReplace($sText, "(?m)(^.*EDT\h\h[^A-Z]*\h.*$)|(^.*EDT\h{2,})|(^.*(?!EDT))|((\h{2,}|\t).*$)", "")
; The above RegExp pattern erases the following:-
; "(^.*EDT\h\h[^A-Z]*\h.*$)" - The entire line that has no upper case characters between "EDT@Tab@Tab" and "@Tab"
; "(^.*EDT\h{2,})" -           The beginning of each line, up to and including "EDT" and the two or more horizontal whitespace characters after it.
; "(^.*(?!EDT))" -             The entire line that does not have the characters, "EDT" present.
; "((\h{2,}|\t).*$)" -         The end of each line, from and including either two horizontal whitespace characters or one tab character.

$sResults = StringStripWS(StringRegExpReplace($sResults, "\h*,\h*", @CRLF), 6) ; $STR_STRIPTRAILING (2) + $STR_STRIPSPACES (4) = 6
;  "\h*,\h*" - Replace all commas with @CRLF. Commas may or may not have any number of spaces on either side.

MsgBox(0, "String - Malkey's Solution", $sResults)

;Or

Local $a2 = StringRegExp($sResults, "\V+", 3)
_ArrayDisplay($a2, "Array - Malkey's Solution ")

Edit: Having the test file of post#17, I was able to refine the main RegExp pattern.

Edited by Malkey
Link to comment
Share on other sites

Personally I like the simple approach, and would just read line by line and StringSplit on EDT and then strip the whitespace.

To get the lines, you can just do a full read into memory (a variable) and do lines as splits on the carriage returns.
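
A rough sketch of that line-by-line idea, in case it helps. It assumes every wanted line contains an upper-case "EDT" followed by whitespace, the word(s), then more whitespace, as in the samples earlier in the thread. The inline sample text and the variable names are only for illustration; in practice $sText would come from FileRead.

```autoit
; Illustrative sample; replace with FileRead("test.txt") for real data.
Local $sText = "10:27 EDT" & @TAB & @TAB & "BALL" & @TAB & @CRLF & _
        "shakespeare is awsome" & @CRLF & _
        "10:19 EDT" & @TAB & @TAB & "FALL, SPRING" & @TAB & "cat videos"

Local $sResult = ""
; Remove carriage returns so that a plain @LF split handles CRLF files.
Local $aLines = StringSplit(StringStripCR($sText), @LF)
For $i = 1 To $aLines[0]
    ; Flag 1 = treat "EDT" as one whole delimiter, not three separate characters.
    Local $aParts = StringSplit($aLines[$i], "EDT", 1)
    If $aParts[0] < 2 Then ContinueLoop ; no "EDT" on this line
    ; Turn ", " into "," and trim leading/trailing whitespace (flag 3 = 1 + 2).
    Local $sRest = StringStripWS(StringReplace($aParts[2], ", ", ","), 3)
    ; The wanted words are the first token before a space or a tab.
    Local $aTokens = StringSplit($sRest, " " & @TAB)
    ; One word per line in the result.
    $sResult &= StringReplace($aTokens[1], ",", @CRLF) & @CRLF
Next
MsgBox(0, "Line-by-line sketch", $sResult)
```

One caveat with this approach: any line where "EDT" happens to appear inside ordinary text would also be picked up, so the regular expression versions are stricter.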

Make sure brain is in gear before opening mouth!
Remember, what is not said, can be just as important as what is said.


Link to comment
Share on other sites

I see the error of my newb ways.  I am trying to find this import button but I can't find it.  I just tried to upload a file but I screwed that up too.  I tried to upload it to Misc and then I realized that Misc download section was for cool scripts and not a place for people to upload their files for sharing.

The import thing is in such cases to provide a real text rather than an example to be able to find a solution.

 

I am messing around with StringRegExpReplace a bit more so I can understand what you guys are writing.  For the longest time I was googling and getting nowhere.  The settings were listed under StringRegExp all along :(

Link to comment
Share on other sites

Right. Duplicating all this would be a waste. Just checking: you're right to imply that there should be a clearly visible reference to the full help discussion in StringRegExp. I wrote it but neglected to reference it boldly in the companion function.

Link to comment
Share on other sites

I attached my real life example so I no longer screw this up. 

For the past 3 days I've tried to take all the responses and learn as much as I can but I'm hitting some roadblocks.  If anyone has a moment to correct my responses on any of these I would greatly appreciate it.  I hate asking for help for this one problem so I'm doing my best to learn as much as possible so I can fix future problems myself and even help others (got to give back the love).  

UEZ's solution: StringRegExpReplace($sText, "EDT\h*(.+)\h+.+", "$1")
1.) "EDT" - Look for EDT
2.) "\h*" - Look for any amount of whitespace until step 3
3.) "(.+)" - Capture any amount of characters
4.) "\h" - Any amount of whitespace
Store this all in $1.

Exit solution: Same stringregexpreplace as UEZ but puts it in an array.  
$a1= StringSplit(StringReplace(StringStripWS($stext,4),", ",",,"),"EDT ",3)
_ArrayDelete($a1,0) ; first entry is before EDT
$words=""
for $i=0 to UBound($a1)-1
    $a2=StringSplit($a1[$i]," ")
    $words &= StringReplace($a2[1],","," ")& " "
Next
    $a2=StringSplit(StringStripWS($words,7)," ",2)
_ArrayDisplay($a2,"Solution of EXIT")

I'm having trouble reading how this works.  Is it saying to strip $stext of whitespace and then replace every ", " with ",," before splitting the string?  After that I get lost; I don't understand the "EDT ", 3 portion at all.  I half understand the for loop.  It loops over as many entries as there are in $a1.  Is the $a2=StringSplit line adding a space between the words?  And is $words taking everything with a comma and replacing it with a " " and then adding another " " to it with the & " "?

Finally, is the line after the Next stripping the whitespace out of $words?  I don't get the ,7 part because the function only has 1, 2, 4, and 8 for flags.  I also don't get why StringSplit is using the 2 option to disable the return count.
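
On the flag questions, a small sketch that may help. In both StringStripWS and StringSplit the flag values can be added together, so 7 and 3 are sums rather than separate options. The sample strings below are made up for illustration.

```autoit
; StringStripWS flags: 1 = leading, 2 = trailing, 4 = repeated spaces between words, 8 = all.
; 7 = 1 + 2 + 4, so trim both ends and collapse the runs in between.
Local $sClean = StringStripWS("   BALL    FALL   SPRING  ", 7)
MsgBox(0, "StringStripWS flag 7", "[" & $sClean & "]") ; shows [BALL FALL SPRING]

; StringSplit flags: 1 = use the whole delimiter string ("EDT ") as one separator,
; 2 = do not store the element count in element [0]. 3 = 1 + 2.
Local $aParts = StringSplit("10:27 EDT BALL x 10:19 EDT FALL,,SPRING y", "EDT ", 3)
; $aParts[0] is "10:27 " (the text before the first EDT, hence the _ArrayDelete),
; $aParts[1] is "BALL x 10:19 " and $aParts[2] is "FALL,,SPRING y".

; Inside the loop, StringSplit with the default flag keeps the count in [0],
; so [1] is the first space-delimited token of each piece.
Local $aFirst = StringSplit($aParts[2], " ")
MsgBox(0, "First token", $aFirst[1]) ; shows FALL,,SPRING
```

Without flag 2 the first element would hold the count, and the final _ArrayDisplay would show that number as an extra row, which is probably why the last StringSplit in the posted solution disables it.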

Mikell used this: $a = StringRegExp($sText, 'EDT\h+(.+?)\h{2,}', 3)
1.) This looks for EDT and white space after it
2.) (.+?) Does this capture all characters until the \h?
3.) \h{2,} looks at all the whitespace?
4.) 3 flag says to do this 3 times?

Thanks again to everyone who has helped and also to anyone who can help.
 

test.txt

Link to comment
Share on other sites

So here is my contribution  :)

$a = StringRegExp($sText, 'EDT\h+(.+?)\h{2,}', 3)

1.) EDT\h+ : This looks for EDT and one white space after it     <= \h+  means "one or more horizontal white space(s)"
2.) (.+?) Does this capture all characters until the \h?     <= yes it does (one or more characters, lazy)
3.) \h{2,} looks at all the whitespace?    <= it looks for 2 or more white spaces
4.) 3 flag says to do this 3 times?     <= no, it says: return the matches as an array

 

$tmp = StringRegExp($a[$i], '([^\s,]+)', 3)

This one means: match one or more characters which are not a \s (whitespace) or a comma, and return the results as an array

:)
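
To make the "lazy" part of `(.+?)` concrete, here is a small comparison sketch. The sample line is invented for illustration (two tabs after the words, and a double space later in the line), and the variable names are assumptions as well.

```autoit
; Invented sample line: tabs around the words, plus a double space further along.
Local $sLine = "10:19 EDT" & @TAB & @TAB & "FALL, SPRING" & @TAB & @TAB & "cat videos  are fun"

; Lazy (.+?): stops at the first run of two or more horizontal whitespace characters.
Local $aLazy = StringRegExp($sLine, 'EDT\h+(.+?)\h{2,}', 3)

; Greedy (.+): keeps going to the last such run on the line.
Local $aGreedy = StringRegExp($sLine, 'EDT\h+(.+)\h{2,}', 3)

MsgBox(0, "Lazy vs greedy", "Lazy:   [" & $aLazy[0] & "]" & @CRLF & _
        "Greedy: [" & $aGreedy[0] & "]")
```

The lazy version should capture only the words, while the greedy one also swallows part of the trailing text, which is why the `?` matters here.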

Edited by mikell
Link to comment
Share on other sites

Thanks Mikell.  A follow-up question if you are still around.  I noticed in my test.txt example that it's not spacebar whitespace but specifically two tabs of whitespace between EDT and the WORD(s).  After the WORD(s), it's another tab of whitespace.  Will \h+ treat tabs of whitespace the same as spacebar whitespace?  No biggie if you aren't around or don't know, I'm sure some follow-up testing will give me the answer.
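
For anyone wanting to run that follow-up test, a quick sketch along these lines should do it. In PCRE, \h matches any horizontal whitespace character, which covers both the space and the tab. The two sample strings and the `[A-Z]+` capture are assumptions made only for this check.

```autoit
; Same word, once surrounded by tabs and once by spaces.
Local $sTabs = "EDT" & @TAB & @TAB & "BALL" & @TAB & "text"
Local $sSpaces = "EDT    BALL  text"

; \h+ should accept either kind of horizontal whitespace after EDT.
Local $aTabs = StringRegExp($sTabs, 'EDT\h+([A-Z]+)', 3)
Local $aSpaces = StringRegExp($sSpaces, 'EDT\h+([A-Z]+)', 3)

MsgBox(0, "\h with tabs and spaces", $aTabs[0] & " / " & $aSpaces[0]) ; shows BALL / BALL
```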

Link to comment
Share on other sites

You're right, for the txt file you provided here is the working script

#Include <Array.au3>

$sText = FileRead("test.txt")
$a = StringRegExp($sText, 'EDT\h+([$.A-Z,' & Chr(32) & ']+)', 3)
_ArrayDisplay($a)
$res = ""
For $i = 0 to UBound($a)-1
   If StringStripWS($a[$i], 3) = "" Then ContinueLoop   ; this excludes empty lines
   $tmp = StringRegExp($a[$i], '([^\s,]+)', 3)
   For $j = 0 to UBound($tmp)-1
       $res &= $tmp[$j] & @crlf
   Next
Next
Msgbox(0,"", $res)
FileWrite("results.txt", $res)

:)

Edit

Rectification after remarks from Malkey

Edited by mikell
Link to comment
Share on other sites
