cag8f Posted February 1, 2014
I've managed to grab the text of a web page using _IEBodyReadText($oIE). When I display this text (e.g. using ConsoleWrite), it is broken up into multiple lines of text. How can I now grab a particular line of this text?

For example, if I want to grab the 5th line of text, would I tell AutoIt (using StringRegExp) to grab the text between the 5th and 6th newline characters? If so, what would be the syntax for this? I've been messing around with different newline wildcards in StringRegExp and can't seem to get it to work. If that's not the most efficient method of parsing a line of text, what is?

Thanks in advance.
l3ill Posted February 1, 2014 (edited)
cag8f, you could FileWrite the text to a file and get the lines you want with FileReadLine.
Bill
Edited February 1, 2014 by l3ill
cag8f (Author) Posted February 1, 2014
Right, I know that. I was just wondering if I could do it without FileWrite, using some string functions.
l3ill Posted February 1, 2014 (edited)
Aahsoo.... well, I think to manipulate the text line by line it will have to be in an array or file of some kind.
Edit: BTW you can set it up so that you do not see any of the file processing happening, if that was your concern...
Edited February 1, 2014 by l3ill
cag8f (Author) Posted February 1, 2014 (edited)
Not seeing the process isn't my concern. This is more a learning exercise on my part than anything. I'm able to parse this long string using text matches and such in StringRegExp(). Reading through the StringRegExp() help file, I see that at least in some cases StringRegExp() might be able to parse strings using newline matches. I was trying to extract text using such methods. Maybe I need to start with a simpler example.

Edit: To clarify, doesn't my long string have newline characters stored within? If so, shouldn't I be able to parse this string using the newline character info and StringRegExp() (and other string functions)?
Edited February 1, 2014 by cag8f
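One way to check whether a string really contains newline characters is to count them. The sketch below is only an illustration: the sample text and the variable names ($sText, $iCR, $iLF) are made up, and it relies on StringReplace reporting the number of replacements in @extended.

```autoit
; Made-up sample standing in for the text returned by _IEBodyReadText($oIE).
Local $sText = "Line 1" & @CRLF & "Line 2" & @LF & "Line 3"

; StringReplace stores the number of replacements in @extended, so
; replacing a character with itself is a cheap way to count it.
StringReplace($sText, @CR, @CR)
Local $iCR = @extended
StringReplace($sText, @LF, @LF)
Local $iLF = @extended

ConsoleWrite("CR count: " & $iCR & ", LF count: " & $iLF & @LF)
```

If both counts are equal, the text most likely uses @CRLF line endings throughout; if only the LF count is non-zero, patterns that look for \r\n specifically will not find any line break.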
guestscripter (Solution) Posted February 1, 2014 (edited)
Here's a snippet from a working function I have that uses StringRegExp on a string with several lines:

    #Region Qualify
    If Not StringRegExp($sSource, '\*\*\*.*\*\*\*\r\n.*\r\n\d+ [A-Z]*/EINH\.: \d*\r\n(?>INCENTIVE\(I\): 004=003;\r\n)*(?>\d+ NEU [A-Z, -]+ (?>M|F) \d+ \d+ \d{2}.\d{2}-\d{2}.\d{2} .*?\r\n)+', 0) Then Return False
    #EndRegion Qualify

Notice the \r\n, which is Return + Linefeed, i.e. CR+LF, i.e. @CRLF. And to look for a variable number of lines in something, I've used a "non-capturing group" like (simplified example):

    .*(?>.*\r\n)*

...so try this. If, say, your string is:

    stuff in line 1
    stuff in line 2
    line 3 stuff
    more stuff (line 4)
    fifth line
    WE HAVE A NEW LINE

the pattern (with $flag = 3)

    .+(?>\r\n)*

would return an array with:

    0 => stuff in line 1
    1 => stuff in line 2
    2 => line 3 stuff
    3 => more stuff (line 4)
    4 => fifth line
    5 => WE HAVE A NEW LINE

P.S. I very firmly suggest, if you don't yet, the use of the "StringRegExpGUI" (found in the helpfile, and/or the AutoIt Includes folder).
Edited February 1, 2014 by guestscripter
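For anyone following along, here is a minimal sketch of how that pattern could be used to pick out one particular line. The sample text and the names $sText, $iWanted and $aLines are made up for illustration; the only functions used are StringRegExp (with flag 3), UBound and StringStripWS.

```autoit
; Sample text standing in for the result of _IEBodyReadText($oIE) (assumption).
Local $sText = "stuff in line 1" & @CRLF & _
        "stuff in line 2" & @CRLF & _
        "line 3 stuff" & @CRLF & _
        "more stuff (line 4)" & @CRLF & _
        "fifth line" & @CRLF & _
        "WE HAVE A NEW LINE"

Local $iWanted = 5 ; 1-based number of the line to fetch

; Flag 3 returns a 0-based array holding every match of the pattern.
Local $aLines = StringRegExp($sText, ".+(?>\r\n)*", 3)

If @error Or $iWanted > UBound($aLines) Then
    ConsoleWrite("Line " & $iWanted & " not found" & @LF)
Else
    ; StringStripWS flag 2 drops trailing whitespace, in case the match
    ; still carries the line break it consumed.
    ConsoleWrite(StringStripWS($aLines[$iWanted - 1], 2) & @LF)
EndIf
```

Note that .+ needs at least one character, so blank lines produce no match and do not count toward the index. If the page text can contain empty lines, a split-based approach may be a better fit.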
Malkey Posted February 1, 2014
Normally, the newline character is at the end of a particular line. So, if you want the 5th line, you need to grab the text between the newline character at the end of the 4th line and the newline character at the end of the line you want, the 5th newline character.

Try this on the text you managed to grab from the web page.

    Local $sText = _
            "Line 1" & @CRLF & _
            "Line 2" & @CRLF & _
            "Line 3" & @CRLF & _
            "Line 4" & @CRLF & _
            "Line 5" & @LF & _
            "Line 6" & @LF & _
            "Line 7" & @LF & _
            "Line 8" & @LF & _
            "Line 9" & @LF & _
            "Line 10"
    Local $iLineNum = 5
    Local $sLine = StringRegExpReplace($sText, "(?s)(\V*\R){0," & ($iLineNum - 1) & "}(\V*)\R?.*$", "$2")
    ConsoleWrite($sLine & @LF)
cag8f (Author) Posted February 2, 2014
Thanks to both of you. Wayfarer, your explanation was pretty good and your code:

    .+(?>\r\n)

accomplished what I needed. But I'm still trying to fully understand why. A few questions on this syntax:

1. Would \R ("Matches any Unicode newline sequence by default") be suitable for use instead of \r\n? It yields the same results when I plug it in.

2. I'm having trouble understanding how a non-capturing group is working in this situation. I am definitely understanding something wrong, so bear with me. From the StringRegExp description: "Capturing groups remember the text they matched for use in backreferences and they populate the optionally returned array." Isn't this saying that every StringRegExp match is written into the returned array (the same array returned by setting flag=3)? If so, wouldn't a non-capturing group omit the writing-to-an-array part? If so, how is StringRegExp still returning an array of values if the non-capturing group isn't writing any of the values to the array?

I'd just like some clarification for my own education. Also thanks for the StringRegExp GUI tip--it really is handy.
jchd Posted February 2, 2014 (edited)
0. (?> ) is NOT a non-capturing group but a look-ahead assertion. Relates to point 2. (*)
1. Reading the end of the paragraph you cite from the help, one must consider the answer to be YES.
2. In the absence of explicit capturing group(s), StringRegExp options 1 to 4 return match(es).

Edit (*) One more indication I shouldn't post at 3:08 AM
Edited February 2, 2014 by jchd
cag8f (Author) Posted February 2, 2014
Thanks for the reply:

0. I referred to (?>...) as a non-capturing group because the StringRegExp help file refers to it as an "atomic non-capturing group."
1. I read that end paragraph but didn't quite understand the word 'unbreakable' in that context.
2. OK, so in my case do I have 2 conflicting parameters? I.e. flag=3 tells StringRegExp to save matches to an array, while the (?>) tells StringRegExp to not save matches to an array? This probably is not what is happening, so what am I misinterpreting?
kylomas Posted February 2, 2014
Another alternative...

    ;
    ; script requires 3.3.10+
    ;
    Local $bGenFile = True
    If $bGenFile Then _gen_file()

    Local $LineToReturn = 2345
    ConsoleWrite(_get_line(@ScriptDir & '\test10.txt', $LineToReturn) & @LF)

    Func _get_line($file, $line)
        Local $_str
        $_str = FileRead($file)
        $_str = StringRegExpReplace($_str, '\R', @LF) ; normalize EOL's
        Return StringSplit($_str, @LF, 2)[$line - 1] ; use function direct referencing to return line
    EndFunc   ;==>_get_line

    Func _gen_file()
        Local $str, $st = TimerInit()
        For $1 = 1 To 5000
            $str &= StringFormat('line %04i ', $1)
            For $2 = 1 To Random(10, 100, 1)
                $str &= Chr(Random(65, 90, 1))
            Next
            $str &= (Random(0, 1, 1)) ? @CRLF : @LF
        Next
        FileDelete(@ScriptDir & '\test10.txt')
        FileWrite(@ScriptDir & '\test10.txt', $str)
        ConsoleWrite(StringFormat('Time to gen file = %2.4f seconds', TimerDiff($st) / 1000) & @LF)
    EndFunc   ;==>_gen_file

kylomas
jchd Posted February 2, 2014
cag8f,

0. Sorry for misreading my own prose. As my edit above shows, I was half asleep and posted too fast.

1. "Unbreakable" in "(?>\r\n|\n|\r)" means that it will match \r\n as a whole if this sequence is encountered (that is, the engine will not match only \r in this case). Suppose your subject is "abc"; then the pattern "(?>ab|a)b" will fail: "ab" in the subject will match "ab" in the atomic group and (thanks to atomic grouping) the engine will not backtrack, so the last "b" in the pattern is left facing "c" in the subject, which does not match. I chose this wording for compactness: again, the succinct help on StringRegExp is no substitute for the detailed reference PCRE documentation.

2. Non-capturing groups are not saved, unless they are in the middle of a captured group. In your case .* will match the line content and the atomic non-capturing match \r\n will not make it into the output array.
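The "abc" example can be tried directly. The following is a small sketch using the subject and the atomic pattern from the post above; the comparison pattern with an ordinary non-capturing group (?:...) and the variable names are additions for illustration only.

```autoit
; Subject and atomic pattern taken from the explanation above.
Local $sSubject = "abc"

; Atomic group: once "ab" has matched inside (?>...), the engine may not
; go back and try the alternative "a", so the trailing b has nothing to match.
Local $iAtomic = StringRegExp($sSubject, "(?>ab|a)b")

; Ordinary non-capturing group: the engine is allowed to backtrack to the
; alternative "a", after which the trailing b matches the "b" in the subject.
Local $iPlain = StringRegExp($sSubject, "(?:ab|a)b")

; With the default flag, StringRegExp returns 1 for a match and 0 for no match.
ConsoleWrite("atomic (?>ab|a)b : " & $iAtomic & @LF)
ConsoleWrite("plain  (?:ab|a)b : " & $iPlain & @LF)
```

The only difference between the two patterns is the group type, so any difference in the two results comes from the atomic group refusing to backtrack.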
cag8f (Author) Posted February 3, 2014 (edited)
OK, I think I'm getting there. So the newline wildcards, \r\n, will not be saved to the array, since they are within a non-captured group. But since the .+* is *not* inside a non-captured group, and flag=3, the rest of the line *is* saved to the array. Is this correct?

Edit: .+*, not .+
Edited February 3, 2014 by cag8f
jchd Posted February 3, 2014
Yep, you get it. In all cases, experimentation is always good for learning.
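In that spirit, here is a small experiment that compares what flag 3 returns with and without an explicit capturing group. It is a sketch only: the sample string and the names $aWhole and $aGroup are made up, and the second pattern, with parentheses added around .+, is a variation that does not appear in the thread.

```autoit
; Made-up three-line sample with @CRLF line endings.
Local $sText = "first" & @CRLF & "second" & @CRLF & "third"

; No capturing group: each array element is the text of a whole match.
Local $aWhole = StringRegExp($sText, ".+(?>\r\n)*", 3)

; Explicit capturing group around .+ : each element is only that group.
Local $aGroup = StringRegExp($sText, "(.+)(?>\r\n)*", 3)

; Print the length of every element so that invisible trailing
; characters (such as a line break) become apparent.
For $i = 0 To UBound($aWhole) - 1
    ConsoleWrite("whole[" & $i & "] len=" & StringLen($aWhole[$i]) & @LF)
Next
For $i = 0 To UBound($aGroup) - 1
    ConsoleWrite("group[" & $i & "] len=" & StringLen($aGroup[$i]) & @LF)
Next
```

If the lengths in the first loop come out larger than those in the second, the whole-match elements still carry the line break consumed by the atomic group, and wrapping the wanted part in a capturing group is the way to keep it out of the array.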