How to detect new lines?

cag8f · February 1, 2014

I've managed to grab the text of a web page using

_IEBodyReadText($oIE)

When I display this text (e.g. using ConsoleWrite), it is broken up into multiple lines of text. How can I now grab a particular line of this text? For example, if I want to grab the 5th line of text, would I tell AutoIt (using StringRegExp) to grab the text between the 5th and 6th newline characters? If so, what would be the syntax for this? I've been messing around with different newline wildcards in StringRegExp and can't seem to get it to work.

If that's not the most efficient method of parsing a line of text, what is?

Thanks in advance.

l3ill · February 1, 2014

cag8f,

you could FileWrite the text to a file and get the lines you want with FileReadLine.

Bill

Edited February 1, 2014 by l3ill

cag8f · February 1, 2014

Right, I know that. I was just wondering if I could do it without FileWrite, using some string functions.

l3ill · February 1, 2014

Aahsoo....

well, I think to manipulate the text line by line it will have to be in an array or file of some kind.

Edit: BTW you can set it up so that you do not see any of the file processing happening, if that was your concern...
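
For what it's worth, the text does not have to go through a file to end up in an array. Below is a minimal sketch, assuming the page text is already held in a string; the sample text and variable names are made up for illustration. StringStripCR removes the carriage returns so that every line ends in a bare @LF, and StringSplit then gives one array element per line.

```autoit
; Hypothetical sample text standing in for the result of _IEBodyReadText($oIE)
Local $sText = "first line" & @CRLF & "second line" & @LF & "third line"

; Drop every carriage return so @CRLF and @LF endings both become @LF
Local $sNormalized = StringStripCR($sText)

; Default flag: $aLines[0] holds the line count, the lines start at index 1
Local $aLines = StringSplit($sNormalized, @LF)

Local $iWanted = 2 ; 1-based number of the line to fetch
If $iWanted >= 1 And $iWanted <= $aLines[0] Then
    ConsoleWrite($aLines[$iWanted] & @LF)
Else
    ConsoleWrite("The text has only " & $aLines[0] & " lines" & @LF)
EndIf
```

Text that uses a lone carriage return as its line ending would not be split this way, so this sketch only covers @CRLF and @LF endings.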

Edited February 1, 2014 by l3ill

cag8f · February 1, 2014

Not seeing the process isn't my concern. This is more a learning exercise on my part than anything. I'm able to parse this long string using text matches and such in StringRegExp(). Reading through the StringRegExp() help file I see that at least in some cases, StringRegExp() might be able to parse strings using newline matches. I was trying to extract text using such methods. Maybe I need to start with a simpler example.

Edit: To clarify, doesn't my long string have newline characters stored within? If so, shouldn't I be able to parse this string using the newline character info and StringRegExp() (and other string functions)?

Edited February 1, 2014 by cag8f

guestscripter · February 1, 2014

Here's a snippet from a working function I have that uses StringRegExp on a string with several lines

#Region Qualify
    If Not StringRegExp($sSource, '\*\*\*.*\*\*\*\r\n.*\r\n\d+ [A-Z]*/EINH\.: \d*\r\n(?>INCENTIVE\(I\): 004=003;\r\n)*(?>\d+ NEU [A-Z, -]+ (?>M|F) \d+ \d+ \d{2}.\d{2}-\d{2}.\d{2} .*?\r\n)+', 0) Then Return False
#EndRegion Qualify

Notice the

\r\n

which is carriage return + line feed, i.e. CR+LF, i.e. @CRLF.

And to look for a variable number of lines in something, I've used a "non-capturing group" like (simplified example) .*(?>.*\r\n)*.

...so try, if say your string is:

stuff in line 1
stuff in line 2
line 3 stuff
more stuff (line 4)
fifth line
WE HAVE A NEW LINE

the pattern (with $flag = 3)

.+(?>\r\n)*

would return an array with:

0 => stuff in line 1

1 => stuff in line 2

2 => line 3 stuff

3 => more stuff (line 4)

4 => fifth line

5 => WE HAVE A NEW LINE

P.S. I very firmly suggest, if you don't already, using the "StringRegExpGUI" (found in the helpfile, and/or the AutoIt Includes folder)

Edited February 1, 2014 by guestscripter

Malkey · February 1, 2014

Normally, the newline character is at the end of a particular line. So, if you want the 5th line, you need to grab the text between the newline character at the end of the 4th line and the newline character at the end of the line you want, the 5th newline character.

Try this on the text you managed to grab from the web page.

Local $sText = _
        "Line 1" & @CRLF & _
        "Line 2" & @CRLF & _
        "Line 3" & @CRLF & _
        "Line 4" & @CRLF & _
        "Line 5" & @LF & _
        "Line 6" & @LF & _
        "Line 7" & @LF & _
        "Line 8" & @LF & _
        "Line 9" & @LF & _
        "Line 10"

Local $iLineNum = 5
Local $sLine = StringRegExpReplace($sText, "(?s)(\V*\R){0," & ($iLineNum - 1) & "}(\V*)\R?.*$", "$2")

ConsoleWrite($sLine & @LF)

cag8f · February 2, 2014

Thanks to both of you.

Wayfarer, your explanation was pretty good and your code:

.+(?>\r\n)

accomplished what I needed. But I'm still trying to fully understand why. A few questions on this syntax:

1. Would \R ("Matches any Unicode newline sequence by default") be suitable for use instead of \r\n? It yields the same results when I plug it in.

2. I'm having trouble understanding how a non-capturing group is working in this situation. I am definitely understanding something wrong, so bear with me. From the StringRegExp description, "Capturing groups remember the text they matched for use in backreferences and they populate the optionally returned array." Isn't this saying that every StringRegExp match is written into the returned array (the same array returned by setting flag=3)? If so, wouldn't a non-capturing group omit the writing-to-an-array part? If so, how is StringRegExp still returning an array of values if the non-capturing group isn't writing any of the values to the array? I'd just like some clarification for my own education.

Also thanks for the StringRegExp GUI tip--it really is handy.

jchd · February 2, 2014

~~0. (?> ) is NOT a non-capturing group but a look-ahead assertion. Relates to point 2.~~ (*)

1. Reading the end of the paragraph you cite from the help, one must consider the answer to be YES.

2. In absence of explicit capturing group(s), StringRegExp options 1 to 4 return match(es).

Edit

(*) One more indication I shouldn't post at 3:08 AM

Edited February 2, 2014 by jchd

cag8f · February 2, 2014

Thanks for the reply:

0. I referred to (?>...) as a non-capturing group because the StringRegExp help file refers to it as an "atomic non-capturing group."

1. I read that end paragraph but didn't quite understand the word 'unbreakable' in that context.

2. OK so in my case do I have 2 conflicting parameters? i.e. flag=3 tells StringRegExp to save matches to an array, while the (?>) tells StringRegExp to not save matches to an array? This probably is not what is happening, so what am I misinterpreting?

kylomas · February 2, 2014

Another alternative...

;
; script requires 3.3.10+
;

Local $bGenFile = true
If $bGenFile Then _gen_file()

Local $LineToReturn = 2345

ConsoleWrite(_get_line(@ScriptDir & '\test10.txt', $LineToReturn) & @LF)

Func _get_line($file, $line)

    Local $_str

    $_str = FileRead($file)
    $_str = StringRegExpReplace($_str, '\R', @LF)   ; normalize EOL's
    Return StringSplit($_str, @LF, 2)[$line - 1]    ; use function direct referencing to return line

EndFunc   ;==>_get_line

Func _gen_file()

    Local $str, $st = TimerInit()

    For $1 = 1 To 5000
        $str &= StringFormat('line %04i  ', $1)
        For $2 = 1 To Random(10, 100, 1)
            $str &= Chr(Random(65, 90, 1))
        Next
        $str &= (Random(0, 1, 1)) ? @CRLF : @LF
    Next

    FileDelete(@ScriptDir & '\test10.txt')
    FileWrite(@ScriptDir & '\test10.txt', $str)

    ConsoleWrite(StringFormat('Time to gen file = %2.4f seconds', TimerDiff($st) / 1000) & @LF)

EndFunc   ;==>_gen_file

kylomas

jchd · February 2, 2014

cag8f,

0. Sorry for misreading my own prose. As my edit above shows, I was half asleep and posted too fast.

1. unbreakable in "(?>\r\n|\n|\r)" means that it will match \r\n as a whole if this sequence is encountered (that is, the engine will not match only \r in this case). Suppose your subject is "abc"; then the pattern "(?>ab|a)b" will fail: "ab" in the subject will match "ab" in the atomic group and (thanks to atomic grouping) the engine will not backtrack, so that the last "b" in the pattern will not match "c" in the subject. I chose this wording for compactness: again, the succinct help on StringRegExp is no substitute for the detailed PCRE reference documentation.

2. non-capturing groups are not saved, unless they are in the middle of a captured group. In your case .* will match the line content and the atomic non-capturing match of \r\n will not make it into the output array.
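
Both points can be tried out in a few lines. This is only a sketch with made-up sample strings, not code from anyone in the thread. The first part repeats the "abc" example with an atomic group and with an ordinary non-capturing group; with the default flag StringRegExp reports 1 for a match and 0 for no match, so going by the explanation above the first call should print 0 and the second 1. The second part is a caveat worth checking on your own data: as far as I can tell, when a pattern has no capturing group at all, flag 3 hands back the whole match, so a line break consumed by the atomic group may still sit at the end of each element. A capturing group around the line content (here \V+, an alternative not used earlier in the thread) keeps it out explicitly.

```autoit
; 1. Atomic group: once (?>ab|a) has matched "ab" it will not back off to "a"
ConsoleWrite(StringRegExp("abc", "(?>ab|a)b") & @LF) ; atomic group
ConsoleWrite(StringRegExp("abc", "(?:ab|a)b") & @LF) ; ordinary group, free to backtrack

; 2. Whole match versus capturing group, both with flag 3
Local $sText = "Line 1" & @CRLF & "Line 2" & @CRLF & "Line 3"
Local $aWhole = StringRegExp($sText, ".+(?>\r\n)*", 3)  ; no capturing group in the pattern
Local $aCaptured = StringRegExp($sText, "(\V+)\R?", 3)  ; only the text inside ( ) is returned

; Compare the lengths to see whether a trailing line break came along
ConsoleWrite(StringLen($aWhole[0]) & " vs " & StringLen($aCaptured[0]) & @LF)
```

Note that \V+ needs at least one character, so empty lines are skipped and line numbers can shift if the text contains blank lines.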

cag8f · February 3, 2014

OK I think I'm getting there. So the newline wildcards, \r\n, will not be saved to the array, since they are within a non-captured group. But since the .+* is *not* inside a non-captured group, and flag=3, the rest of the line *is* saved to the array. Is this correct?

Edit: .+*, not .+

Edited February 3, 2014 by cag8f

jchd · February 3, 2014

Yep, you get it. In all cases, experimentation is always good for learning.
