How to detect new lines?


cag8f
Go to solution Solved by guestscripter,

I've managed to grab the text of a web page using 

_IEBodyReadText($oIE)

When I display this text (e.g. using ConsoleWrite), it is broken up into multiple lines of text.  How can I now grab a particular line of this text?  For example, if I want to grab the 5th line of text, would I tell AutoIt (using StringRegExp) to grab the text between the 5th and 6th newline characters?  If so, what would be the syntax for this?  I've been messing around with different newline wildcards in StringRegExp and can't seem to get it to work.

If that's not the most efficient method of parsing a line of text, what is?

Thanks in advance.


Aahsoo....

Well, I think that to manipulate the text line by line it will have to be in an array or a file of some kind.

Edit:  BTW, you can set it up so that you don't see any of the file processing happening, if that was your concern...
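
A minimal sketch of the array route, assuming the page text mixes @CRLF and @LF line endings. The function name _GetLineBySplit is made up for illustration; StringStripCR and StringSplit are standard AutoIt functions.

```autoit
; Hypothetical helper: return line $iLineNum (1-based) of $sText.
Func _GetLineBySplit($sText, $iLineNum)
    ; Drop every carriage return so that @LF is the only line separator,
    ; then split. With the default flag, element [0] holds the line count.
    Local $aLines = StringSplit(StringStripCR($sText), @LF)
    If $iLineNum < 1 Or $iLineNum > $aLines[0] Then Return SetError(1, 0, "")
    Return $aLines[$iLineNum]
EndFunc   ;==>_GetLineBySplit

Local $sSample = "Line 1" & @CRLF & "Line 2" & @LF & "Line 3"
ConsoleWrite(_GetLineBySplit($sSample, 2) & @LF)
```

No file is needed for this; the string returned by _IEBodyReadText could be passed in directly.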

Edited by l3ill

Not seeing the process isn't my concern.  This is more a learning exercise on my part than anything.  I'm able to parse this long string using text matches and such in StringRegExp().  Reading through the StringRegExp() help file, I see that at least in some cases StringRegExp() might be able to parse strings using newline matches.  I was trying to extract text using such methods.  Maybe I need to start with a simpler example.

Edit:  To clarify, doesn't my long string have newline characters stored within?  If so, shouldn't I be able to parse this string using the newline character info and StringRegExp() (and other string functions)?

Edited by cag8f

  • Solution

Here's a snippet from a working function I have that uses StringRegExp on a string with several lines:

#Region Qualify
    If Not StringRegExp($sSource, '\*\*\*.*\*\*\*\r\n.*\r\n\d+ [A-Z]*/EINH\.: \d*\r\n(?>INCENTIVE\(I\): 004=003;\r\n)*(?>\d+ NEU [A-Z, -]+ (?>M|F) \d+ \d+ \d{2}.\d{2}-\d{2}.\d{2} .*?\r\n)+', 0) Then Return False
#EndRegion Qualify

Notice the 

\r\n

which is carriage return + line feed (CR+LF), i.e. @CRLF in AutoIt.

And to look for a variable number of lines in something, I've used a "non-capturing group" like (simplified example) .*(?>.*\r\n)*.

...so try, if say your string is:

stuff in line 1
stuff in line 2
line 3 stuff
more stuff (line 4)
fifth line
WE HAVE A NEW LINE

the pattern (with $flag = 3)

.+(?>\r\n)*

would return an array with:

0 => stuff in line 1

1 => stuff in line 2

2 => line 3 stuff

3 => more stuff (line 4)

4 => fifth line

5 => WE HAVE A NEW LINE

P.S. I strongly suggest using "StringRegExpGUI", if you don't already (it can be found in the helpfile and/or the AutoIt Includes folder).
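
One thing worth checking for yourself with the pattern above: as far as I understand PCRE, when a pattern has no capturing group, flag 3 returns the whole of each match, and the whole match also covers what the (?>\r\n)* part consumed. So the array elements may still carry a trailing line break that ConsoleWrite or _ArrayDisplay will not make obvious. The sketch below compares string lengths to make that visible; the second pattern, (\V+)\R?, is my own variation rather than something from this thread, and it captures only the line content.

```autoit
Local $sText = "stuff in line 1" & @CRLF & "stuff in line 2" & @CRLF & "line 3 stuff"

; No capturing group: each element is the entire match, which also covers
; whatever the atomic group (?>\r\n)* consumed.
Local $aWhole = StringRegExp($sText, ".+(?>\r\n)*", 3)

; One capturing group: each element is only the text matched by (\V+);
; the line break matched by \R stays outside the group.
Local $aLines = StringRegExp($sText, "(\V+)\R?", 3)

For $i = 0 To UBound($aLines) - 1
    ConsoleWrite($i & ": [" & $aLines[$i] & "] whole-match length " & StringLen($aWhole[$i]) & _
            ", captured length " & StringLen($aLines[$i]) & @LF)
Next
```

If the two lengths differ for a line, the whole-match element contains the line break and could be cleaned up afterwards, or the capturing form could be used instead.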

Edited by guestscripter

Normally, the newline character is at the end of a particular line. So, if you want the 5th line, you need to grab the text between the newline character at the end of the 4th line and the newline character at the end of the line you want, i.e. the 5th newline character.

Try this on the text you managed to grab from the web page.

Local $sText = _
        "Line 1" & @CRLF & _
        "Line 2" & @CRLF & _
        "Line 3" & @CRLF & _
        "Line 4" & @CRLF & _
        "Line 5" & @LF & _
        "Line 6" & @LF & _
        "Line 7" & @LF & _
        "Line 8" & @LF & _
        "Line 9" & @LF & _
        "Line 10"

Local $iLineNum = 5
Local $sLine = StringRegExpReplace($sText, "(?s)(\V*\R){0," & ($iLineNum - 1) & "}(\V*)\R?.*$", "$2")

ConsoleWrite($sLine & @LF)
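
The same idea can also be written as a direct match instead of a replacement. This is only a sketch under my own assumptions: the function name _GetLineByRegExp is hypothetical, and it reports a missing line through @error rather than returning other text.

```autoit
; Hypothetical helper: return line $iLineNum (1-based) of $sText.
Func _GetLineByRegExp($sText, $iLineNum)
    ; \A anchors at the start of the string, (?:\V*\R){n} skips exactly
    ; n complete lines (content plus line break), and (\V*) captures the
    ; content of the next line without its line break.
    Local $sPattern = "\A(?:\V*\R){" & ($iLineNum - 1) & "}(\V*)"
    Local $aMatch = StringRegExp($sText, $sPattern, 1)
    If @error Then Return SetError(1, 0, "") ; fewer lines than requested
    Return $aMatch[0]
EndFunc   ;==>_GetLineByRegExp

Local $sDemo = "Line 1" & @CRLF & "Line 2" & @LF & "Line 3"
ConsoleWrite(_GetLineByRegExp($sDemo, 3) & @LF)
```

Because \R matches @CRLF as well as a lone @LF or @CR, mixed line endings like the ones in the sample text above should not need any special handling.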

 


Thanks to both of you.

Wayfarer, your explanation was pretty good and your code:

.+(?>\r\n)

accomplished what I needed.  But I'm still trying to fully understand why.  A few questions on this syntax:

1.  Would \R ("Matches any Unicode newline sequence by default") be suitable for use instead of \r\n?  It yields the same results when I plug it in.

2.  I'm having trouble understanding how a non-capturing group is working in this situation.  I am definitely understanding something wrong, so bear with me.  From the StringRegExp description, "Capturing groups remember the text they matched for use in backreferences and they populate the optionally returned array."  Isn't this saying that every StringRegExp match is written into the returned array (the same array returned by setting flag=3)?  If so, wouldn't a non-capturing group omit the writing-to-an-array part?  If so, how is StringRegExp still returning an array of values if the non-capturing group isn't writing any of the values to the array?  I'd just like some clarification for my own education.

Also thanks for the StringRegExp GUI tip--it really is handy.


0. (?>   ) is NOT a non-capturing group but a look-ahead assertion. Relates to point 2. (*)

1. Reading the end of the paragraph you cite from the help, one must consider the answer to be YES.

2. In absence of explicit capturing group(s), StringRegExp options 1 to 4 return match(es).

Edit

(*) One more indication I shouldn't post at 3:08 AM
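
To illustrate point 2, here is a small sketch with made-up sample data (the key/value string is not from this thread). It shows what flag 3 puts in the array with and without a capturing group.

```autoit
Local $sData = "key1=val1;key2=val2"

; No capturing group: the array holds each whole match.
Local $aWhole = StringRegExp($sData, "\w+=\w+", 3)

; One capturing group: the array holds only what the group matched.
Local $aGroup = StringRegExp($sData, "\w+=(\w+)", 3)

; (?:...) groups without capturing, so it adds nothing to the array;
; the result is the same as for the previous pattern.
Local $aMixed = StringRegExp($sData, "(?:\w+)=(\w+)", 3)

ConsoleWrite($aWhole[0] & " | " & $aGroup[0] & " | " & $aMixed[0] & @LF)
```

So flag 3 and a non-capturing group do not conflict: the flag decides that an array comes back, and the capturing groups (or their absence) decide what each element contains.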

Edited by jchd

This wonderful site allows debugging and testing regular expressions (many flavors available). An absolute must have in your bookmarks.
Another excellent RegExp tutorial. Don't forget downloading your copy of up-to-date pcretest.exe and pcregrep.exe here
RegExp tutorial: enough to get started
PCRE v8.33 regexp documentation latest available release and currently implemented in AutoIt beta.

SQLitespeed is another feature-rich premier SQLite manager (includes import/export). Well worth a try.
SQLite Expert (freeware Personal Edition or payware Pro version) is a very useful SQLite database manager.
An excellent eBook covering almost every aspect of SQLite3: a must-read for anyone doing serious work.
SQL tutorial (covers "generic" SQL, but most of it applies to SQLite as well)
A work-in-progress SQLite3 tutorial. Don't miss other LxyzTHW pages!
SQLite official website with full documentation (may be newer than the SQLite library that comes standard with AutoIt)


Thanks for the reply:

0.  I referred to (?>...) as a non-capturing group because the StringRegExp help file refers to it as an "atomic non-capturing group."

1.  I read that end paragraph but didn't quite understand the word 'unbreakable' in that context.

2.  OK so in my case do I have 2 conflicting parameters?  i.e. flag=3 tells StringRegExp to save matches to an array, while the (?>) tells StringRegExp to not save matches to an array?  This probably is not what is happening, so what am I misinterpreting?


Another alternative...

;
; script requires 3.3.10+
;

Local $bGenFile = true
If $bGenFile Then _gen_file()

Local $LineToReturn = 2345

ConsoleWrite(_get_line(@ScriptDir & '\test10.txt', $LineToReturn) & @LF)

Func _get_line($file, $line)

    Local $_str

    $_str = FileRead($file)
    $_str = StringRegExpReplace($_str, '\R', @LF)   ; normalize EOL's
    Return StringSplit($_str, @LF, 2)[$line - 1]    ; use function direct referencing to return line

EndFunc   ;==>_get_line

Func _gen_file()

    Local $str, $st = TimerInit()

    For $1 = 1 To 5000
        $str &= StringFormat('line %04i  ', $1)
        For $2 = 1 To Random(10, 100, 1)
            $str &= Chr(Random(65, 90, 1))
        Next
        $str &= (Random(0, 1, 1)) ? @CRLF : @LF
    Next

    FileDelete(@ScriptDir & '\test10.txt')
    FileWrite(@ScriptDir & '\test10.txt', $str)

    ConsoleWrite(StringFormat('Time to gen file = %2.4f seconds', TimerDiff($st) / 1000) & @LF)

EndFunc   ;==>_gen_file

kylomas


cag8f,

0. Sorry for misreading my own prose. As my edit above shows, I was half asleep and posted too fast.

1. "Unbreakable" in "(?>\r\n|\n|\r)" means that it will match \r\n as a whole if this sequence is encountered (that is, the engine will not match only \r in this case). Suppose your subject is "abc"; then the pattern "(?>ab|a)b" will fail: "ab" in the subject matches "ab" in the atomic group and, thanks to atomic grouping, the engine will not backtrack into the group, so the last "b" in the pattern is left facing "c" in the subject and cannot match. I chose this wording for compactness: again, the succinct help on StringRegExp is no substitute for the detailed PCRE reference documentation.

2. Non-capturing groups are not saved, unless they are in the middle of a capturing group. In your case .* will match the line content, and the atomic non-capturing group matching \r\n will not make it into the output array.
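
The "abc" case from point 1 can be tried directly. The sketch below uses the subject and atomic pattern given above; the comparison with an ordinary non-capturing group (?:...) is my addition.

```autoit
Local $sSubject = "abc"

; Atomic group: once "ab" has matched inside (?>...), the engine may not
; backtrack into the group to try the shorter alternative "a".
Local $iAtomic = StringRegExp($sSubject, "(?>ab|a)b")

; Ordinary non-capturing group: the engine backtracks, retries with "a",
; and the trailing "b" in the pattern then matches "b" in the subject.
Local $iPlain = StringRegExp($sSubject, "(?:ab|a)b")

; With the default flag 0, StringRegExp returns 1 for a match, 0 otherwise.
ConsoleWrite("atomic: " & $iAtomic & ", non-atomic: " & $iPlain & @LF)
```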


OK I think I'm getting there.  So the newline wildcards, \r\n, will not be saved to the array, since they are within a non-capturing group.  But since the .+* is *not* inside a non-capturing group, and flag=3, the rest of the line *is* saved to the array.  Is this correct?

Edit:  .+*, not .+

Edited by cag8f

Yep, you get it. In all cases, experimentation is always good for learning.
