[Resolved] Faster way to count lines in a text file


I'm occasionally dealing with some seriously large (900+ MB) text files. So far I have been using FileOpen and FileReadLine to get a count of the lines first, just so I can show some kind of progress for where in the file the "work portion" of my script is. Just looping through the file and incrementing a counter this way can take several minutes, and that is before I even start my search, data gathering, or whatever else the script does.

Has someone worked out a faster method to get a line count?

Edited by SpookMeister


I guess that won't work, or at least won't be any faster, for 900+ MB files, as the UDF contains this line, which reads the entire file into memory:

$sFileContent = StringStripWS(FileRead($hFile), 2)

Hmmm, and I fear there is no really superior method for counting line breaks, as they are just occurrences of the character @LF, i.e. Chr(10)...
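
Since line breaks are just @LF characters, one thing that might still be worth trying is counting them in fixed-size chunks, so the whole file never has to sit in memory at once. This is only a minimal sketch and has not been timed against a 900 MB file: the function name _CountLinesChunked and the 4 MB chunk size are made up for illustration, and it relies on StringReplace reporting the number of replacements in @extended.

```autoit
; Hypothetical helper (not part of any UDF): counts lines by counting @LF
; characters chunk by chunk instead of reading the whole file at once.
Func _CountLinesChunked($sPath, $iChunk = 4194304) ; 4 MB chunk size is an arbitrary assumption
    Local $hFile = FileOpen($sPath, 0) ; 0 = read mode
    If $hFile = -1 Then Return SetError(1, 0, 0)

    Local $iLines = 0, $sChunk, $iErr, $sLast = ""
    While 1
        $sChunk = FileRead($hFile, $iChunk)
        $iErr = @error ; save it now, the next function call resets @error
        If StringLen($sChunk) > 0 Then
            ; Replacing @LF with itself changes nothing, but @extended
            ; then holds the number of @LF characters in the chunk.
            StringReplace($sChunk, @LF, @LF, 0, 1)
            $iLines += @extended
            $sLast = StringRight($sChunk, 1)
        EndIf
        If $iErr Or StringLen($sChunk) = 0 Then ExitLoop ; end of file or read error
    WEnd
    FileClose($hFile)

    ; A last line without a trailing line break still counts as a line.
    If $sLast <> "" And $sLast <> @LF Then $iLines += 1
    Return $iLines
EndFunc   ;==>_CountLinesChunked
```

It would be called as `$iCount = _CountLinesChunked($Path)`. Whether it actually beats a FileReadLine loop depends on the file and the machine, so it needs measuring before relying on it.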

Cheers


I did a little creative thinking and came up with a workable solution for my needs.

Because I only need the line count for progress information, and in my case it does not need to be exact, I grabbed a sample of the file and averaged the number of characters per line in the sample. Then, by multiplying the current line number by that average, I can gauge where I am in the file by comparing the result to the file size.

Here is an example of searching a large file for a string of text.

#include <array.au3>
#include <string.au3>
HotKeySet("{ESC}", "Terminate")
Dim $a_Results[1]
$a_Results[0] = 0

; Select file
$Path = FileOpenDialog("Select the file to search", @WorkingDir & "\", "All Files (*.*)")
If @error Then
    MsgBox(0, "Error", "Failed to locate file")
    Exit
EndIf

; Request search string from the user
$s_SearchString = InputBox("Search String", "Enter the string that you want to search for:")
If @error Then
    MsgBox(0, "Error", "No search string entered")
    Exit
EndIf

; Find the Average number of Characters (Bytes) Per Line from a sample of the file
$i_FileSize = FileGetSize($Path)
$h_file = FileOpen($Path, 0)
If $h_file = -1 Then
    MsgBox(0, "Error", "Unable to open file.")
    Exit
EndIf
$i_bytes = 0
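$i_sampled = 0 ; number of lines actually read in the sample
For $i = 1 To 50000 ; the 50k is arbitrary, but I found it took less than half a second to complete
    $line = FileReadLine($h_file)
    If @error = -1 Then ExitLoop
    $i_bytes += StringLen($line) ; note: line terminators are not counted
    $i_sampled += 1
Next
FileClose($h_file)
If $i_sampled = 0 Then
    MsgBox(0, "Error", "File is empty.")
    Exit
EndIf
$n_ABPL = $i_bytes / $i_sampled ; Average Bytes Per Line ($i itself is one past the lines read)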

; Re-open the file and begin search
$h_file = FileOpen($Path, 0)
$i_LineCount = 0
$i_sub = 0
While 1
    $line = FileReadLine($h_file)
    If @error = -1 Then ExitLoop
    $i_LineCount += 1
    $i_sub += 1
    Select
        Case StringInStr($line, $s_SearchString) ; if string is found add it to the array
            _ArrayAdd($a_Results, $line)
            $a_Results[0] += 1
        Case $i_sub >= 5000 ; every 5k lines update the tooltip
            $n_Estimate = Int($i_LineCount * $n_ABPL)
            $prog = _StringAddThousandsSep($n_Estimate) & " / " & _StringAddThousandsSep($i_FileSize)
            $msg = "Searching: " & @LF & _
                    $Path & @LF & @LF & _
                    "For the string:" & @LF & _
                    $s_SearchString & @LF & @LF & _
                    "Estimated Progress: " & $prog
            ToolTip($msg)
            $i_sub = 0
    EndSelect
WEnd
FileClose($h_file)
ToolTip("")
_ArrayDisplay($a_Results)

Func Terminate()
    MsgBox(0, "Abort", "Search aborted by user")
    Exit
EndFunc   ;==>Terminate
Edited by SpookMeister


  • 11 months later...

I did a little creative thinking and came up with a workable solution for my needs.

<...>

Well, in such cases (huge text files, various kinds of bulk text/data processing) you could consider using specialized utilities, such as the word count (wc) utility and the others from the gnuwin32 coreutils.

(I don't think portability is an issue here, and AutoIt is not efficient, if suitable at all, for such cases anyway.)

#include <Timers.au3> ; for _Timer_Init() and _Timer_Diff()

Local $starttime = _Timer_Init()
; wc and cut come from the gnuwin32 coreutils and must be on the PATH
RunWait(@ComSpec & " /c wc -l C:\tmp\test.txt  | cut -d ' ' -f 1 > C:\tmp\line_cnt.txt")
Local $fh = FileOpen("C:\tmp\line_cnt.txt", 0)
ConsoleWrite(@CRLF & "File size (MB): " & FileGetSize("C:\tmp\test.txt")/1048576)
ConsoleWrite(@CRLF & "Line count: " & FileReadLine($fh))
ConsoleWrite(@CRLF & "Count took (ms): " & _Timer_Diff($starttime) & @CRLF)
FileClose($fh)

Results:

File size (MB): 888.109111785889

Line count: 71214500

Count took (ms): 107967.818535596

Note the line count, and the fact that the test was run on an old laptop with a slow disk.

Overall, the speed increase could be up to 100x if the file has around a few million lines, and up to 10x in "extreme" cases like this test.
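
The temporary file and the cut step could probably be avoided by reading wc's output directly from the child process. A rough sketch, assuming wc.exe (e.g. from the gnuwin32 coreutils) is on the PATH and prints the count as the first number of its output; the variable names are only illustrative:

```autoit
#include <Constants.au3> ; provides $STDOUT_CHILD

Local $sFile = "C:\tmp\test.txt"

; Start wc with its stdout redirected to a pipe this script can read.
Local $iPID = Run('wc -l "' & $sFile & '"', "", @SW_HIDE, $STDOUT_CHILD)
If $iPID = 0 Then Exit MsgBox(0, "Error", "Could not start wc")

; wc prints a single short line, so waiting for it to exit before reading is fine.
ProcessWaitClose($iPID)
Local $sOut = StdoutRead($iPID)

; The output looks like "<count> <file name>", so take the first run of digits.
Local $aMatch = StringRegExp($sOut, "\d+", 1)
If @error Then
    ConsoleWrite("Unexpected wc output: " & $sOut & @CRLF)
Else
    ConsoleWrite("Line count: " & Number($aMatch[0]) & @CRLF)
EndIf
```

This only changes how the result is collected; the counting itself is still done by wc, so the timings above should not change much.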

Edited by wiela

I did a little creative thinking and came up with a workable solution for my needs.

<...>

That could have been done much more easily using a regular expression.

I'm just writing this on the fly, without testing even the SRE:

#include<array.au3> ;; For _ArrayDisplay() only
$Path = FileOpenDialog("Select the file to search", @WorkingDir & "\", "All Files (*.*)")
If @error Then
    MsgBox(0, "Error", "Failed to locate file")
    Exit
EndIf
$s_SearchString = InputBox("Search String", "Enter the string that you want to search for:")
If @error Then
    MsgBox(0, "Error", "No search string entered")
    Exit
EndIf
$aFound = _BuildArray($s_SearchString)
If NOT @Error Then
    _ArrayDisplay($aFound)
Else
    MsgBox(0, "Ooops", "Something is amiss")
EndIf

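Func _BuildArray($sFind, $iCase = 0) ;; If not 0 then the match is case-sensitive
    Local $sCase = "(?i)"
    If $iCase Then $sCase = ""
    ; \Q...\E makes the search text literal, so regex metacharacters typed by the user are not interpreted
    ; Note: FileRead() loads the whole file ($Path is the global set above) into memory
    Local $aResults = StringRegExp(FileRead($Path), $sCase & "(?m:^).*\Q" & $sFind & "\E.*(?:\v|$)+", 3)
    If @error Then Return SetError(1)
    Return $aResults
EndFunc   ;==>_BuildArray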

George
