Modified _FileCountLines()


Spiff59
I do understand that the behavior of the current routine is (vaguely) documented.

The docs probably ought to state that it removes all CR's, all LF's, all combinations of carriage control characters from the end of the file.

I agree with that, and if no changes end up happening with your ticket I'll put in a request to guinness to make a correction in the doc, because it is not just the final linefeed, it's all trailing LF's, CR's, and NULL characters.

My argument is that the current routine is unconventional, an oddball. It is of little use in its present state.

The result becomes less valuable when there is a built-in hidden edit removing an unknown amount of data from the end of the file. I don't think the function should treat a @CRLF as some sort of unwanted meaningless line terminator, when in fact it is a request to create a new line. _FCL ought to perform like other similar functions or editors out there.

I see what you're saying here and don't necessarily disagree, but I also understand the concept of what the original authors were going for when they did that. Count all the lines up until the last line of relevant data.

Leaving the option of trimming whitespace to the scripter's discretion would be my wish.

Completely agree.

PS - I also experienced how tough it is getting some changes accepted back in 2009 when a bunch of us were hoping to get a backward-compatible recursive _FileListToArray() put in to replace the existing routine. Some of the final offerings in the 275-post thread are still excellent candidates, IMHO.

Now that's funny. I almost brought up that thread. I think it was Tlem who started it. It was my first real lesson that speed is the root of evil.

PPS - I've apparently also had enough success getting some tickets approved that I haven't abandoned the effort at making contributions :)

Well that's good, because I definitely don't want that. I think you're good at this kind of stuff and have even thought about sending you a PM in the past for your thoughts about how I might make a certain function faster. If you do get shot down on this request (or any others in the future), I recommend you post it in the examples as "An Alternative **** ". You never know who might take advantage of the function or the ideas behind the design of your function. ;)

Hello. Sorry to continue the debate on this thread, but there are some points that I would like to clarify.

1st, it was said:

I don't understand the logic in anyone preferring to strip trailing blank lines from the file.

Isn't the name of the function FileCountLines(), not FileStripTrailingBlankLinesAndThenCountLines()?

Notepad, Word, SciTE, DOS Edit, and every other editor in the world do not remove these lines.

_FileReadToArray(), _FileReadLine(), and _FileGetSize() read these lines.

After some testing, I don't agree with that.

Consider this example file:

FileWrite("TestFile.txt", "Line 1" & @CRLF & _
     "Line 2" & @CRLF & _
     "Line 3" & @CRLF & _
     "Line 4" & @CRLF & _
     "Line 5" & @CRLF)

_FileReadToArray("TestFile.txt") reports an array of 5 elements!

_FileReadLine("TestFile.txt", 6) reports an error!

And finally, if we compare the file with and without the final @CRLF chars:

_FileGetSize("TestFile.txt") reports only a 2-byte difference, due to the CR and LF chars that are missing from one file, but it never indicates whether there are 5 or 6 lines.
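For anyone who wants to repeat these checks, here is a minimal sketch. It assumes the standard File.au3 include and uses the built-in FileReadLine() and FileGetSize(); the temp-file path is only an illustrative choice, and the results you see may depend on your AutoIt version.

```autoit
#include <File.au3> ; provides _FileReadToArray()

; Hypothetical scratch file; any writable path would do.
Local $sFile = @TempDir & "\TestFile.txt"
FileDelete($sFile) ; FileWrite() appends, so start from a clean file

FileWrite($sFile, "Line 1" & @CRLF & "Line 2" & @CRLF & "Line 3" & @CRLF & _
        "Line 4" & @CRLF & "Line 5" & @CRLF)

; 1) How many elements does _FileReadToArray() report?
Local $aLines
If _FileReadToArray($sFile, $aLines) Then
    ConsoleWrite("_FileReadToArray count: " & $aLines[0] & @CRLF)
EndIf

; 2) Does asking for a 6th line raise an error?
Local $sLine = FileReadLine($sFile, 6)
Local $iError = @error ; capture immediately, before any other call resets it
ConsoleWrite("FileReadLine(6) @error: " & $iError & @CRLF)

; 3) The size in bytes says nothing about how the lines are counted.
ConsoleWrite("FileGetSize: " & FileGetSize($sFile) & " bytes" & @CRLF)

FileDelete($sFile) ; clean up the scratch file
```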

2nd, I agree with Spiff59 about text editors. They can show you an extra line at the end, but isn't that just a point of view, or maybe a hint that you are ready to create a new line by typing text?

In a text file, what indicates that there is a line?

If you take my example file, the line 6 that you can see in SciTE is empty, but what indicates that it's a real line?

Can a line without any characters, or anything else to say that it is a blank line, really count as a line?

Yes, editors report/show this "line", but most Windows, DOS and Unix utilities don't report this extra line. ^^

So what's the right choice?

Are the AutoIt devs wrong for all these functions, or can we consider that it's just a request for some particular situations?

@Beege

Don't get me wrong, I support yours/jchd's changes to this function, but for one reason only. It removes the memory limitations. That's something I did not know before reading this thread.

All functions that use FileRead() are subject to the same file-size problem (and there are a certain number of them :) ).

Best Regards.Thierry


@Beege

All functions that use FileRead() are subject to the same file-size problem (and there are a certain number of them :) ).

No, that's not true. If FileRead() is limited to reading only a set number of bytes (like jchd's version does), the memory problem can be avoided. ;)
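As a rough illustration of that idea, the sketch below walks a file in fixed-size chunks so that only one chunk is held in memory at a time. The function name _CountCharsChunked and the 1 MB default chunk size are illustrative assumptions, not part of any standard UDF.

```autoit
; Sketch: process a file chunk by chunk instead of with one big FileRead().
; _CountCharsChunked and $iChunk are hypothetical names used for illustration.
Func _CountCharsChunked($sFilePath, $iChunk = 1024 * 1024)
    Local $hFile = FileOpen($sFilePath) ; default mode: read
    If $hFile = -1 Then Return SetError(1, 0, 0)
    Local $iTotal = 0, $sChunk
    Do
        ; At most $iChunk characters are requested per pass.
        $sChunk = FileRead($hFile, $iChunk)
        $iTotal += StringLen($sChunk)
    Until StringLen($sChunk) < $iChunk ; a short (or empty) read ends the loop
    FileClose($hFile)
    Return $iTotal
EndFunc   ;==>_CountCharsChunked
```

Whatever work is needed (counting, searching, and so on) would go inside the loop, on $sChunk, rather than on the whole file at once.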

Right, but that's not the default use of this function when you want to read data from files.

In most cases, users read the entire file to do their work. :)

Most of the time it works, because they read files that do not exceed the maximum size, but if by exception the work must be applied to large files, the script crashes and reports the memory error.

Maybe it would be good to mention it in the FileRead() documentation, or to add an error exception on file size. ^^

Best Regards.Thierry


Right, but that's not the default use of this function when you want to read data from files.

In most cases, users read the entire file to do their work. :)

Oh I know that and agree. It was just that you originally said "All functions that use FileRead()....."

Maybe it would be good to mention it in the FileRead() documentation, or to add an error exception on file size. ^^

It's already in the doc, but it doesn't specify any value for a size. It just says: A count value that is too large can lead to AutoIt stopping with a memory allocation failure.

It was just that you originally said "All functions that use FileRead()....."

Yes. Sorry, I was not precise enough.

I was talking about these functions, which use FileRead() without buffering the data and can therefore cause a memory allocation error on large files (files greater than 179639503 bytes in my testing):

  • _FileReadToArray
  • _FileWriteLog
  • _FileWriteToLine
  • _ReplaceStringInFile
  • _SQLite_SQLiteExe
A 170 MB file is a really big file, but nowadays a log file can reach this size without any problem. :)

Best Regards.Thierry


Consider this example file:

FileWrite("TestFile.txt", "Line 1" & @CRLF & _
     "Line 2" & @CRLF & _
     "Line 3" & @CRLF & _
     "Line 4" & @CRLF & _
     "Line 5" & @CRLF)

_FileReadToArray("TestFile.txt") reports an array of 5 elements!

_FileReadLine("TestFile.txt", 6) reports an error!

Correct.

I hope so: there is no line number 6.

2nd, I agree with Spiff59 about text editors. They can show you an extra line at the end, but isn't that just a point of view, or maybe a hint that you are ready to create a new line by typing text?

If you take my example file, the line 6 that you can see in SciTE is empty, but what indicates that it's a real line?

All current text editors I know of behave this way: an empty file is shown with the cursor set at line #1 to ease entering new text. By the same logic, they show an extra line when the last line ends with a line terminator. But this is only for convenience and, if one thinks about it for a second, this is the exact interpretation of the control characters that implement the "normal" line termination (CRLF). From this point of view, the MS convention is historically correct. Don't forget that these control characters had a mechanical meaning for teletypewriters and printers: LF just caused the paper to advance one line while leaving the print head [aka carriage] at the same place. CR caused the carriage to home left but didn't advance the paper.

The only text editor of the "modern" era (CRT era) that I remember treating an empty file as having zero lines was:

IBM Personal Computers Professional Editor II

Version 2.00

Program by Walter J. Paul

*** IBM Internal Use Only ***

You had to enter "editing mode" to be able to type text. It displays the line number as ***** when you get "past" the actual last line. FYI my copy is dated 1987/06/26 and still offers features not found in modern editors (except emacs) that I'm quite happy to use once in a blue moon.

Simple logic demands that an empty file has zero lines. A file containing "abc" has one line. A file containing "<line terminator>" also has one line. A file containing "abc<line terminator>" also has one line.
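Those four rules can be sketched on an in-memory string as follows. This is only an illustration of the convention described above, taking LF as the line terminator (which also covers CRLF); _CountLinesInString is a hypothetical helper name, not an existing UDF.

```autoit
; Sketch of the "terminator ends a line" convention.
; _CountLinesInString is a hypothetical helper, not part of any UDF.
Func _CountLinesInString($sText)
    If StringLen($sText) = 0 Then Return 0 ; empty content: zero lines
    ; The replaced string is discarded; only the count in @extended matters.
    StringReplace($sText, @LF, "", 0, 1)
    Local $iCount = @extended
    ; An unterminated last line still counts as a line.
    If StringRight($sText, 1) <> @LF Then $iCount += 1
    Return $iCount
EndFunc   ;==>_CountLinesInString

; The four cases from the rules above, in order: 0, 1, 1, 1.
ConsoleWrite(_CountLinesInString("") & @CRLF)
ConsoleWrite(_CountLinesInString("abc") & @CRLF)
ConsoleWrite(_CountLinesInString(@CRLF) & @CRLF)
ConsoleWrite(_CountLinesInString("abc" & @CRLF) & @CRLF)
```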

From this point of view, I consider ignoring leading and/or trailing line terminators as a bug.

Now, given that this function is very rarely used by itself, I wouldn't regard fixing it as a problematic script-breaking issue. It's much less script-breaking than all the "numeric" fixes that went with 3.3.8.0, all of which were badly needed.

This wonderful site allows debugging and testing regular expressions (many flavors available). An absolute must have in your bookmarks.
Another excellent RegExp tutorial. Don't forget downloading your copy of up-to-date pcretest.exe and pcregrep.exe here
RegExp tutorial: enough to get started
PCRE v8.33 regexp documentation latest available release and currently implemented in AutoIt beta.

SQLitespeed is another feature-rich premier SQLite manager (includes import/export). Well worth a try.
SQLite Expert (freeware Personal Edition or payware Pro version) is a very useful SQLite database manager.
An excellent eBook covering almost every aspect of SQLite3: a must-read for anyone doing serious work.
SQL tutorial (covers "generic" SQL, but most of it applies to SQLite as well)
A work-in-progress SQLite3 tutorial. Don't miss other LxyzTHW pages!
SQLite official website with full documentation (may be newer than the SQLite library that comes standard with AutoIt)


Hello JC, you explain that to me, but I have agreed with that since the start of this thread.

In my previous messages, I talked about that to explain my point of view to Spiff59. :)

You did too, with older explanations that go back to the origins. ^^

From the beginning, my proposal was simply to return the number of lines. The problem (for Spiff59) was just this fu#!;% trailing blank line, and for me the memory allocation error from the built-in function (which my function doesn't suffer from).

Best Regards.Thierry


Hi Thierry,

I didn't quote you to show disagreement, but rather to build (again) on your own opinions.

That's why I said "Correct" and so on.

I know we fully agree on this.


No harm done (yet).


It's funny because the only program I ever noticed that would not work if the "last line" ended with a new line was FAVC video converter. I used to rewrite the config.ini file to set the output directory without having to do the BrowseForFolder bit. The program would not function if I wrote a new line at the end of the file. Strange.

My other thought would be, instead of trying to suck the file into a variable and letting the allocation fail, a sanity check might be useful. For example if you know the system has x GB of physical ram you might do a FileGetSize(). Although I don't see in the docs if it supports huge files. I imagine it uses the 64 bit file size API these days though.
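A minimal sketch of such a sanity check might look like this. The helper name _ReadIfSmallEnough and the 100 MB ceiling are assumptions made up for the example; the right limit would depend on the machine and on what the script does with the data.

```autoit
; Sketch: refuse to slurp a file whose size exceeds a chosen ceiling.
; _ReadIfSmallEnough and the 100 MB default are illustrative assumptions.
Func _ReadIfSmallEnough($sFilePath, $iMaxBytes = 100 * 1024 * 1024)
    Local $iSize = FileGetSize($sFilePath)
    If @error Then Return SetError(1, 0, "") ; size could not be determined
    If $iSize > $iMaxBytes Then Return SetError(2, 0, "") ; too big to read in one go
    Local $sData = FileRead($sFilePath)
    If @error = 1 Then Return SetError(3, 0, "") ; the read itself failed
    Return $sData
EndFunc   ;==>_ReadIfSmallEnough

; Usage: fall back to a chunked approach when @error = 2.
Local $sText = _ReadIfSmallEnough(@ScriptFullPath)
If @error Then ConsoleWrite("Not read, @error = " & @error & @CRLF)
```

This only guards against obviously oversized files; it does not guarantee that the allocation will succeed.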


My other thought would be, instead of trying to suck the file into a variable and letting the allocation fail, a sanity check might be useful. For example if you know the system has x GB of physical ram you might do a FileGetSize().

I'm unsure that would do much good. First, memory can get allocated or freed very fast by other services or programs under your feet and beyond your control, unless you isolate your code fragment in an exclusive portion (weird for general purpose). Then even knowing how much _free_ physical memory the system has at a given time doesn't always make you safe to allocate that much. You have to decide if memory fragmentation plays a role or not in your particular (under the hood) allocation scheme. That makes it very difficult to plan in the general case of simple functions for routine use.


I believe the following version is about as fast as pure AutoIt can be, and it doesn't have any file size or memory limitation.

By default, it simply counts linefeeds only (this is compatible with both traditional Unix-like [LF] and DOS-Windows [CRLF] conventions) using an 8 MB buffer, which should cope with any decent PC RAM and hard disk or SSD buffer. Adjust the buffer size to vary results slightly.

One can change the line counting behavior by passing a non-zero second argument. It then counts carriage-returns only (suitable for the traditional Mac text file convention [CR]).

Func _FileCountLines($sFilePath, $CRonly = 0)
    Local $hFile = FileOpen($sFilePath)
    If $hFile = -1 Then Return SetError(1, 0, 0)
    Local $iLineCount = 0, $sBuffer, $bDone
    Local Const $BUFFER_SIZE = 8 * 1024 * 1024
    ; Count LF by default; count CR instead when $CRonly is non-zero.
    Local $sTermination = @LF
    If $CRonly Then $sTermination = @CR
    Do
        $sBuffer = FileRead($hFile, $BUFFER_SIZE)
        $bDone = (@extended <> $BUFFER_SIZE) ; a short read means end of file
        StringRegExpReplace($sBuffer, $sTermination, "")
        $iLineCount += @extended ; number of terminators found in this chunk
    Until $bDone
    ; A non-empty file whose last character is not a terminator has one more line.
    If FileGetPos($hFile) > 0 Then
        FileSetPos($hFile, -1, 1)
        If FileRead($hFile, 1) <> $sTermination Then $iLineCount += 1
    EndIf
    FileClose($hFile)
    Return $iLineCount
EndFunc   ;==>_FileCountLines

Edit: fixed empty file case. Oops.

The following version also doesn't have any file size or memory limitation.

Also, by utilizing @extended, this function will tell whether there is a "trailing blank line" or not.

My XP machine gave the following results:

Test results on 216.2Mb file with 12,299,150 lines:-

28.6 secs - this version.

82.4 secs - jchd's regexp version.

162.1 secs - Tlem's count FileReadLine version.

Test results on 3.8Mb file with 10,000 lines:-

0.335 secs - Tlem's count FileReadLine version.

0.567 secs - this version.

0.976 secs - jchd's regexp version.

1.135 secs - current _FileCountLines version.

Local $sFileName = "File_216Mb.txt" ;"EmptyFile.txt" ;"linesNoTrailingLF.txt" ;"lines.txt" ;

$begin = TimerInit()
Local $iNumLines = _FileCountLines($sFileName)
Local $iLastChar = @extended
ConsoleWrite("No. of Lines = " & $iNumLines & ".    Last character is CR or LF = " & ($iLastChar = 1) & @CRLF)
ConsoleWrite(TimerDiff($begin) & @CRLF & @CRLF)


; ================ _FileCountLines ===============================
; If the last character in the file is a vertical whitespace character, (@LF or @CR), then @extended = 1 is returned .
;  Otherwise, @extended = 0.
; Thanks to:-
; tylo - http://www.autoitscript.com/forum/topic/6330-new-fast-line-counter/page__view__findpost__p__44449
; jchd - http://www.autoitscript.com/forum/topic/137024-modified-filecountlines/page__view__findpost__p__958820

Func _FileCountLines($sFilePath)
    Local $iSizeRead, $hFile = FileOpen($sFilePath)
    If $hFile = -1 Then Return SetError(1, 0, 0)
    Local $iLineCount = 0, $iExtend = 0, $sBuffer
    Local Const $BUFFER_SIZE = 8 * 1024 * 1024
    Do
        $sBuffer = FileRead($hFile, $BUFFER_SIZE)
        $iSizeRead = @extended
        If StringInStr($sBuffer, @LF) Then
            $iLineCount += StringLen(StringAddCR($sBuffer)) - $iSizeRead ; Used when LF's are present.
        Else
            $iLineCount += $iSizeRead - StringLen(StringStripCR($sBuffer)) ; Used when only CR's are present.
        EndIf
    Until ($iSizeRead <> $BUFFER_SIZE)
    If $iLineCount > 0 And FileGetPos($hFile) > 0 Then
        FileSetPos($hFile, -1, 1)
        If StringRegExp(FileRead($hFile, 1), "\v") Then
            $iExtend = 1
        Else
            $iLineCount += 1 ; No EOF vertical whitespace character.
        EndIf
    EndIf
    FileClose($hFile)
    Return SetError(0, $iExtend, $iLineCount)
EndFunc   ;==>_FileCountLines

You guys have certainly done better than what I offered in the initial post, having removed the memory limitation entirely (and additionally increased the performance). I guess I'm stubborn, because I'd still say that we should treat the following as true:

"I am a 1-line file"

"I am a 2-line filev"

"I am a 5-line filevvvv"

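Read that way (assuming each trailing mark in the examples above stands for one line terminator), the count is simply the number of terminators plus one, which is what an editor's line gutter shows. A minimal sketch, with _CountLinesEditorStyle as a hypothetical name:

```autoit
; Sketch of the "editor style" convention: every terminator opens a new line.
; _CountLinesEditorStyle is a hypothetical helper, not an existing UDF.
Func _CountLinesEditorStyle($sText)
    ; Only the replacement count in @extended is of interest here.
    StringReplace($sText, @LF, "", 0, 1)
    Return @extended + 1
EndFunc   ;==>_CountLinesEditorStyle

; No terminator, one terminator, four terminators: 1, 2, 5.
ConsoleWrite(_CountLinesEditorStyle("I am a 1-line file") & @CRLF)
ConsoleWrite(_CountLinesEditorStyle("I am a 2-line file" & @CRLF) & @CRLF)
ConsoleWrite(_CountLinesEditorStyle("I am a 5-line file" & @CRLF & @CRLF & @CRLF & @CRLF) & @CRLF)
```

Note that under this convention an empty file counts as one line, which is exactly the point of disagreement in this thread.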
Hi,

Could I suggest replacing the StringRegExpReplace with a simple StringReplace using a casesense parameter of 2 (not case sensitive, using a basic/faster comparison)? The number of replacements is still returned in @extended. ;)

; StringRegExpReplace($sBuffer, $sTermination, "")
StringReplace($sBuffer, $sTermination, "", 0, 2)

I have found this to be significantly faster in other scripts, and my tests (Vista x32) show that it is here as well. :)
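A quick way to check this on your own machine is a timing sketch like the one below. The buffer contents and the 200,000-line size are arbitrary choices for illustration, and the timings will vary from system to system.

```autoit
; Build a throwaway buffer of CRLF-terminated lines (size chosen arbitrarily).
Local $sBuffer = ""
For $i = 1 To 200000
    $sBuffer &= "Line " & $i & @CRLF
Next

; Count linefeeds with the regular expression engine.
Local $hTimer = TimerInit()
StringRegExpReplace($sBuffer, @LF, "")
Local $iCountRegExp = @extended ; capture before any other call resets it
Local $fRegExp = TimerDiff($hTimer)

; Count linefeeds with StringReplace() and casesense = 2 (basic comparison).
$hTimer = TimerInit()
StringReplace($sBuffer, @LF, "", 0, 2)
Local $iCountReplace = @extended
Local $fReplace = TimerDiff($hTimer)

ConsoleWrite("StringRegExpReplace: " & $iCountRegExp & " LFs in " & Round($fRegExp, 1) & " ms" & @CRLF)
ConsoleWrite("StringReplace:       " & $iCountReplace & " LFs in " & Round($fReplace, 1) & " ms" & @CRLF)
```

Both calls should report the same count; only the elapsed time is expected to differ.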

M23


I get better results too.


I'm unsure that would do much good. First, memory can get allocated or freed very fast by other services or programs under your feet and beyond your control, unless you isolate your code fragment in an exclusive portion (weird for general purpose). Then even knowing how much _free_ physical memory the system has at a given time doesn't always make you safe to allocate that much. You have to decide if memory fragmentation plays a role or not in your particular (under the hood) allocation scheme. That makes it very difficult to plan in the general case of simple functions for routine use.

It can give you an idea how much memory allocation to ask for. Not necessarily the whole enchilada. Like if I have > 1 GB ask for x if > 2 GB ask for y etc.. For example I used a CRC32 function back in Win9x in Delphi where I had tiers of buffer sizes for the block file read depending on the physical memory in the system. A heuristic. Not just asking for all the ram I could get. It's an optimization over just saying "give me 8 KB."
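In AutoIt, such a tiered heuristic could be sketched with MemGetStats(), which reports memory sizes in kilobytes. The function name _PickBufferSize and every threshold and buffer size below are made-up values for illustration only.

```autoit
; Sketch: choose a read buffer size from the amount of physical RAM.
; _PickBufferSize and all tier values are illustrative assumptions.
Func _PickBufferSize()
    Local $aMem = MemGetStats()
    Local $iTotalKB = $aMem[1] ; total physical RAM, in KB
    If $iTotalKB > 2 * 1024 * 1024 Then Return 32 * 1024 * 1024 ; more than 2 GB of RAM
    If $iTotalKB > 1024 * 1024 Then Return 16 * 1024 * 1024 ; more than 1 GB of RAM
    Return 8 * 1024 * 1024 ; conservative default
EndFunc   ;==>_PickBufferSize

ConsoleWrite("Chosen buffer size: " & _PickBufferSize() & " bytes" & @CRLF)
```

The returned value could then replace a fixed buffer size in a chunked reader.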

Also could be a sanity check. If the file is > a GB and it's supposed to be a text file, chances are it's really a video. Who does flat text files > GB instead of using a formal database? I'd say chances of text files > 100 MB are close to zero unless somebody is using some really really old code.


of text files > 100 MB

I encounter those on occasion, maybe a data dump into a csv file, or a system log. Not sure I recall any of them breaking a GB to date.

He's defaulted the buffer to 8MB? One would think you could safely start larger than that, although I'm not sure the performance gain would be significant.

