RegExp in large textfile

Tankbuster · July 27, 2009

I tried the forum search (FS) but could not find a matching hint for my problem.

Let me first describe my problem:

- A very large ASCII text file (let's say 900 MB) with a line pattern like:

X1:11111111:0100

X1:22222222:0200

X1:33333333:0300

X2:11111111:0200

This is a simplified version of my file, but I guess you are able to see the format.

The first field is a sub-key of the second field (for example, a binder number), 1-9.

The major key is the second field, a reference number (for example, a personal number).

My problem:

I would like to read from that file all lines whose second field matches a given value, something like this:

Example Code call:

$arrayMatchingLines=getFileContent("11111111")

and it should return:

[0]: X1:11111111:0100

[1]: X2:11111111:0200

As the file can be very large, I doubt that reading it line by line with FileReadLine is a good idea, and reading the whole file with FileRead does not look promising either.

Of course I tried it, and it worked, but because I read the complete file, memory usage and performance were not good.

So is there a wonderful UDF out there that can read a file in a fast and memory-friendly way? A RegExpReadFileContent UDF?

To put it another way: on UNIX I would do something like this (please do not correct my syntax; I use it only for illustration):

Example Unix process:
grep '11111111' file_a > tmpfile
foreach line ( tmpfile ) 
  do_some_thing_wonderfull_with $line
end

Maybe this makes it clear.

I tried FileReadLine and FileRead, but maybe I just missed a simple step toward a small and fast RegExpReadFile.

Thank you very much.

(I hope I described my problem clearly enough, and if I missed something in the forum search, just slap me for that.)

Edited July 27, 2009 by Tankbuster

Rarst · July 27, 2009

I assume from the format example that all lines have a fixed length? You can try getting a handle with FileOpen, then reading line by line with FileRead, passing the handle and a character count equal to one line.
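A minimal sketch of this fixed-length idea, assuming every line is exactly 16 characters plus @CRLF (18 in total), as in the sample data. The file path and the key are placeholders, not values from the thread.

```autoit
; Assumption: every record is 16 characters plus @CRLF = 18 characters.
Local Const $iRecLen = 18
Local $sKey = "11111111"                             ; placeholder key
Local $hFile = FileOpen(@ScriptDir & "\test.txt", 0) ; placeholder path, read mode

If $hFile <> -1 Then
    While 1
        Local $sRec = FileRead($hFile, $iRecLen) ; read exactly one record
        If @error Then ExitLoop                  ; end of file (or read error)
        ; Field two starts at character 4 and is 8 characters long
        If StringMid($sRec, 4, 8) == $sKey Then ConsoleWrite($sRec)
    WEnd
    FileClose($hFile)
EndIf
```

This avoids any search for line breaks, but it still issues one read call per line, which may be slow on a very large file.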

Authenticity · July 27, 2009

ProcessSetPriority(@AutoItPID, 3) ; 3 = above-normal priority

Global Const $sFile = @ScriptDir & '\test.txt'
; Chunk size: 32MB
Global Const $iBuffSize = 0x100000 * 32
Global $hFile, $iSize, $iRead, $sText, $sTemp, $iPos
Global $avArray[1] = [0], $aMatch ; $avArray[0] holds the number of matches

$iSize = FileGetSize($sFile)
$iRead = 0
$sTemp = ''
$sText = ''

$hFile = FileOpen($sFile, 0) ; 0 = read mode

If $hFile <> -1 Then

    While $iRead < $iSize
        ; Prepend the partial line carried over from the previous chunk
        $sText = $sTemp & FileRead($hFile, $iBuffSize)
        $iRead += $iBuffSize
        $sTemp = ''

        ; Unless this is the last chunk, hold back the incomplete trailing line
        If $iRead < $iSize And StringRight($sText, 2) <> @CRLF Then
            $iPos = StringInStr($sText, @CRLF, 0, -1) ; last line break in the chunk
            If $iPos Then
                $sTemp = StringMid($sText, $iPos + 2)  ; text after the last @CRLF
                $sText = StringLeft($sText, $iPos + 1) ; complete lines only
            Else
                $sTemp = $sText ; no line break at all: carry everything over
                $sText = ''
            EndIf
        EndIf

        ; Flag 3 returns all matches in a 0-based array; group 1 is the second field
        $aMatch = StringRegExp($sText, '(?m)^[^:]++:(1+):', 3)
        If IsArray($aMatch) Then
            Local $iUpperBound = UBound($aMatch)
            ReDim $avArray[$avArray[0] + $iUpperBound + 1]

            For $i = 0 To $iUpperBound - 1
                $avArray[0] += 1
                $avArray[$avArray[0]] = $aMatch[$i]
            Next
        EndIf
    WEnd

    FileClose($hFile)
    ReDim $avArray[$avArray[0] + 1]
EndIf

GEOSoft · July 27, 2009

I'm not sure if this will be faster on a large file or not but you can give it a try. C:\Test.txt was just your example.

;
#include<array.au3> ;; For _ArrayDisplay() only
$sFileIn = "C:\Test.txt"
$sFileOut = @ScriptDir & "\results.txt"
$sFind = "11111111"
$hRun = RunWait(@Comspec & " /c Findstr.exe " & '"' & $sFind & '" ' & '"' & _
      $sFileIn & '"' & ' > "' & $sFileOut & '"', @ScriptDir, @SW_Hide)
$sHold = FileRead($sFileOut)
FileDelete($sFileOut)
$aRegExp = StringRegExp($sHold, "(?m:^)(.*\d)", 3)
If NOT @Error Then
   _ArrayDisplay($aRegExp, "Results")
EndIf
;

I couldn't get StdoutRead() to display the proper results with a Run() command, so I did it this way (creating the file) instead.
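For comparison, a hedged sketch of how the same findstr call could be read through StdoutRead() without a temporary file. It follows the usual polling pattern for Run() with $STDOUT_CHILD and is not the code used in this thread; the variable names are illustrative.

```autoit
#include <Constants.au3> ; provides $STDOUT_CHILD

Local $sFileIn = "C:\Test.txt" ; same sample path as above
Local $sFind = "11111111"
Local $sCmd = @ComSpec & ' /c findstr "' & $sFind & '" "' & $sFileIn & '"'

; Start findstr hidden and capture its standard output
Local $iPID = Run($sCmd, @ScriptDir, @SW_HIDE, $STDOUT_CHILD)
Local $sOut = ""
While 1
    $sOut &= StdoutRead($iPID) ; returns whatever is available right now
    If @error Then ExitLoop    ; @error is set once the stream is closed
    Sleep(10)
WEnd

; Split the captured output into one array element per non-empty line
Local $aLines = StringRegExp($sOut, "[^\r\n]+", 3)
If Not @error Then
    For $i = 0 To UBound($aLines) - 1
        ConsoleWrite($aLines[$i] & @CRLF)
    Next
EndIf
```

Note that all matching lines are collected in memory, so this only suits searches with a moderate number of hits.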

Tankbuster · July 27, 2009

Rarst wrote: "I assume from format example all lines have fixed length? You can try to get handle with FileOpen, then read line by line with FileRead, using handle and count of characters equal to one line."

Thanks for the suggestion. But reading line by line would be a horror and a waste of time (at least in my example).

Yes, you are right that a file handle is better than using the filename (I guess using the filename opens and closes the file on every call, which is even worse), but as far as I understand, your suggestion is a better fit for smaller files. Thanks for the helping hand.

Authenticity wrote:
ProcessSetPriority(@AutoItPID, 3)
...

This looks like a good approach. I will try it. Just from looking at your code, it seems you read a chunk of 32 MB and search it for the pattern. That may be slower for small files, but for big files (which is what I need) it will fit perfectly.

It looks like a good balance between speed and resource use. Thank you very much. Maybe I will size the chunk based on the memory available on the local system.
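One way the chunk size could be derived from the available memory is sketched below. The one-eighth share and the 1 MB and 64 MB limits are arbitrary assumptions chosen for illustration, not values from this thread.

```autoit
; Assumption: use at most 1/8 of the free physical RAM, between 1 MB and 64 MB.
Local $aMem = MemGetStats()         ; sizes are reported in kilobytes
Local $iFreeBytes = $aMem[2] * 1024 ; [2] = available physical RAM
Local $iBuffSize = Int($iFreeBytes / 8)
If $iBuffSize > 0x100000 * 64 Then $iBuffSize = 0x100000 * 64
If $iBuffSize < 0x100000 Then $iBuffSize = 0x100000
ConsoleWrite("Chunk size: " & $iBuffSize & " bytes" & @CRLF)
```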

GEOSoft wrote: "I'm not sure if this will be faster on a large file or not but you can give it a try. C:\Test.txt was just your example."
...

I will give this a try too. During my planning I actually skipped the Run() command because I thought it would be slow. But hey, you wrote it, so I will test it.

Maybe I will post the speed results here when I'm done :-)

Thx to all!

Edited July 27, 2009 by Tankbuster

Tankbuster · July 29, 2009

So here is my interim result (based on my project); it gives you some sort of comparison.

I used the two functions with the same file and the same calling functions (so my debug messages are the same for both).

I used a very small file for testing and called each function several times in a loop (because that is the real usage!), changing only the name of the function being called.

Search for 13 different values:

Authenticity's solution: 0.322s

GEOSoft's solution: 1.193s

So if I calculated correctly, Authenticity's solution is about 3.7 times as fast as the Run() command (roughly 270% faster).

After this quick test I tried it with the big file: Authenticity's solution finished after 10 minutes, and I aborted the Run() version after 3 hours.

So my guess was correct: on big files, Authenticity's solution gains more from the chunked reading.

Example: searching for one value in a file

When you search for only one value in a big file:

Authenticity's solution: 0.240s

GEOSoft's solution: 0.297s

But thanks for both ways, it was easier for me to compare.
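A comparison like this could be timed with TimerInit() and TimerDiff(). The sketch below is only an assumption about how such a harness might look; _SearchChunked and _SearchFindstr are hypothetical stand-ins for the two solutions, with placeholder bodies.

```autoit
; Hypothetical stand-ins for the two search routines being compared
Func _SearchChunked($sKey)
    Return $sKey ; placeholder body
EndFunc

Func _SearchFindstr($sKey)
    Return $sKey ; placeholder body
EndFunc

Local $aKeys[3] = ["11111111", "22222222", "33333333"] ; sample keys
Local $aFuncs[2] = ["_SearchChunked", "_SearchFindstr"]

For $f = 0 To UBound($aFuncs) - 1
    Local $hTimer = TimerInit()
    For $k = 0 To UBound($aKeys) - 1
        Call($aFuncs[$f], $aKeys[$k]) ; only the function name changes
    Next
    ; TimerDiff returns milliseconds; report seconds
    ConsoleWrite($aFuncs[$f] & ": " & Round(TimerDiff($hTimer) / 1000, 3) & "s" & @CRLF)
Next
```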

GEOSoft · July 29, 2009

Good to know. Thanks

Tankbuster · July 29, 2009

But after further tests I can say this is not a general statement about speed.

It really depends on what you are trying to do.

Also, there is room for improvement in the accepted solution.

(As my lines have a fixed length, I could make the read size a multiple of my line length, so I would not need to search backward from the end of the chunk for the last @CRLF.)

But of course that only fits my file; it would not work for lines of varying length.

Using ConsoleWrite, I found that the backward string operation took most of the time.
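A sketch of that improvement, assuming every record is exactly 16 characters plus @CRLF (18 in total) as in the sample lines. The path, the key, and the line pattern are illustrative assumptions.

```autoit
; Assumption: each record is 16 characters plus @CRLF = 18 characters.
Local Const $iRecLen = 18
; Largest multiple of the record length that fits into 32 MB
Local Const $iBuffSize = Int((0x100000 * 32) / $iRecLen) * $iRecLen

Local $hFile = FileOpen(@ScriptDir & "\test.txt", 0) ; placeholder path
If $hFile <> -1 Then
    While 1
        Local $sText = FileRead($hFile, $iBuffSize)
        If @error Then ExitLoop ; end of file
        ; Every chunk starts on a line boundary, so no carry-over is needed.
        ; Capture each complete line whose second field is the key.
        Local $aMatch = StringRegExp($sText, "(?m)^([^:\r\n]+:11111111:[^\r\n]*)", 3)
        If Not @error Then
            For $i = 0 To UBound($aMatch) - 1
                ConsoleWrite($aMatch[$i] & @CRLF)
            Next
        EndIf
    WEnd
    FileClose($hFile)
EndIf
```

Because the chunk size is a whole number of records, the backward StringInStr search and the string splitting disappear entirely.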

Anyway:

I now have some ideas about when to use one function or the other.

I found another thread with some useful material that may also be relevant here:

http://www.autoitscript.com/forum/index.php?showtopic=97494

Edited July 29, 2009 by Tankbuster
