RegExp in large textfile


I tried the forum search (FS), but I was not able to find a matching hint for my problem.

Let me first describe my problem:

- A very large ASCII text file (let's say 900 MB) with lines in this pattern:

X1:11111111:0100

X1:22222222:0200

X1:33333333:0300

X2:11111111:0200

This is a simplified version of my file, but I guess you are able to see the format.

The first field is a sub key to the second field (for example a binder number), 1-9.

The major key is the second field, a reference number (for example a personal number).

My problem:

I would like to read all lines from that file whose second field matches a given value, something like this:

Example Code call:

$arrayMatchingLines=getFileContent("11111111")

and it should return:

[0]: X1:11111111:0100

[1]: X2:11111111:0200

As the file could be very large, I doubt that reading it line by line with FileReadLine is a good idea, and reading the whole file with FileRead does not look promising either.

Of course I tried it and it worked, but because I read the complete file into memory, memory usage and performance were not that good.

So is there a wonderful UDF out there that is able to read a file in a fast and memory-friendly way? A RegExpReadFileContent UDF?

To put it a different way: in UNIX I would do something like this (do not correct my syntax, I use it just for describing):

Example Unix process:
grep '11111111' file_a > tmpfile
foreach line ( tmpfile ) 
  do_some_thing_wonderfull_with $line
end

Maybe this makes it clear.

I tried FileReadLine and FileRead. But maybe I just missed a simple step to get a small and fast RegExpReadFile...

Thank you very much.

(I hope I described my problem clearly enough... and if I missed something in the forum search, just slap me for that.)

Edited by Tankbuster

ProcessSetPriority(@AutoItPID, 3) ; 3 = above normal priority

Global Const $sFile = @ScriptDir & '\test.txt'
; Chunk size: 32 MB
Global Const $iBuffSize = 0x100000 * 32
Global $hFile, $iSize, $iRead, $sText, $sTemp, $iPos, $iUpperBound
Global $avArray[1] = [0], $aMatch ; $avArray[0] holds the number of matches

$iSize = FileGetSize($sFile)
$iRead = 0
$sTemp = ''
$sText = ''

$hFile = FileOpen($sFile, 0) ; 0 = read mode

If $hFile <> -1 Then

    While $iRead < $iSize
        ; Prepend the partial line carried over from the previous chunk
        $sText = $sTemp & FileRead($hFile, $iBuffSize)
        $iRead += $iBuffSize
        $sTemp = ''

        ; If more data follows and the chunk ends mid-line, carry the partial line over
        If $iRead < $iSize And StringRight($sText, 2) <> @CRLF Then
            $iPos = StringInStr($sText, @CRLF, 0, -1) ; position of the last line break
            If $iPos > 0 Then
                $sTemp = StringTrimLeft($sText, $iPos + 1)
                $sText = StringLeft($sText, $iPos + 1)
            EndIf
        EndIf

        ; Flag 3 returns a 0-based array with the captured group of every match
        $aMatch = StringRegExp($sText, '(?m)^[^:]++:(1+):', 3)
        If IsArray($aMatch) Then
            $iUpperBound = UBound($aMatch)
            ReDim $avArray[$avArray[0] + $iUpperBound + 1]

            ; Start at index 0 so the first match is not skipped
            For $i = 0 To $iUpperBound - 1
                $avArray[0] += 1
                $avArray[$avArray[0]] = $aMatch[$i]
            Next
        EndIf
    WEnd

    FileClose($hFile)
EndIf
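
For reference, here is a minimal sketch of the same chunked-read idea wrapped into a function that returns whole lines, as asked for in the first post. The function name _GetMatchingLines, its parameters, and the pattern are assumptions made up for illustration; they are not part of the code above. It assumes a plain ASCII file in the field1:field2:field3 format and has not been timed against a 900 MB file.

```autoit
#include <Array.au3> ; only needed for _ArrayDisplay() in the usage lines

; Usage (hypothetical file name)
Local $aLines = _GetMatchingLines(@ScriptDir & '\test.txt', '11111111')
If Not @error Then _ArrayDisplay($aLines, 'Matching lines')

; Hypothetical helper: returns a 0-based array of whole lines whose second
; field equals $sKey. Sets @error to 1 (open failed) or 2 (no match).
Func _GetMatchingLines($sFile, $sKey, $iBuffSize = 33554432) ; 32 MB chunks
    Local $hFile = FileOpen($sFile, 0) ; 0 = read mode
    If $hFile = -1 Then Return SetError(1, 0, 0)

    ; \Q...\E treats the key as literal text; [^\r\n]* stops at the line end
    Local $sPattern = '(?m)^[^:\r\n]+:\Q' & $sKey & '\E:[^\r\n]*'
    Local $sHits = '', $sCarry = '', $sText, $iPos, $aMatch, $bLast

    Do
        $sText = FileRead($hFile, $iBuffSize)
        $bLast = (@error <> 0) Or ($sText = '') ; nothing more to read
        $sText = $sCarry & $sText
        $sCarry = ''

        If Not $bLast Then
            ; Hold back the incomplete last line for the next round
            $iPos = StringInStr($sText, @LF, 0, -1)
            If $iPos = 0 Then
                $sCarry = $sText ; no complete line in this chunk yet
                $sText = ''
            Else
                $sCarry = StringTrimLeft($sText, $iPos)
                $sText = StringLeft($sText, $iPos)
            EndIf
        EndIf

        ; No capture group, so flag 3 returns the full matched lines
        $aMatch = StringRegExp($sText, $sPattern, 3)
        If Not @error Then
            For $i = 0 To UBound($aMatch) - 1
                $sHits &= $aMatch[$i] & @LF
            Next
        EndIf
    Until $bLast

    FileClose($hFile)
    If $sHits = '' Then Return SetError(2, 0, 0)
    Return StringSplit(StringTrimRight($sHits, 1), @LF, 2) ; 2 = no count in element 0
EndFunc   ;==>_GetMatchingLines
```

Collecting the hits in a string and splitting once at the end avoids a ReDim per chunk; whether that matters in practice depends on how many lines match.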


I'm not sure if this will be faster on a large file or not but you can give it a try. C:\Test.txt was just your example.

;
#include<array.au3> ;; For _ArrayDisplay() only
$sFileIn = "C:\Test.txt"
$sFileOut = @ScriptDir & "\results.txt"
$sFind = "11111111"
$hRun = RunWait(@Comspec & " /c Findstr.exe " & '"' & $sFind & '" ' & '"' & _
      $sFileIn & '"' & ' > "' & $sFileOut & '"', @ScriptDir, @SW_Hide)
$sHold = FileRead($sFileOut)
FileDelete($sFileOut)
$aRegExp = StringRegExp($sHold, "(?m:^)(.*\d)", 3)
If NOT @Error Then
   _ArrayDisplay($aRegExp, "Results")
EndIf
;

I couldn't get StdOutRead() displaying the proper results with a Run() command so I did it this way (creating the file) instead.
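
For comparison, a sketch of how the temporary file could be avoided by reading the output of findstr through StdoutRead(). This is an untested assumption about what might work, not the method used above; it assumes a recent AutoIt version where $STDOUT_CHILD is defined in AutoItConstants.au3, and the variable names are illustrative.

```autoit
#include <AutoItConstants.au3> ; $STDOUT_CHILD
#include <Array.au3>           ; _ArrayDisplay() only

Local $sFileIn = "C:\Test.txt"
Local $sFind = "11111111"

; Start findstr directly and ask AutoIt to capture its standard output
Local $iPID = Run('findstr.exe "' & $sFind & '" "' & $sFileIn & '"', @ScriptDir, @SW_HIDE, $STDOUT_CHILD)
If $iPID = 0 Then Exit 1 ; Run() failed

Local $sHold = ''
While 1
    $sHold &= StdoutRead($iPID)
    If @error Then ExitLoop ; stream closed, findstr has finished
    Sleep(10)
WEnd

; One array element per non-empty output line
Local $aLines = StringRegExp($sHold, '(?m)^[^\r\n]+', 3)
If Not @error Then _ArrayDisplay($aLines, "Results")
```

The loop keeps reading until the stream reports an error, because the output arrives in pieces while the child process is still running.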

George

Question about decompiling code? Read the decompiling FAQ and don't bother posting the question in the forums.

Be sure to read and follow the forum rules. -AKA the AutoIt Reading and Comprehension Skills test.***

The PCRE (Regular Expression) ToolKit for AutoIT - (Updated Oct 20, 2011 ver:3.0.1.13) - Please update your current version before filing any bug reports. The installer now includes both 32 and 64 bit versions. No change in version number.

Visit my Blog .. currently not active but it will soon be resplendent with news and views. Also please remove any links you may have to my website. it is soon to be closed and replaced with something else.

"Old age and treachery will always overcome youth and skill!"


I assume from format example all lines have fixed length? You can try to get handle with FileOpen, then read line by line with FileRead, using handle and count of characters equal to one line.

Thanks for the suggestion. But reading line by line would be a horror and a waste of time (at least in my example).

Yes, you are right that a file handle is better than using the filename (I guess using the filename opens and closes the file on every call, which is even worse), but as far as I understand, your suggestion only fits smaller files. Thanks for the helping hand.

ProcessSetPriority(@AutoItPID, 3)

...

This looks like a good approach. I will try it. Just from looking at your code, it seems you are reading a chunk of 32 MB at a time and searching it for the pattern. Yes, that may be slower for small files, but for big files (which is what I need) it will fit perfectly.

It looks like a good balance between speed and resource usage. Thank you very much. Maybe I will increase the chunk size based on the local system memory.

I'm not sure if this will be faster on a large file or not but you can give it a try. C:\Test.txt was just your example.

...

I will give this a try too. During my planning I actually skipped the Run() command because I thought it would be slow. But hey, you wrote it, so I will test it.

Maybe I will post the speed result here when I'm done :-)

Thx to all!

Edited by Tankbuster

So here are my interim results (based on my project); they give you some sort of comparison.

I used the two functions with the same file and the same calling function (so my debug messages are the same for both).

I used a very small file for testing and called each function several times in a loop (because that is the real usage!); only the name of the called function was changed.

Search for 13 different values:

Authenticity's solution: runtime 0.322 s

GEOSoft's solution: runtime 1.193 s

So if I calculated correctly, Authenticity's solution is about 3.7 times as fast as the Run() command here (1.193 s / 0.322 s).

After this quick test I tried it with the big file: Authenticity's solution finished after 10 minutes, and the Run() version was aborted after 3 hours.

So my guess was correct: on big files, Authenticity's solution profits more from the chunked reading.

Example search for one value in a file

When you search only for one value in a big file:

Authenticity: 0.240s

GEOSoft solution: 0.297s

But thanks for both ways, it was easier for me to compare.

Good to know. Thanks

George


But after further tests: this is not a general statement about speed.

It really depends on what you are trying to do.

There is also room for improvement in the accepted solution.

(As my lines have a fixed length, I could choose a read size that is a multiple of my line length, so I would not need to search from the right for the last @CRLF.)

But of course that only fits my file; it does not work when the lines are not of fixed length.
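
A rough sketch of that fixed-length idea, under loudly stated assumptions: every record is exactly 16 characters plus @CRLF (18 in total, like the sample line X1:11111111:0100), the file is plain ASCII, and the file name and the records-per-chunk value are made up for illustration.

```autoit
; Assumption: fixed record length of 16 characters + @CRLF = 18
Local Const $iRecLen = 18
Local Const $iRecsPerChunk = 1000000 ; about 18 MB per read (illustrative value)
Local Const $iChunkLen = $iRecLen * $iRecsPerChunk

Local $hFile = FileOpen(@ScriptDir & '\test.txt', 0) ; 0 = read mode
If $hFile = -1 Then Exit 1

Local $sText, $aMatch, $iHits = 0
While 1
    $sText = FileRead($hFile, $iChunkLen)
    If $sText = '' Then ExitLoop ; nothing left to read

    ; Every chunk ends on a record boundary, so no backward search
    ; for the last @CRLF and no carry-over string is needed
    $aMatch = StringRegExp($sText, '(?m)^[^:\r\n]+:11111111:[^\r\n]*', 3)
    If Not @error Then $iHits += UBound($aMatch)
WEnd

FileClose($hFile)
ConsoleWrite('Matching lines: ' & $iHits & @CRLF)
```

If a single line in the file deviates from the assumed length, every later chunk starts mid-line, so this only makes sense when the format is strictly guaranteed.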

Using ConsoleWrite output, I found that the backward string search took most of the time.
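
A small sketch of how such a measurement could be taken with TimerInit() and TimerDiff(). The test string is generated here as an assumption so that the snippet runs on its own; the numbers it prints depend on the machine and say nothing about the real 900 MB file.

```autoit
; Build a test string of about 1.8 MB from fixed-length sample lines
Local $sLine = 'X1:11111111:0100' & @CRLF
Local $sText = ''
For $i = 1 To 100000
    $sText &= $sLine
Next
$sText &= 'X1:2222' ; simulate a chunk that ends in the middle of a line

; Time the backward search for the last line break
Local $hTimer = TimerInit()
Local $iPos = StringInStr($sText, @CRLF, 0, -1)
ConsoleWrite('StringInStr from right: ' & TimerDiff($hTimer) & ' ms, position ' & $iPos & @CRLF)

; Time the pattern search over the same text
$hTimer = TimerInit()
Local $aMatch = StringRegExp($sText, '(?m)^[^:]++:(1+):', 3)
ConsoleWrite('StringRegExp: ' & TimerDiff($hTimer) & ' ms, matches ' & UBound($aMatch) & @CRLF)
```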

Anyway:

I got some ideas when to use the one or the other function.

I found another thread with some useful stuff that may also apply here:

http://www.autoitscript.com/forum/index.php?showtopic=97494

Edited by Tankbuster