Optimize a binary file edit and copy?

rmh · July 1, 2008

I would like to optimize my AutoIt code that repairs tens of thousands of 'FlateDecode'd (binary stream) PDF files. I need to perform this function in Windows, and as of right now, it looks like I might be limited to AutoIt.

I come from a *nix background, so I'm likely missing something simple. Hence my posting.

I do not need help doing this work; I need help optimizing the work I'm doing within AutoIt. AutoIt seems to crawl on binary file manipulation.

I need to edit each PDF then copy the PDF to another location.

I ran my C code on Linux and it chews through about 80,000 files (~3GB of data) in under a minute on my laptop. Since I could open each file for read *and* write at the same time, I could edit the files in place.

Cygwin doesn't implement fork() well at all, and bumps into resource limitations.

AutoIt works just fine, albeit at a snail's pace: less than one file per second.

Here's the bottleneck snippet that is very similar to my C code:

CODE

Func _RepairPDFHeader( $fullfilename)
    Local $inFileH, $otFileH, $char, $charNum = 0, $lineNum = 0, $fixed = 0
    Local $string = "%PDF-", $filesuffix = ".tmp"
    Local $szDrive, $szDir, $szFName, $szExt

    $inFileH = FileOpen( $fullfilename, 0)
    If -1 = $inFileH Then
        Return False
    EndIf
    $otFileH = FileOpen( $fullfilename & $filesuffix, 2)
    If -1 = $otFileH Then
        Return False
    EndIf

    _PathSplit( $fullfilename, $szDrive, $szDir, $szFName, $szExt)

    While 1
        $char = FileRead( $inFileH, 1)
        If @error = -1 Then ExitLoop ; EOF reached
        $charNum += 1
        If @LF = $char Then ; find end of line 1
            $lineNum += 1
        EndIf
        If (( "-" = $char) And ( $lineNum < 1)) Then ; Only fix the first line
            FileWrite( $otFileH, "%")
            FileWrite( $otFileH, "P")
            FileWrite( $otFileH, "D")
            FileWrite( $otFileH, "F")
            FileWrite( $otFileH, "-")
            $fixed = 1
        ElseIf 0 = $fixed Then ; until we hit the first "-" on line 1
            ContinueLoop ; write nothing to the output file
        Else ; Else copy the file
            FileWrite( $otFileH, $char)
        EndIf
    WEnd

    FileClose( $inFileH)
    FileClose( $otFileH)

    FileMove( $fullfilename & $filesuffix, $workloaddir & $szFName & "." & $szExt, 1)
    FileDelete( $fullfilename & $filesuffix)

    Return True
EndFunc

PsaltyDS · July 1, 2008

Is it really necessary to process the file one CHARACTER at a time?

goldenix · July 1, 2008

May I suggest something?

$char = FileRead( $inFileH, 1)

Here you read only 1 char of binary data from the file at a time. What if you read the whole file at once?

$char = FileRead( $inFileH)

zorphnog · July 1, 2008

I've optimized your code somewhat so that it doesn't read the whole file, just the first line. I don't have any test files to work with so I don't know how well it works.

Func _RepairPDFHeader( $fullfilename)
    Local $inFileH, $otFileH, $char, $charNum = 0, $lineNum = 0, $fixed = 0
    Local $string = "%PDF-", $filesuffix = ".tmp"
    Local $szDrive, $szDir, $szFName, $szExt

    $inFileH = FileOpen( $fullfilename, 0)
    If -1 = $inFileH Then Return False
    $otFileH = FileOpen( $fullfilename & $filesuffix, 2)
    If -1 = $otFileH Then Return False

    _PathSplit( $fullfilename, $szDrive, $szDir, $szFName, $szExt)

    While 1
        $char = FileRead( $inFileH, 1)
        If @error = -1 Then ExitLoop ; EOF reached
        $charNum += 1
        If @LF = $char Then $lineNum += 1
        If (( "-" = $char) And ( $lineNum < 1)) Then ; Only fix the first line
            FileWrite( $otFileH, "%")
            FileWrite( $otFileH, "P")
            FileWrite( $otFileH, "D")
            FileWrite( $otFileH, "F")
            FileWrite( $otFileH, "-")
            $fixed = 1
            $char = FileRead( $inFileH)
            FileWrite( $otFileH, $char)
            ExitLoop
        ElseIf $lineNum > 1 Then
            ExitLoop
        EndIf
    WEnd

    FileClose( $inFileH)
    FileClose( $otFileH)

    If $fixed = 1 Then FileMove( $fullfilename & $filesuffix, $workloaddir & $szFName & "." & $szExt, 1)
    FileDelete( $fullfilename & $filesuffix)
    
    return True

EndFunc

rmh · July 1, 2008

zorphnog wrote: "I've optimized your code somewhat so that it doesn't read the whole file, just the first line. I don't have any test files to work with so I don't know how well it works."

Func _RepairPDFHeader( $fullfilename)
    Local $inFileH, $otFileH, $char, $charNum = 0, $lineNum = 0, $fixed = 0
    Local $filesuffix = ".tmp"
    Local $szDrive, $szDir, $szFName, $szExt

    $inFileH = FileOpen( $fullfilename, 0)
    If -1 = $inFileH Then Return False
    $otFileH = FileOpen( $fullfilename & $filesuffix, 2)
    If -1 = $otFileH Then Return False

    _PathSplit( $fullfilename, $szDrive, $szDir, $szFName, $szExt)

    While 1
        $char = FileRead( $inFileH, 1)
        If @error = -1 Then ExitLoop ; EOF reached
        If @LF = $char Then $lineNum += 1
        If (( "-" = $char) And ( $lineNum < 1)) Then ; Only fix the first line
            FileWrite( $otFileH, "%")
            FileWrite( $otFileH, "P")
            FileWrite( $otFileH, "D")
            FileWrite( $otFileH, "F")
            FileWrite( $otFileH, "-")
            $fixed = 1
            $char = FileRead( $inFileH)            ; re-reads the entire input file?!
            FileWrite( $otFileH, $char)              ; appends the entire input file
            ExitLoop
        ElseIf $lineNum > 1 Then
            ExitLoop
        EndIf
    WEnd

    FileClose( $inFileH)
    FileClose( $otFileH)

    If $fixed = 1 Then FileMove( $fullfilename & $filesuffix, $workloaddir & $szFName & "." & $szExt, 1)
    FileDelete( $fullfilename & $filesuffix)
    
    return True

EndFunc

Thanks for the assistance, all. Unfortunately, it appears that reading a file one byte at a time and then reading the entire file doesn't keep track of the current position in the file being read. See my comments in the code snippet above.

The result is that the above code produces an output file that contains the *complete* input file appended to "%PDF-", not the edited file the original code produced.

E.g. if the original input file contains the following first line (the amount of junk before the hyphen varies):

089asdfhhEX;-1.4

the original code produced

%PDF-1.4

the optimized code produces

%PDF-089asdfhhEX;-1.4

PsaltyDS · July 1, 2008

What's the largest file you are dealing with? If it's less than about 128MB, then you can read the entire file into a variable and tweak it in memory and just write to the disk once:

Func _RepairPDFHeader( $fullfilename)
    If Not FileExists($fullfilename) Then Return SetError(1, 0, 0)
    
    Local $sData = FileRead($fullfilename)
    $iDash = StringInStr($sData, "-")
    If $iDash Then
        $sData = StringReplace($sData, StringLeft($sData, $iDash), "%PDF-", 1)
        $hFile = FileOpen($fullfilename, 2)
        FileWrite($hFile, $sData)
        FileClose($hFile)
        Return 1
    Else
        Return SetError(2, 0, 0)
    EndIf
EndFunc

zorphnog · July 1, 2008

Hmm. The code worked for me. I edited a pdf to read "089asdfhhEX;-1.3" on the first line. Ran it through the script and it produced "%PDF-1.3".

Could you post an example file that is not working? And are you using the latest version of AutoIt?

Oh yeah I forgot to mention you can reduce your single FileWrite commands into one: FileWrite( $otFileH, "%PDF-")

rmh · July 1, 2008

Thanks. I will look into this.

My previous attempts at string manipulation of binary file data resulted in even more corrupt binary files when the binary bytes were converted to strings automatically; that's why I was avoiding string manipulation.

Zorphnog - Thanks for the update. AutoIt says it's version 3.2.2.0 on this computer, 3.2.10.0 on my other computer. That brings up a great point - I should have made sure I was on the latest version when I first started running into problems. Thanks for the pointer.

I'll upgrade my machines first, then go through this exercise again and post results tomorrow.

PsaltyDS · July 1, 2008

I was wondering how you were managing to treat these PDFs as text files, and I guess you weren't!

Though the code may have to be changed, the principle is the same. Read the whole thing into a variable, which can be binary if you FileOpen() with mode 16 first. Then manipulate the first few bytes as required and write the whole thing (again, mode 16) in one shot.
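A minimal sketch of that principle, for illustration only. The function name `_RepairPDFHeaderBinary`, its two path parameters, and the 1024-byte search window are hypothetical and not from this thread. It also assumes the junk before the hyphen is printable text on a single-byte ANSI code page, so that character positions in the converted string match byte positions in the file.

```autoit
; Hypothetical helper, not code posted in this thread.
Func _RepairPDFHeaderBinary($sSource, $sTarget)
    Local $hIn = FileOpen($sSource, 16) ; 16 = read, binary mode
    If $hIn = -1 Then Return SetError(1, 0, 0)
    Local $bData = FileRead($hIn) ; whole file as binary data
    FileClose($hIn)

    ; Convert only the start of the file to a string to locate the header.
    ; 1024 is an assumed upper bound on the junk before the hyphen.
    Local $iHead = BinaryLen($bData)
    If $iHead > 1024 Then $iHead = 1024
    Local $sHead = BinaryToString(BinaryMid($bData, 1, $iHead))

    Local $iDash = StringInStr($sHead, "-")
    Local $iLF = StringInStr($sHead, @LF)
    ; Give up if there is no hyphen, or if it is not on the first line
    If $iDash = 0 Or ($iLF > 0 And $iLF < $iDash) Then Return SetError(2, 0, 0)

    Local $hOut = FileOpen($sTarget, 2 + 16) ; write-erase, binary
    If $hOut = -1 Then Return SetError(3, 0, 0)
    FileWrite($hOut, StringToBinary("%PDF-")) ; new header
    FileWrite($hOut, BinaryMid($bData, $iDash + 1)) ; untouched bytes after the hyphen
    FileClose($hOut)
    Return 1
EndFunc
```

Only the first few bytes ever pass through a string conversion here; the rest of the stream is written back as binary, which is meant to avoid the corruption described above.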

LarryDalooza · July 1, 2008

For in-place editing, perhaps...

http://www.autoitscript.com/forum/index.ph...st&p=225490

There the API has been ported so that you can use _APIFileSetPos to move about within a file.

Lar.

PsaltyDS · July 1, 2008

That requires the written data to match the original length byte-for-byte, doesn't it? Can you use that to replace "089asdfhhEX;-1.4" with "%PDF-1.4"?

rmh · July 2, 2008

Thanks to all who have provided suggestions and solutions.

I updated all machines to v3.2.12.1 and that alleviated some of the subtle wonky behavior I was experiencing.

The first solution (writing the temp file then moving that file to the target dir) processed nearly 46 PDF files per second (~14,000 in ~5min).

The final solution (using the Func below) processed nearly 50 files per second (no temp file).

The average size of the PDF files in this test sample is ~35KB (range: 20KB-320KB).

Func _RepairPDFHeader( $fullfilename, $targetdir)
    Local $otFileH, $char = "-", $sFix = "%PDF-", $sData, $charpos
    Local $szDrive, $szDir, $szFName, $szExt

    _PathSplit( $fullfilename, $szDrive, $szDir, $szFName, $szExt)
    
    $sData = FileRead( $fullfilename)
    $charpos = StringInStr( $sData, $char)
    If $charpos Then
        $sData = StringReplace( $sData, StringLeft( $sData, $charpos), $sFix, 1)
        $otFileH = FileOpen( $targetdir & $szFName & $szExt, 2+16)               ; write-erase, binary; $szExt from _PathSplit() already includes the leading dot
        If -1 = $otFileH Then Return 0                                           ; test the output handle; $inFileH does not exist in this function
        FileWrite( $otFileH, $sData)
        FileClose( $otFileH)
        Return 1
    Else
        Return 0
    EndIf
    
EndFunc

Thanks again!

rmh · July 2, 2008

As an aside/update...

I have also optimized the function (below) that counts the number of pages in the PDF files, in order to perform some rudimentary validation that the files are actually what they are expected to be, and potentially not corrupted. I use $length = 4 since none of the PDF files is going to be larger than 9999 pages.

Func _CountPDFPages( $fullfilename)
    Local $strpos = 0, $findstr = "/Count ", $length = 4

    If Not FileExists( $fullfilename) Then Return SetError(1, 0, 0)

    $sData = FileRead( $fullfilename)
    $strpos = StringInStr( $sData, $findstr, 1, -1)
    If $strpos Then
        Return Int( StringMid( $sData, $strpos + StringLen( $findstr), $length))
    Else
        Return SetError(2, 0, 0)
    EndIf
EndFunc

Now I can process that same test dataset in under 90 seconds (nearly 168 files per second).
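For anyone wanting to reproduce throughput figures like these, here is a rough sketch of a driver loop. The function name `_ProcessAllPDFs`, both directory parameters (assumed to end in a backslash), and the rule that a page count of 0 means "skip this file" are assumptions for illustration, not part of the code above. It relies on the `_RepairPDFHeader` and `_CountPDFPages` functions from this thread and on `#include <File.au3>` for `_PathSplit`.

```autoit
#include <File.au3> ; needed by _PathSplit() inside _RepairPDFHeader()

; Hypothetical driver: repairs every *.pdf in $sSourceDir into $sTargetDir
; and returns the number of files processed per second.
Func _ProcessAllPDFs($sSourceDir, $sTargetDir)
    Local $iDone = 0, $sName
    Local $hSearch = FileFindFirstFile($sSourceDir & "*.pdf")
    If $hSearch = -1 Then Return 0 ; no matching files

    Local $hTimer = TimerInit()
    While 1
        $sName = FileFindNextFile($hSearch)
        If @error Then ExitLoop ; no more files
        ; Assumed validation rule: skip files with no readable page count
        If _CountPDFPages($sSourceDir & $sName) > 0 Then
            If _RepairPDFHeader($sSourceDir & $sName, $sTargetDir) Then $iDone += 1
        EndIf
    WEnd
    FileClose($hSearch)

    Local $fSeconds = TimerDiff($hTimer) / 1000 ; TimerDiff() returns milliseconds
    If $fSeconds = 0 Then Return $iDone
    Return $iDone / $fSeconds
EndFunc
```

The page count is taken from the source file because it only looks for "/Count " and does not depend on the header having been repaired yet.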
