rmh Posted July 1, 2008

I would like to optimize my AutoIt code that repairs tens of thousands of 'FlateDecode'd (binary stream) PDF files. I need to perform this function in Windows, and as of right now it looks like I might be limited to AutoIt. I come from a *nix background, so I'm likely missing something simple. Hence my posting.

I do not need help doing this work; I need help optimizing the work I'm doing within AutoIt. AutoIt seems to crawl on binary file manipulation. I need to edit each PDF, then copy the PDF to another location.

I ran my C code on Linux and it chews through about 80,000 files (~3GB of data) in under a minute on my laptop. Since I could open each file for read *and* write at the same time, I could edit the files in place. Cygwin doesn't implement fork() well at all, and bumps into resource limitations. AutoIt works just fine, albeit at a snail's pace: less than one file per second.

Here's the bottleneck snippet, which is very similar to my C code:

CODE
    Func _RepairPDFHeader( $fullfilename)
        Local $inFileH, $otFileH, $char, $charNum = 0, $lineNum = 0, $fixed = 0
        Local $string = "%PDF-", $filesuffix = ".tmp"
        Local $szDrive, $szDir, $szFName, $szExt

        $inFileH = FileOpen( $fullfilename, 0)
        If -1 = $inFileH Then
            return False
        EndIf
        $otFileH = FileOpen( $fullfilename & $filesuffix, 2)
        If -1 = $otFileH Then
            return False
        EndIf
        _PathSplit( $fullfilename, $szDrive, $szDir, $szFName, $szExt)

        While 1
            $char = FileRead( $inFileH, 1)
            If @error = -1 Then ExitLoop ; EOF reached
            $charNum += 1
            If @LF = $char Then ; find end of line 1
                $lineNum += 1
            EndIf
            If (( "-" = $char) And ( $lineNum < 1)) Then ; Only fix the first line
                FileWrite( $otFileH, "%")
                FileWrite( $otFileH, "P")
                FileWrite( $otFileH, "D")
                FileWrite( $otFileH, "F")
                FileWrite( $otFileH, "-")
                $fixed = 1
            ElseIf 0 = $fixed Then ; until we hit the first "-" on line 1
                ContinueLoop ; write nothing to the output file
            Else ; Else copy the file
                FileWrite( $otFileH, $char)
            EndIf
        WEnd

        FileClose( $inFileH)
        FileClose( $otFileH)
        FileMove( $fullfilename & $filesuffix, $workloaddir & $szFName & "." & $szExt, 1)
        FileDelete( $fullfilename & $filesuffix)
        return True
    EndFunc
PsaltyDS Posted July 1, 2008

Is it really necessary to process the file one CHARACTER at a time?

--
Valuater's AutoIt 1-2-3, Class... Is now in Session!
For those who want somebody to write the script for them: RentACoder
"Any technology distinguishable from magic is insufficiently advanced." -- Geek's corollary to Clarke's law
goldenix Posted July 1, 2008

May I suggest something?

    $char = FileRead( $inFileH, 1)

Here you read only 1 char of binary data from the file at a time. What if you read the whole file at once?

    $char = FileRead( $inFileH)

--
My Projects:
- Guide - ytube step by step tut for reading memory with autoitscript + samples
- WinHide - tool to show hide windows, Skinned With GDI+
- Virtualdub batch job list maker - Batch Process all files with same settings
- Exp calc - Exp calculator for online games
- Automated Microsoft SQL Server 2000 installer
- Image sorter helper for IrfanView - 1 click opens img & move ur mouse to close opened img
zorphnog Posted July 1, 2008

I've optimized your code somewhat so that it doesn't read the whole file one character at a time, just the first line. I don't have any test files to work with, so I don't know how well it works.

CODE
    Func _RepairPDFHeader( $fullfilename)
        Local $inFileH, $otFileH, $char, $charNum = 0, $lineNum = 0, $fixed = 0
        Local $string = "%PDF-", $filesuffix = ".tmp"
        Local $szDrive, $szDir, $szFName, $szExt

        $inFileH = FileOpen( $fullfilename, 0)
        If -1 = $inFileH Then Return False
        $otFileH = FileOpen( $fullfilename & $filesuffix, 2)
        If -1 = $otFileH Then Return False
        _PathSplit( $fullfilename, $szDrive, $szDir, $szFName, $szExt)

        While 1
            $char = FileRead( $inFileH, 1)
            If @error = -1 Then ExitLoop ; EOF reached
            $charNum += 1
            If @LF = $char Then $lineNum += 1
            If (( "-" = $char) And ( $lineNum < 1)) Then ; Only fix the first line
                FileWrite( $otFileH, "%")
                FileWrite( $otFileH, "P")
                FileWrite( $otFileH, "D")
                FileWrite( $otFileH, "F")
                FileWrite( $otFileH, "-")
                $fixed = 1
                $char = FileRead( $inFileH)
                FileWrite( $otFileH, $char)
                ExitLoop
            ElseIf $lineNum > 1 Then
                ExitLoop
            EndIf
        WEnd

        FileClose( $inFileH)
        FileClose( $otFileH)
        If $fixed = 1 Then FileMove( $fullfilename & $filesuffix, $workloaddir & $szFName & "." & $szExt, 1)
        FileDelete( $fullfilename & $filesuffix)
        return True
    EndFunc
rmh Posted July 1, 2008 (Author)

Quoting zorphnog's code, with my comments added:

CODE
    Func _RepairPDFHeader( $fullfilename)
        Local $inFileH, $otFileH, $char, $charNum = 0, $lineNum = 0, $fixed = 0
        Local $filesuffix = ".tmp"
        Local $szDrive, $szDir, $szFName, $szExt

        $inFileH = FileOpen( $fullfilename, 0)
        If -1 = $inFileH Then Return False
        $otFileH = FileOpen( $fullfilename & $filesuffix, 2)
        If -1 = $otFileH Then Return False
        _PathSplit( $fullfilename, $szDrive, $szDir, $szFName, $szExt)

        While 1
            $char = FileRead( $inFileH, 1)
            If @error = -1 Then ExitLoop ; EOF reached
            If @LF = $char Then $lineNum += 1
            If (( "-" = $char) And ( $lineNum < 1)) Then ; Only fix the first line
                FileWrite( $otFileH, "%")
                FileWrite( $otFileH, "P")
                FileWrite( $otFileH, "D")
                FileWrite( $otFileH, "F")
                FileWrite( $otFileH, "-")
                $fixed = 1
                $char = FileRead( $inFileH) ; re-reads the entire input file?!
                FileWrite( $otFileH, $char) ; appends the entire input file
                ExitLoop
            ElseIf $lineNum > 1 Then
                ExitLoop
            EndIf
        WEnd

        FileClose( $inFileH)
        FileClose( $otFileH)
        If $fixed = 1 Then FileMove( $fullfilename & $filesuffix, $workloaddir & $szFName & "." & $szExt, 1)
        FileDelete( $fullfilename & $filesuffix)
        return True
    EndFunc

Thanks for the assistance, all. Unfortunately, it appears that reading a file one byte at a time and then reading the entire file doesn't keep track of the current position in the file being read. See my comments in the code snippet above. The result is that the above code produces an output file that contains the *complete* input file appended to "%PDF-", not the edited file the original code produced.

E.g., if the original input file contains the following first line (the amount of junk before the hyphen varies):

    089asdfhhEX;-1.4

the original code produced

    %PDF-1.4

the optimized code produces

    %PDF-089asdfhhEX;-1.4
PsaltyDS Posted July 1, 2008

What's the largest file you are dealing with? If it's less than about 128MB, then you can read the entire file into a variable, tweak it in memory, and write to the disk just once:

CODE
    Func _RepairPDFHeader( $fullfilename)
        If Not FileExists($fullfilename) Then Return SetError(1, 0, 0)
        Local $sData = FileRead($fullfilename)
        Local $iDash = StringInStr($sData, "-") ; position of the first dash
        If $iDash Then
            ; replace everything up to and including the first dash with the header
            $sData = StringReplace($sData, StringLeft($sData, $iDash), "%PDF-", 1)
            Local $hFile = FileOpen($fullfilename, 2) ; 2 = write, erase previous contents
            FileWrite($hFile, $sData)
            FileClose($hFile)
            Return 1
        Else
            Return SetError(2, 0, 0)
        EndIf
    EndFunc
zorphnog Posted July 1, 2008

Hmm. The code worked for me. I edited a PDF to read "089asdfhhEX;-1.3" on the first line, ran it through the script, and it produced "%PDF-1.3". Could you post an example file that is not working? And are you using the latest version of AutoIt?

Oh yeah, I forgot to mention you can reduce your single-character FileWrite commands to one:

    FileWrite( $otFileH, "%PDF-")
rmh Posted July 1, 2008 (Author)

PsaltyDS - Thanks. I will look into this. My previous attempts at string manipulation of binary file data resulted in even more corrupt binary files when the binary bytes were converted to strings automatically. That's why I was avoiding string manipulation.

Zorphnog - Thanks for the update. AutoIt says it's version 3.2.2.0 on this computer and 3.2.10.0 on my other computer. That brings up a great point: I should have made sure I was on the latest version when I first started running into problems. Thanks for the pointer.

I'll upgrade my machines first, then go through this exercise again and post results tomorrow.
PsaltyDS Posted July 1, 2008

I was wondering how you were managing to treat these PDFs as text files, and I guess you weren't!

Though the code may have to be changed, the principle is the same. Read the whole thing into a variable, which can be binary if you FileOpen() with mode 16 first. Then manipulate the first few bytes as required, and write the whole thing (again, mode 16) in one shot.
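The binary-mode approach described above might look something like the following. This is a minimal sketch, not code from the thread: the function name _RepairPDFHeaderBinary, its two path parameters, and the 1024-byte scan limit are all assumptions made for illustration. It finds the first "-" (byte 0x2D) on line 1 by scanning a hex dump of the leading bytes, so no binary-to-string conversion touches the rest of the file.

```autoit
; Sketch only: binary-safe variant of the header repair.
; Function name, parameters and the 1024-byte limit are illustrative assumptions.
Func _RepairPDFHeaderBinary($sSrcPath, $sDstPath)
    ; Mode 16 = binary read, so FileRead() returns raw bytes rather than a string
    Local $hIn = FileOpen($sSrcPath, 16)
    If $hIn = -1 Then Return SetError(1, 0, 0)
    Local $bData = FileRead($hIn)
    FileClose($hIn)

    ; Only the start of the file is inspected
    Local $iScan = BinaryLen($bData)
    If $iScan > 1024 Then $iScan = 1024

    ; String() of a binary yields "0x" followed by two hex digits per byte
    Local $sHex = StringTrimLeft(String(BinaryMid($bData, 1, $iScan)), 2)
    Local $iDash = 0, $sByte
    For $i = 1 To StringLen($sHex) - 1 Step 2
        $sByte = StringMid($sHex, $i, 2)
        If $sByte = "0A" Then ExitLoop ; line feed: end of line 1, no dash found
        If $sByte = "2D" Then ; "-" is byte 0x2D
            $iDash = Int(($i + 1) / 2) ; 1-based byte offset of the dash
            ExitLoop
        EndIf
    Next
    If $iDash = 0 Then Return SetError(2, 0, 0)

    ; Mode 2 + 16 = erase and write, binary
    Local $hOut = FileOpen($sDstPath, 2 + 16)
    If $hOut = -1 Then Return SetError(3, 0, 0)
    FileWrite($hOut, StringToBinary("%PDF-")) ; the repaired header
    FileWrite($hOut, BinaryMid($bData, $iDash + 1)) ; every byte after the dash, untouched
    FileClose($hOut)
    Return 1
EndFunc
```

The header and the remainder are written with two FileWrite() calls instead of being joined with &, since & on binary values produces a string rather than a longer binary.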
LarryDalooza Posted July 1, 2008

For in-place editing, perhaps...

http://www.autoitscript.com/forum/index.ph...st&p=225490

where the API has been ported so that you can use _APIFileSetPos to move about within a file.

Lar.

--
AutoIt has helped make me wealthy
PsaltyDS Posted July 1, 2008

That requires the written data to match the original length byte-for-byte, doesn't it? Can you use that to replace "089asdfhhEX;-1.4" with "%PDF-1.4"?
rmh Posted July 2, 2008 (Author)

Thanks to all who have provided suggestions and solutions. I updated all machines to v3.2.12.1, and that alleviated some of the subtle wonky behavior I was experiencing.

The first solution (writing the temp file, then moving that file to the target dir) processed nearly 46 PDF files per second (~14,000 in ~5min). The final solution (using the Func below) processed nearly 50 files per second (no temp file). The average size of the PDF files in this test sample is ~35KB (range: 20KB-320KB).

CODE
    #include <File.au3> ; for _PathSplit

    Func _RepairPDFHeader( $fullfilename, $targetdir)
        Local $otFileH, $char = "-", $sFix = "%PDF-", $sData, $charpos
        Local $szDrive, $szDir, $szFName, $szExt

        _PathSplit( $fullfilename, $szDrive, $szDir, $szFName, $szExt)
        $sData = FileRead( $fullfilename)
        $charpos = StringInStr( $sData, $char)
        If $charpos Then
            $sData = StringReplace( $sData, StringLeft( $sData, $charpos), $sFix, 1)
            $otFileH = FileOpen( $targetdir & $szFName & "." & $szExt, 2+16) ; write-erase, binary
            If -1 = $otFileH Then Return 0 ; check the output handle opened above
            FileWrite( $otFileH, $sData)
            FileClose( $otFileH)
            Return 1
        Else
            Return 0
        EndIf
    EndFunc

Thanks again!
rmh Posted July 2, 2008 (Author)

As an aside/update...

I have also optimized the function (below) that counts the number of pages in the PDF files, in order to perform some rudimentary validation that the files are actually what they are expected to be, and potentially not corrupted. I use $length = 4 since none of the PDF files is going to be larger than 9999 pages.

CODE
    Func _CountPDFPages( $fullfilename)
        Local $strpos = 0, $findstr = "/Count ", $length = 4, $sData

        If Not FileExists( $fullfilename) Then Return SetError(1, 0, 0)
        $sData = FileRead( $fullfilename)
        $strpos = StringInStr( $sData, $findstr, 1, -1) ; case-sensitive, last occurrence
        If $strpos Then
            Return Int( StringMid( $sData, $strpos + StringLen( $findstr), $length))
        Else
            Return SetError(2, 0, 0)
        EndIf
    EndFunc

Now I can process that same test dataset in under 90 seconds (nearly 168 files per second).
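Files-per-second figures like the ones quoted in this thread can be measured with a small timing harness. The sketch below is an assumption about how such a measurement could be done, not the method rmh used: _TimeBatch and its parameters are hypothetical names, and the function under test is passed by name and must accept a single full-path argument.

```autoit
; Sketch only: time a per-file function over every *.pdf in a folder.
; _TimeBatch and its parameters are illustrative names, not from the thread.
Func _TimeBatch($sDir, $sFuncName)
    Local $hSearch = FileFindFirstFile($sDir & "\*.pdf")
    If $hSearch = -1 Then Return SetError(1, 0, 0) ; no matching files
    Local $iCount = 0, $sFile
    Local $hTimer = TimerInit()
    While 1
        $sFile = FileFindNextFile($hSearch)
        If @error Then ExitLoop ; no more files
        Call($sFuncName, $sDir & "\" & $sFile) ; run the function under test
        $iCount += 1
    WEnd
    FileClose($hSearch)
    Local $fSeconds = TimerDiff($hTimer) / 1000 ; TimerDiff() reports milliseconds
    Local $fRate = 0
    If $fSeconds > 0 Then $fRate = $iCount / $fSeconds
    Return SetError(0, $iCount, $fRate) ; returns files/second, @extended = files processed
EndFunc
```

Calling _TimeBatch("C:\pdfs", "_CountPDFPages") (the folder is a placeholder) would return the measured rate, with the number of files processed available in @extended.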