squishlax, posted August 29, 2007

OK, so I wrote a program that reads through a large text file (5 million+ lines, 3.2 GB) and extracts valid lines based on some criteria. The problem is that it is slooooow, and I need to get this done for work. Telling them that a machine running for two weeks straight won't even get through the file is not acceptable. So I'm asking if anyone knows of a quicker way to get this done. I use FileOpen and FileReadLine, split the string, check a couple of fields, and then either write the line to another file or move on...
Zedna, posted August 29, 2007 (edited August 29, 2007)

Never use FileReadLine() with the "line" parameter if speed is the main criterion!! Use FileOpen() and FileRead(), or look at the APIFileReadWrite UDF.

From a performance standpoint it is a bad idea to read line by line while specifying a "line" parameter whose value increments by one. This forces AutoIt to reread the file from the beginning until it reaches the specified line.

For performance it is better to read a big chunk of data from the file into memory in one read. Instead of 5,000,000 slow read operations (one read per line), read the 3.2 GB as, for example, roughly 3,200 chunks of 1 MB each, and parse the lines of each chunk in memory.

EDIT: correction about the line parameter and big-chunk reading
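
The chunked reading Zedna describes might look roughly like the sketch below. It is only an illustration under stated assumptions: the file names, the 1 MB chunk size, the helper `_FilterLines`, and the `"KEEP"` test on column 66 are hypothetical placeholders, not anything posted in this thread. The one detail that needs care is that a chunk usually ends in the middle of a line, so the unfinished tail is carried over to the next read.

```autoit
; Sketch only: file names, chunk size, _FilterLines and the "KEEP" value are hypothetical.
Local Const $CHUNK_SIZE = 1024 * 1024          ; characters requested per read (about 1 MB)
Local $hIn = FileOpen("input.tsv", 0)          ; 0 = read mode
Local $hOut = FileOpen("output.tsv", 2)        ; 2 = write mode, erases previous contents
If $hIn = -1 Or $hOut = -1 Then
    MsgBox(0, "Error", "Unable to open input or output file.")
    Exit
EndIf

Local $sCarry = ""                             ; unfinished line left over from the previous chunk
While 1
    Local $sChunk = FileRead($hIn, $CHUNK_SIZE)
    Local $iErr = @error                       ; capture now, the next function call resets @error
    $sCarry &= $sChunk
    ; Everything up to the last line feed is a run of complete lines.
    Local $iLastLF = StringInStr($sCarry, @LF, 0, -1)
    If $iLastLF > 0 Then
        _FilterLines(StringLeft($sCarry, $iLastLF - 1), $hOut)
        $sCarry = StringMid($sCarry, $iLastLF + 1)
    EndIf
    If $iErr Then ExitLoop                     ; end of file or read error
WEnd
If $sCarry <> "" Then _FilterLines($sCarry, $hOut) ; last line without a trailing line feed
FileClose($hIn)
FileClose($hOut)

; Splits a block of complete lines and writes the ones that pass the check.
Func _FilterLines($sBlock, $hFile)
    Local $aLines = StringSplit(StringStripCR($sBlock), @LF)
    For $i = 1 To $aLines[0]
        Local $aFields = StringSplit($aLines[$i], @TAB)
        If $aFields[0] < 70 Then ContinueLoop  ; skip blank or short lines before indexing
        If $aFields[66] = "KEEP" Then FileWriteLine($hFile, $aLines[$i])
    Next
EndFunc   ;==>_FilterLines
```

Whether this beats a plain sequential FileReadLine() loop on a given machine is worth timing on a small sample of the file before committing to either approach.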
Sunaj, posted August 29, 2007 (edited August 29, 2007)

Seriously, this is such a huge file that I would suggest using a dedicated tool like TextPipe Engine [http://pcwin.com/Software_Development/TextPipe_Engine/index.htm] or something similar. From the page description it seems you could also easily script this tool from AutoIt if needed (it has command-line support). I did not find any freeware/open-source alternative during my brief search, but it should be possible! Of course, if this could be done in AutoIt it would be GREAT.

Sunaj

EDIT: removed wrong data
EDIT2: http://gnuwin32.sourceforge.net/packages/gawk.htm
Zedna, posted August 29, 2007 (edited August 29, 2007)

Post the code snippet for your main reading loop and we will see what can be optimized...
Zedna, posted August 29, 2007

@Sunaj
I think all these external tools must use the same file-read API functions internally. So it will be much simpler and more effective to use these APIs directly through DllCall, as is done in the APIFileReadWrite UDF.
randallc, posted August 29, 2007 (edited August 29, 2007)

Hi,

Wrapping DOS findstr may save you a lot of hard work for a one-off problem, e.g.:

```autoit
#include<Array.au3>
;~ #include"DOSComs.au3"
Dim $s_AnswerFile
$s_file=FileOpenDialog("Choose",@ScriptDir,"Files (*.txt)")
;$s_Searches="*line*.*"
;FINDSTR /B/R/I/N /c:"vb.*pt" "C:\Program Files\Macro Express3\hta4\DateForm4.vbs">"C:\Program Files\Macro Express3\Answer.txt"
;~ $s_Searches="li"
$s_Searches="line"
;~ $s_Switches="/M "
$s_Switches="/N "
$t = TimerInit()
;Syntax; _DOSFindLines($s_file,$s_AnswerFile,$s_Searches,$s_Case,$s_Array,$i_Literal); $s_AnswerFile ByRef needs not be defined
;$ar_ArrayList=_FindLinesDOS2($s_file,$s_AnswerFile,$s_Searches,0,0,0); AnswerFile ByRef
$ar_ArrayList=_FindFiles_LinesDOS2($s_file,$s_AnswerFile,$s_Searches,$s_Switches,0,1,0); AnswerFile ByRef
ConsoleWrite(Timerdiff($t) & @CRLF)
if IsArray($ar_ArrayList) and UBound($ar_ArrayList)<2000 then
    _ArrayDisplay($ar_ArrayList,"Answer Array")
Else
    RunWait("Notepad.exe " & $s_AnswerFile,@ScriptDir,@SW_SHOW)
EndIf
Exit

Func _FindFiles_LinesDOS2(ByRef $s_file, ByRef $s_AnswerFile, $s_Searches,$s_Switches="", $i_Case = 0, $i_Array = 0, $i_Literal = 0)
    ;Syntax; _DOSFindLines($s_file,$s_AnswerFile,$s_Searches,$i_Literal); $s_AnswerFile ByRef needs not be defined
    ; Parameters; $i_Literal=1 implies spaces are delimiters for a number of search strings in $s_Searches, rather than spaces included in search
    Local $asList
    If FileExists($s_file) Then
        If $i_Literal = 1 Then
            $i_Literal = "/c:"
        Else
            $i_Literal = ""
        EndIf
        If $i_Case = 1 Then
            $i_Case = "/I"
        Else
            $i_Case = ""
        EndIf
        $Position = StringInStr($s_file, "\", 0, -1)
        $s_Path = StringLeft($s_file, $Position)
        FileChangeDir($s_Path)
        $s_AnswerFile = $s_Path & "AnswerFindLines.txt"
        $s_Searches = StringReplace($s_Searches, ".", "\."); to use as RegExp in "FindLines"
        $s_Searches = StringReplace($s_Searches, "*", ".*"); to use as RegExp in "FindLines"
        $s_Searches = '"' & $s_Searches & '"'
        If StringInStr($s_file, " ") Then $s_file = '"' & $s_file & '"'
        If StringInStr($s_AnswerFile, " ") Then $s_AnswerFile = '"' & $s_AnswerFile & '"'
        ; Set the Command and Run Dos
        $s_Command = 'type | findstr /R' & $i_Case & ' ' & $i_Literal & ' ' & $s_Switches & $s_Searches & ' ' & $s_file & ' > ' & $s_AnswerFile; rem /c: for literal?
        MsgBox(0,"","$s_Command="&$s_Command)
        ConsoleWrite($s_Command)
        RunWait(@ComSpec & " /c " & $s_Command, @ScriptDir, @SW_HIDE)
        $s_AnswerFile = StringReplace($s_AnswerFile, '"', '')
        $s_file = StringReplace($s_file, '"', "")
        If $i_Array Then
            ;FileReadToArray($s_AnswerFile, $asList)
            $sList = FileRead($s_AnswerFile, FileGetSize($s_AnswerFile))
            $sList = StringTrimRight(StringReplace($sList, @CRLF, @LF), 1)
            $asList = StringSplit($sList, @LF)
        EndIf
    Else
        SetError(1)
    EndIf
    Return $asList
EndFunc ;==>_FindLinesDOS
```

Best, randall

PS: There are a lot of switches and memory options here in DOS, as far as I remember; I haven't looked at it for a while.
squishlax (author), posted August 31, 2007

OK, I've been busy and haven't had time to work on this project the past couple of days... Here is what I can show you of what I'm trying to get done:

```autoit
Opt("WinWaitDelay",100)
Opt("WinTitleMatchMode",4)
Opt("WinDetectHiddenText",1)
Opt("MouseCoordMode",0)
#include <file.au3>
#include <Array.au3>
#include <Date.au3>

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;FILE CONTROLS
$Data_File_Path =;**CANT SHOW YOU THE FILE PATH OR NAME**;
$Data_File = FileOpen($Data_File_Path, 1)
; Check if file opened for reading OK
If $Data_File = -1 Then
    MsgBox(0, "Error", "Unable to open INF file.")
    Exit
EndIf
$File_Path =;**CANT SHOW YOU THE FILE PATH OR NAME**;
; Shows the filenames of all files in the current directory, note that "." and ".." are returned.
$search = FileFindFirstFile("*.tsv")
; Check if the search was successful
If $search = -1 Then
    MsgBox(0, "Error", "No files/directories matched the search pattern")
    Exit
EndIf
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;END FILE CONTROLS
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;DEMAND VALIDATION
While 1
    $file = FileFindNextFile($search)
    If @error Then ExitLoop
    ;Open File
    $File_Opened = FileOpen($file, 0)
    ; Check if file opened for reading OK
    If $File_Opened = -1 Then
        MsgBox(0, "Error", "Unable to open file.")
        Exit
    EndIf
    $Read_Line = 2
    While 1
        ;Read Line
        $line = FileReadLine($File_Opened, $Read_Line)
        ;Split the Manufacturer from the Manufacturer Item ID
        $line_parts = StringSplit($line, @TAB, 1)
        $ACCT_DT = $line_parts[18]
        $MTR_STATUS = $line_parts[66]
        $DESC254 = $line_parts[69]
        $STATUS = $line_parts[70]
        If $MTR_STATUS =;**INVALID CODES CANT SHOW YOU**; And $DESC254 <>;**INVALID CODES CANT SHOW YOU**; And $DESC254 <>;**INVALID CODES CANT SHOW YOU**; And $STATUS <>;**INVALID CODES CANT SHOW YOU**; Then
            FileWriteLine($Data_File, $line)
        EndIf
        ;Increment $Read_Line -- loop variable
        $Read_Line = $Read_Line + 1
        ;Reset Variables
        $line = ""
        $line_parts = ""
        $ACCT_DT = ""
        $NG_MTR_STATUS = ""
        $DESC254 = ""
        $STORMS_STATUS = ""
    Wend
    FileClose($File_Opened)
WEnd
FileClose($Data_File)
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;END DEMAND VALIDATION
```

The file is extracted from an extremely large transaction database that I don't have access to; otherwise this would be simple. As you can see from my $line_parts array, there are over 70 columns on each line, separated by tabs. I am not searching for a particular string, but making sure that certain parts of each line contain certain values. It would be nice to have it in a database, but the file is too large for the tools I have available; that's why I am trying to extract only the relevant data.

I'll try some of your other suggestions when I can find the time. Thank you all for your help!
PsaltyDS, posted August 31, 2007 (edited August 31, 2007)

First things first... This is opening the file for WRITE APPEND, not READ (just the comment's wrong, I guess):

```autoit
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;FILE CONTROLS
$Data_File_Path =;**CANT SHOW YOU THE FILE PATH OR NAME**;
$Data_File = FileOpen($Data_File_Path, 1)
; Check if file opened for reading OK
If $Data_File = -1 Then
    MsgBox(0, "Error", "Unable to open INF file.")
    Exit
EndIf
```

The next time you use FileOpen(), it is for READ because you used 0 for the parameter.

FileReadLine() will read the next line by default, and may be slowed down by specifying a line number, so DON'T do this:

```autoit
$Read_Line = 2
While 1
    ;Read Line
    $line = FileReadLine($File_Opened, $Read_Line)
    ; ...
    ;Increment $Read_Line -- loop variable
    $Read_Line = $Read_Line + 1
    ; ...
WEnd
```

Do it this way instead:

```autoit
FileReadLine($File_Opened); Reads/skips first line
While 1
    ;Read Line
    $line = FileReadLine($File_Opened); Reads next line
    ; ...
    ; ...
WEnd
```
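
Filling in the elided parts of that last loop, a complete sequential version could look something like the sketch below. The file names are hypothetical placeholders and the actual field checks are left out, since the real paths and codes were not posted. It also adds two guards that the posted inner loop does not show: an `@error` test so the loop ends at end of file, and a field-count check before indexing into the split array.

```autoit
; Sketch only: file names are hypothetical; the real field checks are not shown in the thread.
Local $hIn = FileOpen("input.tsv", 0)     ; 0 = read mode
Local $hOut = FileOpen("output.tsv", 1)   ; 1 = write mode, append to end of file
If $hIn = -1 Or $hOut = -1 Then
    MsgBox(0, "Error", "Unable to open input or output file.")
    Exit
EndIf

FileReadLine($hIn)                        ; read and discard the first (header) line
While 1
    Local $sLine = FileReadLine($hIn)     ; no line number: continues from the current position
    If @error Then ExitLoop               ; -1 = end of file, 1 = read error
    Local $aParts = StringSplit($sLine, @TAB)
    If $aParts[0] < 70 Then ContinueLoop  ; skip blank or short lines instead of crashing on $aParts[70]
    ; ... compare $aParts[66], $aParts[69], $aParts[70] against the required codes here ...
    FileWriteLine($hOut, $sLine)
WEnd
FileClose($hIn)
FileClose($hOut)
```

Because the read position simply advances, each line is read once, so the total work grows linearly with the file size rather than with the square of the line count.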