squishlax

Reading LARGE txt file

9 posts in this topic

ok, so i wrote a program that reads through a large text file (5 million+ lines, 3.2 gigs :) ) and extracts valid lines based on criteria.

the problem is it is slooooooooooooooooow and i need to get this done for work and telling them that letting a machine run for 2 weeks straight wont even get through it is not acceptable.

so im asking if anyone knows of a quicker way to get this done?

i use FileOpen and FileReadLine, split the string, check a couple of fields and then either write the line to another file or move on...

;)


#2 ·  Posted (edited)

Never use FileReadLine() with the "line" parameter if speed is the main criterion!!

Use FileOpen() and FileRead(), or look at the APIFileReadWrite UDF.

From a performance standpoint it is a bad idea to read line by line while specifying a "line" parameter whose value increments by one each time. This forces AutoIt to reread the file from the beginning until it reaches the specified line, so reading line N costs N line scans. Over 5 million lines that adds up to roughly 12.5 trillion line scans in total.

For performance it is better to read a big chunk of data from the file into memory in one read. Instead of 5,000,000 slow file-read operations (one read per line), read the 3.2 GB as, for example, about 3,200 chunks of 1 MB each, and parse the lines of each chunk in memory.
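A minimal sketch of the chunked approach described above, assuming a plain ANSI text file. The file names, the `_KeepLine()` helper, the field index and the "ACTIVE" value are all hypothetical placeholders, not anything from the original script. The one subtle point is that a chunk usually ends in the middle of a line, so the incomplete tail has to be carried over and prepended to the next chunk.

```autoit
; Sketch only: paths, _KeepLine() and its test value are hypothetical.
Local Const $CHUNK_SIZE = 1024 * 1024 ; characters per read (about 1 MB for ANSI text)

Local $hIn = FileOpen("input.tsv", 0)  ; 0 = read mode
Local $hOut = FileOpen("valid.tsv", 2) ; 2 = write mode (erases previous contents)
If $hIn = -1 Or $hOut = -1 Then Exit 1

Local $sChunk, $sBuffer, $iLastLF, $aLines
Local $sCarry = "" ; incomplete line left over from the previous chunk

While 1
    $sChunk = FileRead($hIn, $CHUNK_SIZE)
    If StringLen($sChunk) = 0 Then ExitLoop ; nothing more to read (EOF or error)

    $sBuffer = $sCarry & $sChunk
    $iLastLF = StringInStr($sBuffer, @LF, 0, -1) ; position of the last line feed
    If $iLastLF = 0 Then
        $sCarry = $sBuffer ; no complete line in the buffer yet
        ContinueLoop
    EndIf

    ; Everything after the last line feed is an incomplete line: keep it for later.
    $sCarry = StringTrimLeft($sBuffer, $iLastLF)

    ; Split the complete lines in memory (CRs stripped so CRLF files work too).
    $aLines = StringSplit(StringStripCR(StringLeft($sBuffer, $iLastLF - 1)), @LF)
    For $i = 1 To $aLines[0]
        If _KeepLine($aLines[$i]) Then FileWriteLine($hOut, $aLines[$i])
    Next
WEnd

; The file may not end with a line feed, so handle the final partial line.
$sCarry = StringStripCR($sCarry)
If $sCarry <> "" And _KeepLine($sCarry) Then FileWriteLine($hOut, $sCarry)

FileClose($hIn)
FileClose($hOut)

; Hypothetical filter: keep lines whose 66th tab-separated field matches a value.
Func _KeepLine($sLine)
    Local $aFields = StringSplit($sLine, @TAB)
    If $aFields[0] < 66 Then Return False ; short or malformed line
    Return ($aFields[66] = "ACTIVE")
EndFunc   ;==>_KeepLine
```

The chunk size is a tuning knob rather than a fixed rule; larger chunks mean fewer reads but bigger strings to split in memory.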

EDIT: correction about line parameter and big chunk reading

Edited by Zedna


You could use FINDSTR and pipe it out to a new file.


#4 ·  Posted (edited)

Seriously, this is such a huge file that I would suggest using a dedicated tool like TextPipe Engine [http://pcwin.com/Software_Development/TextPipe_Engine/index.htm] or something similar - from the page description it seems you could also easily script this tool from AutoIt in case needed (command line support). I did not find any freeware/open source alternative during my brief search but this should be possible!

Of course.. if this could be done in AutoIt it would be GREAT.

Sunaj

EDIT: removed wrong data

EDIT2: http://gnuwin32.sourceforge.net/packages/gawk.htm


Edited by Sunaj


@Sunaj

I think all these external tools must use the same file read API functions internally.

So it would be more effective and simpler to call these APIs directly through DllCall, as is done in the APIFileReadWrite UDF.


#7 ·  Posted (edited)

Hi,

Wrapping DOS findstr may save a lot of hard work for a once-off problem,

eg;

#include<Array.au3>
;~ #include"DOSComs.au3"
Dim $s_AnswerFile
$s_file=FileOpenDialog("Choose",@ScriptDir,"Files (*.txt)")
;$s_Searches="*line*.*"
;FINDSTR /B/R/I/N /c:"vb.*pt" "C:\Program Files\Macro Express3\hta4\DateForm4.vbs">"C:\Program Files\Macro Express3\Answer.txt"
;~ $s_Searches="li"
$s_Searches="line"
;~ $s_Switches="/M "
$s_Switches="/N "
$t = TimerInit()
;Syntax; _DOSFindLines($s_file,$s_AnswerFile,$s_Searches,$s_Case,$s_Array,$i_Literal); $s_AnswerFile ByRef needs not be defined
;$ar_ArrayList=_FindLinesDOS2($s_file,$s_AnswerFile,$s_Searches,0,0,0); AnswerFile ByRef
$ar_ArrayList=_FindFiles_LinesDOS2($s_file,$s_AnswerFile,$s_Searches,$s_Switches,0,1,0); AnswerFile ByRef
ConsoleWrite(Timerdiff($t) & @CRLF)
If IsArray($ar_ArrayList) And UBound($ar_ArrayList) < 2000 Then
    _ArrayDisplay($ar_ArrayList, "Answer Array")
Else
    RunWait("Notepad.exe " & $s_AnswerFile, @ScriptDir, @SW_SHOW)
EndIf
Exit
Func _FindFiles_LinesDOS2(ByRef $s_file, ByRef $s_AnswerFile, $s_Searches,$s_Switches="", $i_Case = 0, $i_Array = 0, $i_Literal = 0)
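    ;Syntax; _FindFiles_LinesDOS2($s_file,$s_AnswerFile,$s_Searches,$s_Switches,$i_Case,$i_Array,$i_Literal); $s_AnswerFile ByRef needs not be defined
    ; Parameters; $i_Case=1 adds /I (case-insensitive search); $i_Array=1 returns the matching lines as an array
    ; $i_Literal=1 passes $s_Searches with /c:, so spaces are part of one search string; with 0, spaces separate multiple search strings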
    Local $asList
    If FileExists($s_file) Then
        If $i_Literal = 1 Then
            $i_Literal = "/c:"
        Else
            $i_Literal = ""
        EndIf
        If $i_Case = 1 Then
            $i_Case = "/I"
        Else
            $i_Case = ""
        EndIf
        $Position = StringInStr($s_file, "\", 0, -1)
        $s_Path = StringLeft($s_file, $Position)
        FileChangeDir($s_Path)
        $s_AnswerFile = $s_Path & "AnswerFindLines.txt"
        $s_Searches = StringReplace($s_Searches, ".", "\."); to use as RegExp in "FindLines"
        $s_Searches = StringReplace($s_Searches, "*", ".*"); to use as RegExp in "FindLines"
        $s_Searches = '"' & $s_Searches & '"'
        If StringInStr($s_file, " ") Then $s_file = '"' & $s_file & '"'
        If StringInStr($s_AnswerFile, " ") Then $s_AnswerFile = '"' & $s_AnswerFile & '"'
        ; Set the Command and Run Dos
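        ; findstr reads the file itself, so nothing needs to be piped in; /c: must sit directly in front of the search string
        $s_Command = 'findstr /R ' & $i_Case & ' ' & $s_Switches & $i_Literal & $s_Searches & ' ' & $s_file & ' > ' & $s_AnswerFile
        ;MsgBox(0,"","$s_Command="&$s_Command); debug only, blocks until dismissed and distorts the timing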
        ConsoleWrite($s_Command)
        RunWait(@ComSpec & " /c " & $s_Command, @ScriptDir, @SW_HIDE)
        
        $s_AnswerFile = StringReplace($s_AnswerFile, '"', '')
        $s_file = StringReplace($s_file, '"', "")
        If $i_Array Then
            ;FileReadToArray($s_AnswerFile, $asList)
            $sList = FileRead($s_AnswerFile, FileGetSize($s_AnswerFile))
            $sList = StringTrimRight(StringReplace($sList, @CRLF, @LF), 1)
            $asList = StringSplit($sList, @LF)
        EndIf
    Else
        SetError(1)
    EndIf
    Return $asList
EndFunc   ;==>_FindFiles_LinesDOS2
Best, randall

PS there are a lot of switches and memory options here in dos, from memory; haven't looked at it for a while

Edited by randallc


ok, ive been busy and havent had time to work on this project the past couple days... heres what i can show you of what im trying to get done:

Opt("WinWaitDelay",100)
Opt("WinTitleMatchMode",4)
Opt("WinDetectHiddenText",1)
Opt("MouseCoordMode",0)
#include <file.au3>
#include <Array.au3>
#include <Date.au3>

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;FILE CONTROLS
$Data_File_Path =;**CANT SHOW YOU THE FILE PATH OR NAME**;
$Data_File = FileOpen($Data_File_Path, 1)

; Check if file opened for reading OK
If $Data_File = -1 Then
    MsgBox(0, "Error", "Unable to open INF file.")
    Exit
EndIf

$File_Path =;**CANT SHOW YOU THE FILE PATH OR NAME**;

; Shows the filenames of all files in the current directory, note that "." and ".." are returned.
$search = FileFindFirstFile("*.tsv")  

; Check if the search was successful
If $search = -1 Then
    MsgBox(0, "Error", "No files/directories matched the search pattern")
    Exit
EndIf
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;END FILE CONTROLS
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;DEMAND VALIDATION
While 1
    $file = FileFindNextFile($search) 
    If @error Then ExitLoop
        
    ;Open File
        $File_Opened = FileOpen($file, 0)
        
    ; Check if file opened for reading OK
        If $File_Opened = -1 Then
            MsgBox(0, "Error", "Unable to open file.")
            Exit
        EndIf
        
                    $Read_Line = 2          
            While 1
                                       
            ;Read Line
                $line = FileReadLine($File_Opened, $Read_Line)  
                
            ;Split the Manufacturer from the Manufacturer Item ID
                $line_parts = StringSplit($line, @TAB, 1)
                $ACCT_DT = $line_parts[18]
                $MTR_STATUS = $line_parts[66]
                $DESC254 = $line_parts[69]
                $STATUS = $line_parts[70]
                                
                If  $MTR_STATUS =;**INVALID CODES CANT SHOW YOU**; And $DESC254 <>;**INVALID CODES CANT SHOW YOU**; And $DESC254 <>;**INVALID CODES CANT SHOW YOU**; And $STATUS <>;**INVALID CODES CANT SHOW YOU**; Then
                    FileWriteLine($Data_File, $line)
                                        
                EndIf
                
                
            ;Increment $Read_Line  -- loop variable
                $Read_Line = $Read_Line + 1
                
            ;Reset Variables
                $line = ""
                $line_parts = ""
                $ACCT_DT = ""
                $NG_MTR_STATUS = ""
                $DESC254 = ""
                $STORMS_STATUS = ""
                
            Wend
    FileClose($File_Opened)

WEnd

FileClose($Data_File)
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;END DEMAND VALIDATION

the file is extracted from an extremely large transaction database i dont have access to, otherwise this would be simple.. as you can see from my $line_parts array, there are over 70 columns on each line, separated by tabs

i am not searching for a particular string, but making sure that certain fields of each line contain certain values...

it would be nice to have it in a database, but the file is too large for the tools i have available. thats why i am trying to extract only the relevant data..

ill try some of your other suggestions when i can get the time, thank you all for your help!!!!!


#9 ·  Posted (edited)

First things first... This is opening the file for WRITE APPEND not READ (just the comment's wrong I guess):

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;FILE CONTROLS
$Data_File_Path =;**CANT SHOW YOU THE FILE PATH OR NAME**;
$Data_File = FileOpen($Data_File_Path, 1)

; Check if file opened for reading OK
If $Data_File = -1 Then
    MsgBox(0, "Error", "Unable to open INF file.")
    Exit
EndIf

The next time you call FileOpen(), it is for READ because you used 0 for the mode parameter.

FileReadLine() reads the next line by default, and may be slowed down by specifying a line number, so DON'T do this:

$Read_Line = 2
While 1
    ;Read Line
    $line = FileReadLine($File_Opened, $Read_Line)

    ; ...

    ;Increment $Read_Line -- loop variable
    $Read_Line = $Read_Line + 1

    ; ...

WEnd

Do that this way instead:

FileReadLine($File_Opened) ; Reads/skips first line
While 1
    ;Read Line
    $line = FileReadLine($File_Opened) ; Reads next line
    If @error Then ExitLoop ; -1 = end of file, 1 = read error

    ; ...

WEnd
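Filled out into a complete loop, the sequential version might look like the sketch below. It is only an illustration: the file names and the "ACTIVE" comparison are hypothetical stand-ins for the path and codes that were not shown, and the field-count check is an added assumption so that a short or malformed line cannot trigger an array subscript error.

```autoit
; Sketch only: file names and the "ACTIVE" value are hypothetical placeholders.
Local $hIn = FileOpen("input.tsv", 0)  ; 0 = read mode
Local $hOut = FileOpen("valid.tsv", 1) ; 1 = write mode (append)
If $hIn = -1 Or $hOut = -1 Then Exit 1

FileReadLine($hIn) ; read and discard the header line

Local $sLine, $aParts
While 1
    $sLine = FileReadLine($hIn) ; no line number: continues from the current position
    If @error Then ExitLoop     ; end of file or read error

    $aParts = StringSplit($sLine, @TAB)
    If $aParts[0] < 70 Then ContinueLoop ; skip lines with too few columns

    ; Replace this test with the real validity checks on the fields.
    If $aParts[66] = "ACTIVE" Then FileWriteLine($hOut, $sLine)
WEnd

FileClose($hIn)
FileClose($hOut)
```

Because each line is read exactly once, the cost grows linearly with the number of lines instead of with its square.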

:)

Edited by PsaltyDS

