Reading LARGE txt file


ok, so i wrote a program that reads through a large text file (5 million+ lines, 3.2 gigs :) ) and extracts valid lines based on criteria.

the problem is that it is slooooooooooooooooow. i need to get this done for work, and telling them that a machine running for 2 weeks straight won't even get through the file is not acceptable.

so i'm asking if anyone knows of a quicker way to get this done?

i use FileOpen and FileReadLine, split the string, check a couple of fields and then either write the line to another file or move on...

;)


Never use FileReadLine() with the "line" parameter if speed is the main criterion!

Use FileOpen() and FileRead(), or look at the APIFileReadWrite UDF.

From a performance standpoint it is a bad idea to read line by line while specifying a "line" parameter whose value increments by one. This forces AutoIt to reread the file from the beginning until it reaches the specified line.

For performance it is better to read a big chunk of data from the file into memory in one read. Instead of 5,000,000 slow file-read operations (one read per line), read the 3.2 GB as, for example, 3,200 chunks of 1 MB each, and parse the lines of each chunk in memory.
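A rough sketch of the chunked approach, in case it helps. The file names, the chunk size and the _KeepLine() function are placeholders made up for illustration (they are not from the original script), and the only tricky part is carrying the incomplete last line of each chunk over to the next read:

```autoit
; Sketch only: paths, chunk size and _KeepLine() are placeholders, not from the thread.
Local Const $CHUNK_CHARS = 1048576 ; about 1 MB of ANSI text per read
Local $hIn = FileOpen("input.tsv", 0)   ; 0 = read
Local $hOut = FileOpen("output.tsv", 2) ; 2 = write, erasing old contents
If $hIn = -1 Or $hOut = -1 Then Exit 1

Local $sRest = "" ; incomplete line carried over from the previous chunk
Local $sChunk, $bDone, $iLast, $aLines
While 1
    $sChunk = FileRead($hIn, $CHUNK_CHARS)
    $bDone = (@error <> 0) ; non-zero @error: end of file or read failure
    $sChunk = $sRest & $sChunk
    $iLast = StringInStr($sChunk, @LF, 0, -1) ; position of the last line feed
    If $iLast > 0 Then
        $sRest = StringMid($sChunk, $iLast + 1) ; tail after the last complete line
        $aLines = StringSplit(StringStripCR(StringLeft($sChunk, $iLast - 1)), @LF)
        For $i = 1 To $aLines[0]
            If _KeepLine($aLines[$i]) Then FileWriteLine($hOut, $aLines[$i])
        Next
    Else
        $sRest = $sChunk ; no complete line yet
    EndIf
    If $bDone Then ExitLoop
WEnd
If _KeepLine($sRest) Then FileWriteLine($hOut, $sRest) ; last line without a trailing newline
FileClose($hIn)
FileClose($hOut)

; Hypothetical filter: replace with the real field checks.
Func _KeepLine($sLine)
    Return StringLen($sLine) > 0
EndFunc
```

The chunk size is a guess; larger chunks mean fewer reads but bigger strings to split, so it is worth timing a few values on a sample of the real file.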

EDIT: correction about line parameter and big chunk reading

Edited by Zedna

Seriously, this is such a huge file that I would suggest using a dedicated tool like TextPipe Engine [http://pcwin.com/Software_Development/TextPipe_Engine/index.htm] or something similar. From the page description it seems you could also easily script this tool from AutoIt if needed (it has command-line support). I did not find any freeware/open-source alternative during my brief search, but this should be possible!

Of course.. if this could be done in AutoIt it would be GREAT.

Sunaj

EDIT: removed wrong data

EDIT2: http://gnuwin32.sourceforge.net/packages/gawk.htm

Edited by Sunaj

Hi,

Wrapping DOS findstr may save a lot of hard work for a once-off problem,

e.g.:

#include<Array.au3>
;~ #include"DOSComs.au3"
Dim $s_AnswerFile
$s_file=FileOpenDialog("Choose",@ScriptDir,"Files (*.txt)")
;$s_Searches="*line*.*"
;FINDSTR /B/R/I/N /c:"vb.*pt" "C:\Program Files\Macro Express3\hta4\DateForm4.vbs">"C:\Program Files\Macro Express3\Answer.txt"
;~ $s_Searches="li"
$s_Searches="line"
;~ $s_Switches="/M "
$s_Switches="/N "
$t = TimerInit()
;Syntax; _DOSFindLines($s_file,$s_AnswerFile,$s_Searches,$s_Case,$s_Array,$i_Literal); $s_AnswerFile ByRef needs not be defined
;$ar_ArrayList=_FindLinesDOS2($s_file,$s_AnswerFile,$s_Searches,0,0,0); AnswerFile ByRef
$ar_ArrayList=_FindFiles_LinesDOS2($s_file,$s_AnswerFile,$s_Searches,$s_Switches,0,1,0); AnswerFile ByRef
ConsoleWrite(Timerdiff($t) & @CRLF)
If IsArray($ar_ArrayList) And UBound($ar_ArrayList) < 2000 Then
    _ArrayDisplay($ar_ArrayList, "Answer Array")
Else
    RunWait("Notepad.exe " & $s_AnswerFile, @ScriptDir, @SW_SHOW)
EndIf
Exit
Func _FindFiles_LinesDOS2(ByRef $s_file, ByRef $s_AnswerFile, $s_Searches,$s_Switches="", $i_Case = 0, $i_Array = 0, $i_Literal = 0)
    ;Syntax; _DOSFindLines($s_file,$s_AnswerFile,$s_Searches,$i_Literal); $s_AnswerFile ByRef needs not be defined
    ; Parameters; $i_Literal=1 implies spaces are delimiters for a number of search strings in $s_Searches, rather than spaces included in search
    Local $asList
    If FileExists($s_file) Then
        If $i_Literal = 1 Then
            $i_Literal = "/c:"
        Else
            $i_Literal = ""
        EndIf
        If $i_Case = 1 Then
            $i_Case = "/I"
        Else
            $i_Case = ""
        EndIf
        $Position = StringInStr($s_file, "\", 0, -1)
        $s_Path = StringLeft($s_file, $Position)
        FileChangeDir($s_Path)
        $s_AnswerFile = $s_Path & "AnswerFindLines.txt"
        $s_Searches = StringReplace($s_Searches, ".", "\."); to use as RegExp in "FindLines"
        $s_Searches = StringReplace($s_Searches, "*", ".*"); to use as RegExp in "FindLines"
        $s_Searches = '"' & $s_Searches & '"'
        If StringInStr($s_file, " ") Then $s_file = '"' & $s_file & '"'
        If StringInStr($s_AnswerFile, " ") Then $s_AnswerFile = '"' & $s_AnswerFile & '"'
        ; Set the Command and Run Dos
        ; findstr reads the file named on its command line, so no "type |" is needed; /c: must directly precede the search string
        $s_Command = 'findstr /R' & $i_Case & ' ' & $s_Switches & $i_Literal & $s_Searches & ' ' & $s_file & ' > ' & $s_AnswerFile
        MsgBox(0,"","$s_Command="&$s_Command)
        ConsoleWrite($s_Command)
        RunWait(@ComSpec & " /c " & $s_Command, @ScriptDir, @SW_HIDE)
        
        $s_AnswerFile = StringReplace($s_AnswerFile, '"', '')
        $s_file = StringReplace($s_file, '"', "")
        If $i_Array Then
            ;FileReadToArray($s_AnswerFile, $asList)
            $sList = FileRead($s_AnswerFile, FileGetSize($s_AnswerFile))
            $sList = StringTrimRight(StringReplace($sList, @CRLF, @LF), 1)
            $asList = StringSplit($sList, @LF)
        EndIf
    Else
        SetError(1)
    EndIf
    Return $asList
EndFunc   ;==>_FindFiles_LinesDOS2
Best, randall

PS there are a lot of switches and memory options here in dos, from memory; haven't looked at it for a while

Edited by randallc

ok, i've been busy and haven't had time to work on this project the past couple of days... here's what i can show you of what i'm trying to get done:

Opt("WinWaitDelay",100)
Opt("WinTitleMatchMode",4)
Opt("WinDetectHiddenText",1)
Opt("MouseCoordMode",0)
#include <file.au3>
#include <Array.au3>
#include <Date.au3>

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;FILE CONTROLS
$Data_File_Path =;**CANT SHOW YOU THE FILE PATH OR NAME**;
$Data_File = FileOpen($Data_File_Path, 1)

; Check if file opened for reading OK
If $Data_File = -1 Then
    MsgBox(0, "Error", "Unable to open INF file.")
    Exit
EndIf

$File_Path =;**CANT SHOW YOU THE FILE PATH OR NAME**;

; Shows the filenames of all files in the current directory, note that "." and ".." are returned.
$search = FileFindFirstFile("*.tsv")  

; Check if the search was successful
If $search = -1 Then
    MsgBox(0, "Error", "No files/directories matched the search pattern")
    Exit
EndIf
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;END FILE CONTROLS
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;DEMAND VALIDATION
While 1
    $file = FileFindNextFile($search) 
    If @error Then ExitLoop
        
    ;Open File
        $File_Opened = FileOpen($file, 0)
        
    ; Check if file opened for reading OK
        If $File_Opened = -1 Then
            MsgBox(0, "Error", "Unable to open file.")
            Exit
        EndIf
        
                    $Read_Line = 2          
            While 1
                                       
            ;Read Line
                $line = FileReadLine($File_Opened, $Read_Line)  
                
            ;Split the Manufacturer from the Manufacturer Item ID
                $line_parts = StringSplit($line, @TAB, 1)
                $ACCT_DT = $line_parts[18]
                $MTR_STATUS = $line_parts[66]
                $DESC254 = $line_parts[69]
                $STATUS = $line_parts[70]
                                
                If  $MTR_STATUS =;**INVALID CODES CANT SHOW YOU**; And $DESC254 <>;**INVALID CODES CANT SHOW YOU**; And $DESC254 <>;**INVALID CODES CANT SHOW YOU**; And $STATUS <>;**INVALID CODES CANT SHOW YOU**; Then
                    FileWriteLine($Data_File, $line)
                                        
                EndIf
                
                
            ;Increment $Read_Line  -- loop variable
                $Read_Line = $Read_Line + 1
                
            ;Reset Variables
                $line = ""
                $line_parts = ""
                $ACCT_DT = ""
                $NG_MTR_STATUS = ""
                $DESC254 = ""
                $STORMS_STATUS = ""
                
            Wend
    FileClose($File_Opened)

WEnd

FileClose($Data_File)
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;END DEMAND VALIDATION

the file is extracted from an extremely large transaction database i dont have access to, otherwise this would be simple.. as you can see by my $line_parts array there are over 70 columns on each line, separated by tabs

i am not searching for a particular string, but making sure that certain parts of each line contain certain values...

it would be nice to have it in a database, but the file is too large for the tools i have available; that's why i am trying to extract only the relevant data..

ill try some of your other suggestions when i can get the time, thank you all for your help!!!!!


First things first... This is opening the file for WRITE APPEND not READ (just the comment's wrong I guess):

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;FILE CONTROLS
$Data_File_Path =;**CANT SHOW YOU THE FILE PATH OR NAME**;
$Data_File = FileOpen($Data_File_Path, 1)

; Check if file opened for reading OK
If $Data_File = -1 Then
    MsgBox(0, "Error", "Unable to open INF file.")
    Exit
EndIf

The next time you call FileOpen(), it is for READ because you used 0 for the mode parameter.

FileReadLine() will read the next line by default, and may be slowed down by specifying a line number, so DON'T do this:

$Read_Line = 2
While 1
    ;Read Line
    $line = FileReadLine($File_Opened, $Read_Line)
    ; ...
    ;Increment $Read_Line -- loop variable
    $Read_Line = $Read_Line + 1
    ; ...
WEnd

Do that this way instead:

FileReadLine($File_Opened) ; Reads/skips first line
While 1
    ; Read Line
    $line = FileReadLine($File_Opened) ; Reads next line
    If @error = -1 Then ExitLoop ; End of file reached, leave the loop

    ; ...

    ; ...
WEnd

:)
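Put together with the field checks, the sequential version might look something like the sketch below. The file names and the status codes ("GOOD", "BAD1", "BAD2", "BAD3") are stand-ins I made up, since the real ones were not posted; the column numbers 66, 69 and 70 are taken from the posted script. It also guards against short or blank lines, which would otherwise cause an array subscript error:

```autoit
; Sketch only: the file names and the status codes are placeholders.
Local $hIn = FileOpen("input.tsv", 0)   ; 0 = read
Local $hOut = FileOpen("output.tsv", 1) ; 1 = write, appending to the end
If $hIn = -1 Or $hOut = -1 Then Exit 1

Local $sLine, $aParts
FileReadLine($hIn) ; read and discard the first (header) line
While 1
    $sLine = FileReadLine($hIn) ; no line number: continues from the current position
    If @error = -1 Then ExitLoop ; end of file
    $aParts = StringSplit($sLine, @TAB) ; $aParts[0] holds the number of fields
    If $aParts[0] < 70 Then ContinueLoop ; skip short or blank lines
    ; "=" and "<>" compare strings case-insensitively in AutoIt
    If $aParts[66] = "GOOD" And $aParts[69] <> "BAD1" And $aParts[69] <> "BAD2" And $aParts[70] <> "BAD3" Then
        FileWriteLine($hOut, $sLine)
    EndIf
WEnd
FileClose($hIn)
FileClose($hOut)
```

There is no need to reset the variables at the end of each pass, since every one of them is assigned again before it is used.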

Edited by PsaltyDS
