KeeWay

Split Text Files Speed Help

I have a basic script that reads a text file, takes the text between a start and an end string, and writes it to a separate file. Currently my small test file takes about 1 minute to process 22k lines. Can anyone show or recommend a way to speed up the routine? The files may get upwards of 100k lines.

Here is what I have working so far:

#include <File.au3> ; needed for _FileReadToArray

Global $FileArray
$file = "C:\SBT\Incoming\SBTFILE.txt"
$newfile = "C:\SBT\Work\SBTFILE-"
$filecount = 0
_FileReadToArray($file, $FileArray)

ProgressOn("Processing SBT File", "Reading The File...", "0 Lines")
For $i = 1 To $FileArray[0]

    If StringInStr($FileArray[$i], "ISA*") Then
        ; A line containing ISA* starts the next output file
        ;MsgBox(0, " ", $FileArray[$i])
        $filecount = $filecount + 1
        FileWriteLine($newfile & $filecount & ".txt", $FileArray[$i] & @CRLF)
    Else
        FileWriteLine($newfile & $filecount & ".txt", $FileArray[$i] & @CRLF)
    EndIf

    $Percent = Int(($i / $FileArray[0]) * 100)
    ProgressSet($Percent, $Percent & " Percent Complete")
Next
ProgressSet(100, "Done", "Complete")
Sleep(1000)
ProgressOff()

Thanks

James


KeeWay,

One of the reasons this might be slow is the fact that you use FileWriteLine with a file name. This opens and closes the file each time you use it - and for a 22k line file that is a lot of opening and closing.

I have rewritten the code so that it stores each ISA* line and all lines until the next ISA* line (which is what I understood you wanted to have in each new file) in a string, which we write to file whenever a new ISA* line is found:

#include <File.au3>

Global $FileArray
$file = @ScriptDir & "\test.txt" ;"C:\SBT\Incoming\SBTFILE.txt"
$newfile = @ScriptDir & "\split-" ; "C:\SBT\Work\SBTFILE-"

_FileReadToArray($file, $FileArray)

$filecount = 0
$sNewFile_Text = ""

ProgressOn("Processing SBT File", "Reading The File...", "0 Lines")

For $i = 1 To $FileArray[0]

    If StringInStr($FileArray[$i], "ISA*") Then
        ; We need to start a new file
        ; So write the existing one unless we have yet to start
        If $filecount > 0 Then FileWrite($newfile & $filecount & ".txt", $sNewFile_Text)
        ; And start a new one
        $filecount = $filecount + 1
        $sNewFile_Text = $FileArray[$i] & @CRLF
    Else
        $sNewFile_Text &= $FileArray[$i] & @CRLF
    EndIf

    $Percent = Int(($i / $FileArray[0]) * 100)
    ProgressSet($Percent, $Percent & " Percent Complete")

Next

; Now write the final file!
FileWrite($newfile & $filecount & ".txt", $sNewFile_Text)

ProgressSet(100, "Done", "Complete")
Sleep(1000)
ProgressOff()

It works fine on my short 20 line test file - I am not going to write a 22k one, so please test it on a file of the size you want. :huggles: Over to you!

I hope it helps - come back if it does not do what you want. :D
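
For reference, another way to avoid the repeated open/close is to open each output file once with FileOpen and pass the returned handle to FileWriteLine. What follows is only a minimal sketch of that idea, assuming the same test file and "split-" prefix as the script above; the $hOut variable is introduced here purely for illustration, and the sketch has not been timed against the string version.

```autoit
#include <File.au3>

Global $FileArray
$file = @ScriptDir & "\test.txt"
$newfile = @ScriptDir & "\split-"

_FileReadToArray($file, $FileArray)

$filecount = 0
$hOut = -1 ; handle of the output file currently open (-1 = none yet)

For $i = 1 To $FileArray[0]
    If StringInStr($FileArray[$i], "ISA*") Then
        ; Close the previous output file and open the next one
        If $hOut <> -1 Then FileClose($hOut)
        $filecount += 1
        $hOut = FileOpen($newfile & $filecount & ".txt", 2) ; 2 = write mode, erase previous contents
    EndIf
    ; Lines before the first ISA* line are not written to any file
    If $hOut <> -1 Then FileWriteLine($hOut, $FileArray[$i])
Next

If $hOut <> -1 Then FileClose($hOut)
```

One difference to be aware of: opening with mode 2 overwrites an existing split file, whereas FileWrite with a file name appends to it.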

m23


Any of my own code posted anywhere on the forum is available for use by others without any restriction of any kind.

My UDFs:

ArrayMultiColSort ---- Sort arrays on multiple columns
ChooseFileFolder ---- Single and multiple selections from specified path treeview listing
Date_Time_Convert -- Easily convert date/time formats, including the language used
ExtMsgBox --------- A highly customisable replacement for MsgBox
GUIExtender -------- Extend and retract multiple sections within a GUI
GUIFrame ---------- Subdivide GUIs into many adjustable frames
GUIListViewEx ------- Insert, delete, move, drag, sort, edit and colour ListView items
GUITreeViewEx ------ Check/clear parent and child checkboxes in a TreeView
Marquee ----------- Scrolling tickertape GUIs
NoFocusLines ------- Remove the dotted focus lines from buttons, sliders, radios and checkboxes
Notify ------------- Small notifications on the edge of the display
Scrollbars ----------Automatically sized scrollbars with a single command
StringSize ---------- Automatically size controls to fit text
Toast -------------- Small GUIs which pop out of the notification area

 



That worked out great, thanks :D It cut the same file to about 8 sec. I am still new to AutoIt and to working with arrays. I see that you just wrote the found array to the file at once. I guess since I don't have a clear understanding of the whole array stuff, I thought I had to write it out one line at a time.

Anyway, thanks again, this is great.

Share this post


Link to post
Share on other sites

KeeWay,

i see that you just wrote the found array to the file at once

Not quite. What I did was read the array created with _FileReadToArray line by line and then save the relevant lines into one long string - which I then wrote to a file in one pass with FileWrite when I wanted to start a new file. I created no arrays, although it could be done that way if you wanted to. :huggles:
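
For what it is worth, a minimal sketch of the array-based route might look like the following. It is an illustration only: it assumes "ISA*" only ever occurs at the very start of a line (StringSplit cuts wherever the text appears), it reads the whole file into memory with FileRead, and the paths are the same test placeholders as before. The names $sText, $aParts and $hOut are made up for this sketch.

```autoit
$file = @ScriptDir & "\test.txt"
$newfile = @ScriptDir & "\split-"

; Read the whole file as one string
$sText = FileRead($file)

; Flag 1 = treat "ISA*" as one whole delimiter rather than as separate characters.
; $aParts[0] is the number of pieces; $aParts[1] is whatever came before the first "ISA*".
$aParts = StringSplit($sText, "ISA*", 1)

For $i = 2 To $aParts[0]
    $hOut = FileOpen($newfile & ($i - 1) & ".txt", 2) ; 2 = write mode, erase previous contents
    ; StringSplit removes the delimiter, so put it back at the top of each piece
    FileWrite($hOut, "ISA*" & $aParts[$i])
    FileClose($hOut)
Next
```

If the file contains no "ISA*" at all, $aParts[0] is 1 and the loop simply does not run.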

M23

P.S. When you reply please use the "Add Reply" button at the top and bottom of the page rather than the "Reply" button in the post itself. That way you do not get the contents of the previous post quoted in your reply and the whole thread becomes easier to read. :D



Melba23, thanks, I see now :D

Sorry, I will remember to hit the correct reply button next time.

thanks again

James


I know this is an old topic, but the script here works extremely well, so much so that it trumps all 3rd-party text-file splitters (and there are a lot of them.) But is it possible to make a modification to this script so that it would take the specified search term (in the example, it is ISA*) and put it at the BOTTOM of the split files? Right now, it puts the search term at the top. The text splitting that I am doing is such that I need the word boundary to be at the bottom of each of the split files.

To clarify: the text boundary that I am using is ****** END OF REPORT ******

The current script produces this in each of the split pages:

****** END OF REPORT ******

(the rest of the text in the file)

What I need is for the script to accomplish the following in each of the split files:

(text in the file)

****** END OF REPORT ******

Basically, I need the delimited line to appear at the bottom of each of the split pages. The script would hunt for the delimited term, and when it finds it, it would split the file right there, with the delimited line at the bottom. Then it would continue through the text document for all other instances of the ****** END OF REPORT ****** delimiter, split the file, put the delimiter at the bottom, etc. etc.

Thanks for any help that someone can provide. I am most grateful already for this script, as it has saved me a lot of time in processing files; it can save me even more with this modification.

 

 


mjfoxtrot,

Welcome to the AutoIt forum. :)

Using this file:

Line 1                           | File 1
Line 2                           |
****** END OF REPORT ******
Line 4                           | File 2
Line 5                           |
****** END OF REPORT ******
Line 7                           | File 3
Line 8                           |
Line 9                           |
Line 10                          |
Line 11                          |
****** END OF REPORT ******
Line 13                          | File 4
Line 14                          |
Line 15                          |
****** END OF REPORT ******
Line 17
Line 18
Line 19

the following code splits it as indicated:

#include <File.au3>

Global $FileArray
$file = @ScriptDir & "\test.txt" ;"C:\SBT\Incoming\SBTFILE.txt"
$newfile = @ScriptDir & "\split-" ; "C:\SBT\Work\SBTFILE-"

_FileReadToArray($file, $FileArray)

$filecount = 1
$sNewFile_Text = ""

ProgressOn("Processing SBT File", "Reading The File...", "0 Lines")

For $i = 1 To $FileArray[0]

    If StringInStr($FileArray[$i], "****** END OF REPORT ******") Then
        ; We need to write the current file
        FileWrite($newfile & $filecount & ".txt", $sNewFile_Text & $FileArray[$i])
        ; And start a new one
        $filecount = $filecount + 1
        $sNewFile_Text = ""
    Else
        ; Add line to string
        $sNewFile_Text &= $FileArray[$i] & @CRLF
    EndIf

    $Percent = Int(($i / $FileArray[0]) * 100)
    ProgressSet($Percent, $Percent & " Percent Complete")

Next

ProgressSet(100, "Done", "Complete")
Sleep(1000)
ProgressOff()

Is that what you are looking for? If not then please post a test file showing how it should be split and I can modify the code. :)

M23


I made a script that splits a file in binary or normal mode. You can use an interface, or just integrate the function into another script. If you use the interface, you can drag a file into the first input control to get the file path, and drag a folder into the second input control to set the destination folder for the resulting files. In the third input control you can type a prefix that all the resulting files will share, and in the fourth the delimiter you want to use to split the file.
 
If you leave the 2nd and 3rd input controls blank, the program will fill in default values based on the given file path. If you have any suggestions, or find bugs, let me know. Here is the code:

#Region ;**** Directives created by AutoIt3Wrapper_GUI ****
#AutoIt3Wrapper_Run_Au3Stripper=y
#Au3Stripper_Parameters=/RM /SF = 1 /SV = 1 /PE
#EndRegion ;**** Directives created by AutoIt3Wrapper_GUI ****
Opt("MustDeclareVars",1)
;These includes are only needed for the interface.
#include <GUIConstantsEx.au3>
#include <WindowsConstants.au3>
#include <StaticConstants.au3>
;_Example1()
_Example2()
Func _Example1();Splitting a file without a user interface
    Local $FileName = ""
    Local $Folder = ""
    Local $Prefix = ""
    Local $Delimiter = "****** END OF REPORT ******"
    ;_SplitFileHex($FileName,$Folder,$Prefix,$Delimiter)
    _SplitFile($FileName,$Folder,$Prefix,$Delimiter)
EndFunc
Func _Example2();Splitting a file with a user interface
    Local $GUI = GUICreate("File Splitter",300,130,-1,-1,-1,$WS_EX_ACCEPTFILES)
    Local $Label = GUICtrlCreateLabel("File:",10,10,68,20,$SS_RIGHT)
    GUICtrlSetFont($Label,12)
    Local $Input = GUICtrlCreateInput("",88,10,202,20)
    Local $Label2 = GUICtrlCreateLabel("Folder:",10,40,68,20,$SS_RIGHT)
    GUICtrlSetFont($Label2,12)
    Local $Input2 = GUICtrlCreateInput("",88,40,202,20)
    Local $Label3 = GUICtrlCreateLabel("Prefix:",10,70,68,20,$SS_RIGHT)
    GUICtrlSetFont($Label3,12)
    Local $Input3 = GUICtrlCreateInput("",88,70,202,20)
    Local $Label4 = GUICtrlCreateLabel("Delimiter:",10,100,68,20,$SS_RIGHT)
    GUICtrlSetFont($Label4,12)
    Local $Input4 = GUICtrlCreateInput("",88,100,202,20)
    Local $NextButton = GUICtrlCreateButton("",0,0,0,0)
    GUICtrlSetState($NextButton,$GUI_HIDE)
    Local $GUIAccelerators[1][2] = [["{ENTER}",$NextButton]]
    GUISetAccelerators($GUIAccelerators,$GUI)
    GUICtrlSetState($Input,$GUI_DROPACCEPTED)
    GUICtrlSetState($Input2,$GUI_DROPACCEPTED)
    GUISetState()
    Local $File, $FileName, $Prefix, $Delimiter, $A, $B, $Folder, $Text, $Continue = True
    While 1
        Switch GUIGetMsg()
            Case -3
                ExitLoop
            Case $NextButton
                $FileName = GUICtrlRead($Input)
                If $FileName == "" Then
                    MsgBox(0,"Error","You must type in a file name, or drag a file into the first input box before you continue.")
                ElseIf FileExists($FileName) Then
                    ConsoleWrite($FileName & @CR)
                    $A = StringInStr($FileName,"\",0,-1)
                    $Folder = GUICtrlRead($Input2)
                    $Continue = True
                    If $Folder == "" Then
                        $Folder = StringLeft($FileName,$A)
                        GUICtrlSetData($Input2,$Folder)
                        $Continue = False
                        ConsoleWrite($Folder & @CR)
                    EndIf
                    $Prefix = GUICtrlRead($Input3)
                    If $Prefix == "" Then
                        $A += 1
                        $B = StringInStr($FileName,".",0,-1)
                        ConsoleWrite($A & @CR & $B & @CR & StringLen($FileName) & @CR)
                        $Prefix = StringMid($FileName,$A,$B - $A)
                        GUICtrlSetData($Input3,$Prefix)
                        $Continue = False
                    EndIf
                    If $Continue Then
                        $Delimiter = GUICtrlRead($Input4)
                        ConsoleWrite("$Delimiter = " & $Delimiter & @CR)
                        If $Delimiter == "" Then
                            MsgBox(0,"Error","You did not type in a delimiter. Where do you want this program to split the file?")
                        Else
                            If MsgBox(4,"Question","Do you want to open the file in binary mode?") = 6 Then
                                _SplitFileHex($FileName,$Folder,$Prefix,$Delimiter)
                            Else
                                _SplitFile($FileName,$Folder,$Prefix,$Delimiter)
                            EndIf
                        EndIf
                    EndIf
                Else
                    MsgBox(0,"Error","I could not find the file. Please check the file path, and try again.")
                EndIf
        EndSwitch
    WEnd
    Exit
EndFunc
Func _SplitFileHex($FileName,$Folder,$Prefix,$Delimiter);The main function, Hex version
    ProgressOn("File Splitter","Processing Your File","0%",-1,-1,18)
    Local $File = FileOpen($FileName,16), $Text = FileRead($File)
    FileClose($File)
    Local $TextLen = StringLen($Text)
    If StringRight($Folder,1) <> "\" Then $Folder &= "\"
    DirCreate($Folder)
    If StringLeft($Delimiter,2) <> "0x" Then $Delimiter = StringToBinary($Delimiter)
    $Delimiter = StringTrimLeft($Delimiter,2)
    Local $Suffix = StringMid($FileName,StringInStr($FileName,".",0,-1));Search from the right so a dot in a folder name is not taken for the extension
    $FileName = $Folder & $Prefix & " - "
    Local $DelimiterLen = StringLen($Delimiter), $String = "", $Start = 3, $End, $Count
    Local $F = 0, $Percent
    While 1
        $F += 1
        $End = StringInStr($Text,$Delimiter,2,1,$Start)
        If $End Then
            $End += $DelimiterLen
            $File = FileOpen($FileName & $F & $Suffix, 26)
            $Count = $End - $Start
            FileWrite($File,"0x" & StringMid($Text,$Start,$Count))
            FileClose($File)
            $Start += $Count
            $Percent = Round(($Start/$TextLen)*100,2)
            ProgressSet($Percent,$Percent & "%","File " & $F)
        Else
            ExitLoop
        EndIf
    WEnd
    ProgressSet(100,"100%","Done!")
    ProgressOff()
    MsgBox(0,"Done","I've split your file.")
EndFunc
Func _SplitFile($FileName,$Folder,$Prefix,$Delimiter);The main function
    ProgressOn("File Splitter","Processing Your File","0%",-1,-1,18)
    Local $File = FileOpen($FileName), $Text = FileRead($File)
    FileClose($File)
    Local $TextLen = StringLen($Text)
    If StringRight($Folder,1) <> "\" Then $Folder &= "\"
    DirCreate($Folder)
    Local $Suffix = StringMid($FileName,StringInStr($FileName,".",0,-1));Search from the right so a dot in a folder name is not taken for the extension
    $FileName = $Folder & $Prefix & " - "
    Local $DelimiterLen = StringLen($Delimiter), $String = "", $Start = 1, $End, $Count
    Local $F = 0, $Percent
    While 1
        $F += 1
        $End = StringInStr($Text,$Delimiter,1,1,$Start)
        If $End Then
            $End += $DelimiterLen
            $File = FileOpen($FileName & $F & $Suffix, 26)
            $Count = $End - $Start
            FileWrite($File,StringMid($Text,$Start,$Count))
            FileClose($File)
            $Start += $Count
            $Percent = Round(($Start/$TextLen)*100,2)
            ProgressSet($Percent,$Percent & "%","File " & $F)
        Else
            ExitLoop
        EndIf
    WEnd
    ProgressSet(100,"100%","Done!")
    ProgressOff()
    MsgBox(0,"Done","I've split your file.")
EndFunc

Let me know what you think.


Melba23,

I can't thank you enough. That is EXACTLY what I was looking for. I just gave your script a test run and it works perfectly for what I need. I was able to get the original script to work by putting in some clunky lines of code myself (I had to add the delimiter to the top of the file, then do the splitting, then add the "END OF REPORT" line back to the bottom). But now I don't have to do that anymore because your code does it all for me ;)

I also would like to thank you for the nice welcome message. Yes, this was my first posting. AutoIt is an amazing tool and for a few months now I have been perusing the site, picking up bits and pieces on how to use it effectively. I really appreciate the help from an advanced user such as yourself. It is interesting to see such a useful script that can do a fast, clean, straight-forward text split via a word boundary; I looked around the internet quite a bit for a useful application that does this, and the quality options are extremely scarce.

Oscis: I have not tried out your suggestion, but I will, and I appreciate you providing it. I will let you know how it works.


Oscis,

I tried out your script. It's very nice; I like the front-end you built on to it. It makes things very convenient for a splitting operation that a user intends to run just once or twice. For everyday splitting operations, it's easy enough to integrate into another script, as you mentioned. But a couple of points, if I may:

1. Is it possible to make the script a bit more forgiving about the delimiter? For instance, I noticed that some of my "****** END OF REPORT ******" lines in the source file have an extra space or two in them (i.e., they look like this: "****** END OF REPORT   ******"). When I use Melba23's script, I simply use the term "END OF REPORT" as the delimiter and it seeks out any line that has that phrase, and it strips the entire line, no matter what the other characters are. That is what I need for my file splitting operations; maybe there is a way to build the function and/or front end so that a user could specify whether the delimiter phrase is verbatim (literal) or just a key phrase? In the event of a key phrase, the script would find the term in any line, make the split at that point, and put the entire line at the bottom of the split page.

2. This is more of a pie-in-the-sky request than anything else: can the splitting operation be controlled via a .txt file that acts as an index and has the page breaks clearly defined by FIRST LINE - LAST LINE? In other words, the text file will map out the splitting process, page by page. I realize the preparation to make the .txt file index is more time-consuming, but for more complex splitting operations, this would be a godsend. I've looked high and low for this kind of "split by index" functionality and not found it any place.
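
To make the second request concrete, a rough sketch of the idea (not a finished tool) might look like this. It assumes a hypothetical index.txt next to the script in which each line is a FIRST-LAST pair of line numbers with no spaces, such as 1-3; that format, the file names and the variable names are all invented here for illustration.

```autoit
#include <File.au3>

Global $aLines, $aIndex
$file = @ScriptDir & "\test.txt"        ; source file (placeholder path)
$indexfile = @ScriptDir & "\index.txt"  ; hypothetical index file, one FIRST-LAST pair per line, e.g. 1-3
$newfile = @ScriptDir & "\split-"

; Stop if either file cannot be read into an array
If Not _FileReadToArray($file, $aLines) Then Exit
If Not _FileReadToArray($indexfile, $aIndex) Then Exit

For $i = 1 To $aIndex[0]
    $aRange = StringSplit($aIndex[$i], "-")
    ; Skip blank or malformed index lines
    If $aRange[0] <> 2 Then ContinueLoop
    $iFirst = Number($aRange[1])
    $iLast = Number($aRange[2])
    ; Skip ranges that fall outside the source file
    If $iFirst < 1 Or $iLast > $aLines[0] Or $iFirst > $iLast Then ContinueLoop

    ; Collect the lines of this page in one string, then write it in one pass
    $sNewFile_Text = ""
    For $j = $iFirst To $iLast
        $sNewFile_Text &= $aLines[$j] & @CRLF
    Next
    $hOut = FileOpen($newfile & $i & ".txt", 2) ; 2 = write mode, erase previous contents
    FileWrite($hOut, $sNewFile_Text)
    FileClose($hOut)
Next
```

Output files are numbered by index line, so a skipped index line leaves a gap in the numbering.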

Anyway, I really appreciate your script that you sent, it is a nice piece of programming and does exactly what I had specified. Sorry to ask for more, but it seems like what you have built has tremendous possibilities for me and others.


I'm working on a second version of this program now. I'll see what I can do.


Thanks, Oscis. Whatever you can do is much appreciated. I'll play around with the code as well, I am an amateur at this point but it is fun to try ;)
