
Split text file based on the first two characters (script is slow)



Hello

I created a script that splits a text file into multiple files based on the first two characters of each line. For example:

ORIGINAL.txt:

about my brother and me.
About me?
Naturally, you can't know.
Nature must take her course!

The result of this example will be two files:

AB.txt

about my brother and me.
About me?

NA.txt

Naturally, you can't know.
Nature must take her course!

As you can see, the first two characters become the file name.

My script does the job.

So, what's the problem?

The problem is that my script is very slow with big files.

I tried it on a text file with 1,000,000 lines, and it took about half an hour to get through only 20%.

 

Here is my script:

#include <Array.au3>
#include <AutoItConstants.au3>
#include <File.au3>
#include <scriptingdic.au3>
;Download from: https://www.autoitscript.com/forum/topic/182334-scripting-dictionary-modified/

Global $Lines    
_FileReadToArray("ORIGINAL.txt", $Lines, $FRTA_NOCOUNT)  
Global $initArr = ["----"]

Global $dict = _InitDictionary()

$Total = UBound($Lines)
$LastRound = 0

For $i = 0 To UBound($Lines)-1  Step +1 
    ;Extract the first two characters of the current line
    $FirstTwoChar =  StringMid($Lines[$i], 1, 2)
    
    ;Replace symbols that are not valid for file names
    $FirstTwoChar = StringReplace($FirstTwoChar, " ", "_")
    $FirstTwoChar = StringReplace($FirstTwoChar, "<", "_")
    $FirstTwoChar = StringReplace($FirstTwoChar, ">", "_")
    $FirstTwoChar = StringReplace($FirstTwoChar, "?", "_")
    $FirstTwoChar = StringReplace($FirstTwoChar, '"', "_")
    $FirstTwoChar = StringReplace($FirstTwoChar, "|", "_")
    $FirstTwoChar = StringReplace($FirstTwoChar, ":", "_")
    $FirstTwoChar = StringReplace($FirstTwoChar, "\", "_")
    $FirstTwoChar = StringReplace($FirstTwoChar, "/", "_")

    ;Add the first two characters as a key in the dictionary with an array as its value.
    if not _ItemExists($dict, $FirstTwoChar) then
        $initArr[0] = $FirstTwoChar
        _AddItem($dict, $FirstTwoChar, $initArr)
    EndIf
    
    ;Add the current line to the array.
    $tmpArray = _Item($dict, $FirstTwoChar)
    _ArrayAdd($tmpArray ,$Lines[$i])
    _ChangeItem($dict, $FirstTwoChar, $tmpArray)

    ;Show progress on the screen
    $Percent = $i / $Total * 100
    if round($Percent) <> $LastRound then 
        ToolTip('...'&round($Percent)&"%",0,5)
        $LastRound = round($Percent)
    EndIf
Next

;Save each array as text file
DirCreate("result")
For $Key In $Dict
    $FinalArray = _Item($dict, $Key)
    $FileName = $FinalArray[0]
    _ArrayDelete($FinalArray, 0)
    ;_ArrayDisplay($FinalArray)
    _FileWriteFromArray("result\"&$FileName&".txt", $FinalArray)
Next


My constraint:
* Lines must stay in their original order. The order of the lines cannot change while the file is processed.

Any ideas on how to make this script faster?

Thanks.

Example of ORIGINAL text file.rar

Edited by MajKSA

I expect your slowdown comes from _ArrayAdd: each call makes AutoIt re-dimension the array. Can you define the array size at the start and then assign values by index?

Edit: Or burn some memory and set up large empty arrays, then assign to them.
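
To illustrate the suggestion, here is a minimal sketch (not code from the thread) that fills a single array by direct index assignment and doubles its capacity only when it runs out of room. The names $aItems and $iCount and the starting capacity of 16 are assumptions made for the example. In the original script the arrays also live inside a dictionary, which probably adds copying of its own each time an array is fetched and stored back, so this shows the growth pattern only.

```autoit
; Sketch: direct assignment with a manual counter and doubling growth,
; instead of one _ArrayAdd (and therefore one resize) per line.
Local $aItems[16]   ; initial capacity, chosen arbitrarily for this example
Local $iCount = 0   ; number of slots actually in use

For $i = 1 To 100000
    ; Grow only when the array is full; doubling keeps resizes rare
    If $iCount = UBound($aItems) Then
        ReDim $aItems[UBound($aItems) * 2]
    EndIf
    $aItems[$iCount] = "line " & $i
    $iCount += 1
Next

; Trim the unused tail once, after the loop
ReDim $aItems[$iCount]
ConsoleWrite(UBound($aItems) & @CRLF) ; 100000
```

With doubling, 100,000 items need about 13 resizes in total rather than one per item.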

Edited by SlackerAl

Problem solving step 1: Write a simple, self-contained, running, replicator of your problem.


Thanks for the reply, SlackerAl.

I changed the following lines:

Before: Global $initArr = ["----"]
After:  Global $initArr[2500]

Before: _ArrayAdd($tmpArray ,$Lines[$i])
After:  _ArrayPush($tmpArray, $Lines[$i])

(I also added a new line after _ArrayPush, "$tmpArray[0] = $FirstTwoChar", to make sure the first item is always the file name.)
But no luck: the script is still slow.
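
A note on why this change is unlikely to help: in the standard Array.au3 UDF, _ArrayPush does not grow an array. It shifts the existing elements by one position and drops the value at the far end, so a fixed 2500-slot array would also start losing lines once a bucket fills up. A minimal sketch with a made-up three-element array ($aDemo is an illustrative name, not part of the script above):

```autoit
#include <Array.au3>

; Illustration only: a tiny fixed-size array standing in for $tmpArray
Local $aDemo[3] = ["A", "B", "C"]

; Default direction: the new value goes in at the end, the first value is dropped
_ArrayPush($aDemo, "D")

ConsoleWrite(UBound($aDemo) & @CRLF)               ; 3 (the size is unchanged)
ConsoleWrite(_ArrayToString($aDemo, "|") & @CRLF)  ; B|C|D
```

Every push still touches each element, so the per-line cost grows with the array size just as it did with _ArrayAdd.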

Edited by MajKSA

I'm not sure of the cost of a push, but it still shifts the index of every element in the array. Can you rework your code to assign your values directly to the array(s)? How many possible two-letter combos are you expecting: the full 26^2, or just a small subset of that? Could you collect each combo in its own array with direct assignment?

 

Edit: OK I see you have a large number of possible pairs.... I'll think about it for a bit

Edited by SlackerAl


Try this baby :)

#include <Array.au3>
#include <AutoItConstants.au3>
#include <File.au3>

Global $Lines
_FileReadToArray("ORIGINAL.txt", $Lines, $FRTA_NOCOUNT)

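Global $oDict = ObjCreate("Scripting.Dictionary")
$oDict.CompareMode = 1 ; text compare, so "ab" and "Ab" share one key (and one file) and their lines stay in order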

Local $Total = UBound($Lines), $LastRound = 0, $FirstTwoChar

For $i = 0 To $Total - 1
  ;Extract the first two characters of the current line
  $FirstTwoChar = StringInStr(' <>?"|:\/', StringMid($Lines[$i], 1, 1)) ? "_" : StringMid($Lines[$i], 1, 1)
  $FirstTwoChar &= StringInStr(' <>?"|:\/', StringMid($Lines[$i], 2, 1)) ? "_" : StringMid($Lines[$i], 2, 1)

  ;Add the first two characters as a key in the dictionary with its value.
  If Not $oDict.Exists($FirstTwoChar) Then
    $oDict.Add($FirstTwoChar, $Lines[$i] & @CRLF)
  Else   ;Add the current line to the dict.
    $oDict.Item($FirstTwoChar) = $oDict.Item($FirstTwoChar) & $Lines[$i] & @CRLF
  EndIf

  ;Show progress on the screen
  If Not Mod($i, 100) Then
    $Percent = $i / $Total * 100
    If Round($Percent) <> $LastRound Then
      ToolTip('...' & Round($Percent) & "%", 0, 5)
      $LastRound = Round($Percent)
    EndIf
  EndIf
Next

;Save each item as text file
DirCreate("result")
For $Key In $oDict
  FileWrite("result\" & $Key & ".txt", $oDict.Item ($Key))
Next

Not fully tested, but I believe it is quite close to what you are looking for...
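
One possible refinement, offered as a sketch rather than as part of the posted script: build the key in a small helper so that every line maps to a usable file name. The helper name _BucketKey is hypothetical. Compared with the character list above, it also replaces "*" and control characters, pads keys that come from empty or one-character lines, and uppercases the result so that "ab" and "Ab" land in the same bucket, as in the AB.txt example from the first post.

```autoit
; Hypothetical helper (not from the thread): returns a file-name-safe, two-character bucket key
Func _BucketKey($sLine)
    Local $sKey = StringLeft($sLine, 2)

    ; Replace spaces, control characters, and characters Windows rejects in file names
    $sKey = StringRegExpReplace($sKey, '[\x00-\x20<>:"/\\|?*]', "_")

    ; Pad keys from empty or one-character lines so the file name is never just ".txt"
    While StringLen($sKey) < 2
        $sKey &= "_"
    WEnd

    ; Uppercase so keys that differ only in case are merged
    Return StringUpper($sKey)
EndFunc

ConsoleWrite(_BucketKey("about my brother and me.") & @CRLF) ; AB
ConsoleWrite(_BucketKey("a?") & @CRLF)                       ; A_
ConsoleWrite(_BucketKey("") & @CRLF)                         ; __
```

In the loop, the two StringInStr lines could then be replaced by a single call such as $FirstTwoChar = _BucketKey($Lines[$i]). Merging the case variants matters because Windows file names are case-insensitive and FileWrite appends when given a file name, so separate "ab" and "Ab" keys would end up in one file with their lines grouped by key instead of in the original order.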

Edited by Nine

#include<File.au3>

$str = "about my brother and me." & @LF & _
"About me?" & @LF & _
"Babout my brother and me." & @LF & _
"BAbout me?" & @LF & _
"Naturally, you can't know." & @LF & _
"Nature must tabke her course!"


Do

    ; \Q...\E quotes the two-character prefix so that regex metacharacters in it match literally
    $a = stringregexp($str , "(?:\A|\R)(?im:\Q" & stringleft($str , 2) & "\E.*)" , 3)

    _FileWriteFromArray(stringleft($str , 2) & ".txt" , $a)

    $str = stringstripws(stringregexpreplace($str , "(?:\A|\R)(?im:\Q" & stringleft($str , 2) & "\E.*)" , "") , 1)


Until $str = ""

 

 

Edited by iamtheky
testing edges



@Nine @iamtheky @SlackerAl
Hello guys, and thanks for your contributions.
After testing, Nine's code was the fastest, and compared with my old code it is a huge improvement.
It took only one minute to process 100,000 lines, which is great compared to the old script. 👍

I can work with this for now.
Thanks, everyone.
🙂

 

Edited by MajKSA

28 minutes ago, MajKSA said:

Nine's code was the fastest and after comparing it with my old code, it was a huge improvement.
Took only one minute to finish 100000 lines and this is great compared to the old one.

:thumbsup:


20 hours ago, Exit said:

So it's really time to introduce Maps in the production version as well.

They don't work correctly yet, so they won't be added. Scripting Dictionary can be put into a UDF to do the same, or almost the same, thing so it's not really that much of a rush to add broken implementations.

If I posted any code, assume that code was written using the latest release version unless stated otherwise. Also, if it doesn't work on XP I can't help with that because I don't have access to XP, and I'm not going to.
Give a programmer the correct code and he can do his work for a day. Teach a programmer to debug and he can do his work for a lifetime - by Chirag Gude
How to ask questions the smart way!

I hereby grant any person the right to use any code I post, that I am the original author of, on the autoitscript.com forums, unless I've specifically stated otherwise in the code or the thread post. If you do use my code all I ask, as a courtesy, is to make note of where you got it from.

Back up and restore Windows user files _Array.au3 - Modified array functions that include support for 2D arrays.  -  ColorChooser - An add-on for SciTE that pops up a color dialog so you can select and paste a color code into a script.  -  Customizable Splashscreen GUI w/Progress Bar - Create a custom "splash screen" GUI with a progress bar and custom label.  -  _FileGetProperty - Retrieve the properties of a file  -  SciTE Toolbar - A toolbar demo for use with the SciTE editor  -  GUIRegisterMsg demo - Demo script to show how to use the Windows messages to interact with controls and your GUI.  -   Latin Square password generator


1 hour ago, BrewManNH said:

Scripting Dictionary can be put into a UDF to do the same,

There is already a UDF.  But it is useless.  Most functions replace a one liner by another one liner...

Link to comment
Share on other sites

the unofficial udf is pretty sexy tho, and nobody is making a better one.  😎

 


3 minutes ago, Nine said:

There is already a UDF.  But it is useless.  Most functions replace a one liner by another one liner...

Exactly.

Besides scripting dictionary objects being more verbose to use, they need care to generalize since they don't handle int64. See the notes in the _ArrayUnique() help, for instance.

This wonderful site allows debugging and testing regular expressions (many flavors available). An absolute must have in your bookmarks.
Another excellent RegExp tutorial. Don't forget downloading your copy of up-to-date pcretest.exe and pcregrep.exe here
RegExp tutorial: enough to get started
PCRE v8.33 regexp documentation latest available release and currently implemented in AutoIt beta.

SQLitespeed is another feature-rich premier SQLite manager (includes import/export). Well worth a try.
SQLite Expert (freeware Personal Edition or payware Pro version) is a very useful SQLite database manager.
An excellent eBook covering almost every aspect of SQLite3: a must-read for anyone doing serious work.
SQL tutorial (covers "generic" SQL, but most of it applies to SQLite as well)
A work-in-progress SQLite3 tutorial. Don't miss other LxyzTHW pages!
SQLite official website with full documentation (may be newer than the SQLite library that comes standard with AutoIt)


1 hour ago, Nine said:

Most functions replace a one liner by another one liner...

A lot of the Misc.au3 functions are like that. Look at RunDos: it replaces a Run statement for the lazy. It's mainly a documentation issue.
