Split text file based on the first two characters (script is slow)


Hello

I created a script that splits a text file into multiple files based on the first two characters of each line. Example:

ORIGINAL.txt:

about my brother and me.
About me?
Naturally, you can't know.
Nature must take her course!

The result of this example will be two files:

AB.txt

about my brother and me.
About me?

NA.txt

Naturally, you can't know.
Nature must take her course!

As you can see, the first two characters become the file name.

My script does the job.

So, what's the problem?

The problem is that my script is very slow with big files.

I tried it with a text file of 1,000,000 lines, and it took about half an hour to finish only 20%.

Here is my script:

#include <Array.au3>
#include <AutoItConstants.au3>
#include <File.au3>
#include <scriptingdic.au3>
;Download from: https://www.autoitscript.com/forum/topic/182334-scripting-dictionary-modified/

Global $Lines    
_FileReadToArray("ORIGINAL.txt", $Lines, $FRTA_NOCOUNT)  
Global $initArr = ["----"]

Global $dict = _InitDictionary()

$Total = UBound($Lines)
$LastRound = 0

For $i = 0 To UBound($Lines)-1  Step +1 
    ;Extract the first two characters of the current line
    $FirstTwoChar =  StringMid($Lines[$i], 1, 2)
    
    ;Replace symbols that are not valid for file names
    $FirstTwoChar = StringReplace($FirstTwoChar, " ", "_")
    $FirstTwoChar = StringReplace($FirstTwoChar, "<", "_")
    $FirstTwoChar = StringReplace($FirstTwoChar, ">", "_")
    $FirstTwoChar = StringReplace($FirstTwoChar, "?", "_")
    $FirstTwoChar = StringReplace($FirstTwoChar, '"', "_")
    $FirstTwoChar = StringReplace($FirstTwoChar, "|", "_")
    $FirstTwoChar = StringReplace($FirstTwoChar, ":", "_")
    $FirstTwoChar = StringReplace($FirstTwoChar, "\", "_")
    $FirstTwoChar = StringReplace($FirstTwoChar, "/", "_")

    ;Add the first two characters as a key in the dictionary with an array as its value.
    if not _ItemExists($dict, $FirstTwoChar) then
        $initArr[0] = $FirstTwoChar
        _AddItem($dict, $FirstTwoChar, $initArr)
    EndIf
    
    ;Add the current line to the array.
    $tmpArray = _Item($dict, $FirstTwoChar)
    _ArrayAdd($tmpArray ,$Lines[$i])
    _ChangeItem($dict, $FirstTwoChar, $tmpArray)

    ;Show progress on the screen
    $Percent = $i / $Total * 100
    if round($Percent) <> $LastRound then 
        ToolTip('...'&round($Percent)&"%",0,5)
        $LastRound = round($Percent)
    EndIf
Next

;Save each array as text file
DirCreate("result")
For $Key In $Dict
    $FinalArray = _Item($dict, $Key)
    $FileName = $FinalArray[0]
    _ArrayDelete($FinalArray, 0)
    ;_ArrayDisplay($FinalArray)
    _FileWriteFromArray("result\"&$FileName&".txt", $FinalArray)
Next


My limitations:
* Lines must stay in the same order. The order of the lines can't change while the file is being processed.

Any ideas on how to make this script faster?

Thanks.

Example of ORIGINAL text file.rar

Edited by MajKSA

I expect your slowdown comes from _ArrayAdd. Each time you call it, AutoIt re-dimensions the array. Can you define the array size at the start and then assign values?

Edit: Or burn some memory and set up large empty arrays then assign to them.
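
A minimal sketch of the idea, not a drop-in fix for the script above: instead of growing the array by one element per line, keep a counter in slot 0 and double the capacity only when the array is full. The helper name _BufAppend and the starting size of 16 are made up for illustration; neither comes from Array.au3. Whether this helps once the arrays are stored inside the dictionary is untested.

```autoit
; Hypothetical helper: append $vValue to $aBuf, doubling the capacity when full.
; $aBuf[0] holds the number of used slots, so ReDim runs only a handful of times.
Func _BufAppend(ByRef $aBuf, $vValue)
    Local $iUsed = $aBuf[0] + 1
    If $iUsed >= UBound($aBuf) Then ReDim $aBuf[UBound($aBuf) * 2]
    $aBuf[$iUsed] = $vValue
    $aBuf[0] = $iUsed
EndFunc

Local $aBuf[16] = [0] ; slot 0 = count; the starting capacity is an arbitrary choice
For $i = 1 To 100000
    _BufAppend($aBuf, "line " & $i)
Next
ReDim $aBuf[$aBuf[0] + 1] ; trim the unused slots once at the end
ConsoleWrite("Stored " & $aBuf[0] & " lines" & @CRLF)
```

Lines are written in arrival order, so the original line order is kept.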

Edited by SlackerAl

Problem solving step 1: Write a simple, self-contained, running, replicator of your problem.


Thanks for the reply, SlackerAl.

I changed the following lines:

Global $initArr = ["----"]
to
Global $initArr[2500]

and

_ArrayAdd($tmpArray ,$Lines[$i])
to
_ArrayPush($tmpArray, $Lines[$i])

(I also added a new line after _ArrayPush, "$tmpArray[0] = $FirstTwoChar", to make sure the first item is always the file name.)
But no luck, the script is still slow.

Edited by MajKSA

I'm not sure of the cost of a push (that's still an index change to everything in the array). Can you not re-work your code to directly assign your values to the array(s)? How many possible 2 letter combos are you expecting? Is it the full 26^2 or just a small subset of that? Could you collect each combo in its own array with direct assignment?

 

Edit: OK I see you have a large number of possible pairs.... I'll think about it for a bit
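
For reference, a small sketch of what _ArrayPush does to a fixed-size array, as far as I understand the Array.au3 documentation: the size never changes, the new value goes in at the end, and the first element is dropped while everything else shifts. The three-element array is just an illustration.

```autoit
#include <Array.au3>

Local $aFixed[3] = ["a", "b", "c"]
_ArrayPush($aFixed, "d") ; size stays 3: "a" is dropped and the rest shift left
ConsoleWrite(_ArrayToString($aFixed, ",") & @CRLF) ; prints b,c,d
```

So with a 2500-element array, every push still touches every slot, and once more than 2500 lines share a prefix the oldest ones would be lost.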

Edited by SlackerAl


Try this baby :)

#include <Array.au3>
#include <AutoItConstants.au3>
#include <File.au3>

Global $Lines
_FileReadToArray("ORIGINAL.txt", $Lines, $FRTA_NOCOUNT)

Global $oDict = ObjCreate("Scripting.Dictionary")

Local $Total = UBound($Lines), $LastRound = 0, $FirstTwoChar

For $i = 0 To $Total - 1
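  ;Extract the first two characters of the current line, replacing characters that are not valid in file names
  $FirstTwoChar = StringInStr(' <>?"|:\/', StringMid($Lines[$i], 1, 1)) ? "_" : StringMid($Lines[$i], 1, 1)
  $FirstTwoChar &= StringInStr(' <>?"|:\/', StringMid($Lines[$i], 2, 1)) ? "_" : StringMid($Lines[$i], 2, 1)
  ;Upper-case the key so "ab" and "Ab" share one dictionary entry (and one file), which keeps the original line order
  $FirstTwoChar = StringUpper($FirstTwoChar)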

  ;Add the first two characters as a key in the dictionary with its value.
  If Not $oDict.Exists($FirstTwoChar) Then
    $oDict.Add($FirstTwoChar, $Lines[$i] & @CRLF)
  Else   ;Add the current line to the dict.
    $oDict.Item($FirstTwoChar) = $oDict.Item($FirstTwoChar) & $Lines[$i] & @CRLF
  EndIf

  ;Show progress on the screen
  If Not Mod($i, 100) Then
    $Percent = $i / $Total * 100
    If Round($Percent) <> $LastRound Then
      ToolTip('...' & Round($Percent) & "%", 0, 5)
      $LastRound = Round($Percent)
    EndIf
  EndIf
Next

;Save each item as text file
DirCreate("result")
For $Key In $oDict
  FileWrite("result\" & $Key & ".txt", $oDict.Item ($Key))
Next

Not fully tested, but I believe it is quite close to what you are looking for...

Edited by Nine
#include<File.au3>

$str = "about my brother and me." & @LF & _
"About me?" & @LF & _
"Babout my brother and me." & @LF & _
"BAbout me?" & @LF & _
"Naturally, you can't know." & @LF & _
"Nature must tabke her course!"


Do

    $a = stringregexp($str , "(?:\A|\R)(?im:" & stringleft($str , 2) & ".*)" , 3)

    _FileWriteFromArray(stringleft($str , 2) & ".txt" , $a)

    $str = stringstripws(stringregexpreplace($str , "(?:\A|\R)(?im:" & stringleft($str , 2) & ".*)" , "") , 1)


Until $str = ""

 

 

Edited by iamtheky
testing edges

@Nine @iamtheky @SlackerAl
Hello guys, and thanks for your contributions.
After testing,
Nine's code was the fastest, and compared with my old code it was a huge improvement.
It took only one minute to finish 100000 lines, which is great compared to the old one. 👍

I can work with this for now.
Thanks, everyone.
🙂

 

Edited by MajKSA
28 minutes ago, MajKSA said:

Nine's code was the fastest and after comparing it with my old code, it was a huge improvement.
Took only one minute to finish 100000 lines and this is great compared to the old one.

:thumbsup:

20 hours ago, Exit said:

So it's really time to introduce Maps in the production version as well.

They don't work correctly yet, so they won't be added. Scripting Dictionary can be put into a UDF to do the same, or almost the same, thing so it's not really that much of a rush to add broken implementations.

If I posted any code, assume that code was written using the latest release version unless stated otherwise. Also, if it doesn't work on XP I can't help with that because I don't have access to XP, and I'm not going to.
Give a programmer the correct code and he can do his work for a day. Teach a programmer to debug and he can do his work for a lifetime - by Chirag Gude
How to ask questions the smart way!

I hereby grant any person the right to use any code I post, that I am the original author of, on the autoitscript.com forums, unless I've specifically stated otherwise in the code or the thread post. If you do use my code all I ask, as a courtesy, is to make note of where you got it from.

Back up and restore Windows user files _Array.au3 - Modified array functions that include support for 2D arrays.  -  ColorChooser - An add-on for SciTE that pops up a color dialog so you can select and paste a color code into a script.  -  Customizable Splashscreen GUI w/Progress Bar - Create a custom "splash screen" GUI with a progress bar and custom label.  -  _FileGetProperty - Retrieve the properties of a file  -  SciTE Toolbar - A toolbar demo for use with the SciTE editor  -  GUIRegisterMsg demo - Demo script to show how to use the Windows messages to interact with controls and your GUI.  -   Latin Square password generator

1 hour ago, BrewManNH said:

Scripting Dictionary can be put into a UDF to do the same,

There is already a UDF.  But it is useless.  Most functions replace a one liner by another one liner...


the unofficial udf is pretty sexy tho, and nobody is making a better one.  😎

 

3 minutes ago, Nine said:

There is already a UDF.  But it is useless.  Most functions replace a one liner by another one liner...

Exactly.

Besides being more verbose to use, scripting dictionary objects need care to generalize since they don't handle int64.  See the notes in the _ArrayUnique() help, for instance.

1 hour ago, Nine said:

Most functions replace a one liner by another one liner...

A lot of the Misc.au3 functions are like that, look at RunDos, replaces a Run statement for the lazy. It's mainly a documentation issue.
