Quickest way to add data to data (i.e. collect extracted text into a single collection)



Hello everyone

[TL;DR: what is the most efficient way to collect and save extracted text into a new text file?]

I want to extract information from large tab-delimited text files. (The files typically contain about 50,000-200,000 records (i.e. lines), each record contains 7-10 fields (i.e. bits of text separated by tabs), and I want to save the text from the 5th and 7th fields to a separate text file.)

Anyway, I was experimenting with different ways of speeding up this process, and I found that (as one would expect) the method chosen for adding new data to existing data affects the processing speed. I used a test file of 20,000 lines.

The method that I used originally was to simply write the extracted data to a text file immediately after extracting it:

For $i = 1 To 20000
    ; ... extract the relevant text into $extractedtext ...
    FileWrite($outputfile, $extractedtext & @CRLF)
Next

and it took 31 seconds to process the entire file.

Then I thought to myself: surely writing data to a file 20,000 times in 30 seconds cannot be the most efficient way (I do not close and re-open the file after each write). I then tried the following, thinking that it should be faster:

For $i = 1 To 20000
    ; ... extract the relevant text into $extractedtext ...
    $outputfilecontent = $outputfilecontent & $extractedtext & @CRLF
Next
FileWrite($outputfile, $outputfilecontent)

but then it took 60 seconds to process the test file. 

I speculated that it must have taken longer because the script was repeatedly re-copying the entire accumulated content each time new data was appended to it.  So then I tried using an array, thinking that it would just add new data to the collection without touching the existing items each time:

Global $array[1]
For $i = 1 To 20000
    ; ... extract the relevant text into $extractedtext ...
    _ArrayAdd($array, $extractedtext)
Next
$outputfilecontent = _ArrayToString($array, @CRLF)
FileWrite($outputfile, $outputfilecontent)

and it took 176 seconds.

But I quickly realised my mistake (_ArrayAdd has to resize the array on every call), and found that defining the array size beforehand solved that problem:

Global $array[20000]
For $i = 0 To 19999
    ; ... extract the relevant text into $extractedtext ...
    $array[$i] = $extractedtext
Next
$outputfilecontent = _ArrayToString($array, @CRLF)
FileWrite($outputfile, $outputfilecontent)

which took 29 seconds.
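For what it's worth, the pattern behind these numbers is general, not AutoIt-specific: growing a single string piece by piece can copy the accumulated text on every step (quadratic work overall), while collecting the pieces and joining them once at the end does a single final copy. A sketch of the two approaches in Python (names and data are illustrative, not the AutoIt code above):

```python
N = 20000
piece = "field5\tfield7"  # stand-in for one extracted record

# Grow one string step by step: each concatenation may copy everything so far
out = ""
for _ in range(N):
    out = out + piece + "\n"

# Collect the pieces and join once at the end: a single final copy
parts = [piece for _ in range(N)]
joined = "\n".join(parts) + "\n"

assert out == joined  # both build exactly the same text
```

(CPython happens to optimize some in-place string concatenations, so the measured gap there can be smaller than the big-O argument suggests; the 60-second timing above suggests AutoIt performs the full copy each time.)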

I'm still baffled about why using file-write is one of the fastest ways of collecting data into a single collection.

Anyway, to my question: do you know of another method of adding data to data (i.e. of collecting the extracted text in one place) that is, or could possibly be, even faster?  I don't mind whether the new data is added to the start or the end of the existing data.

Thanks

Samuel

Edited by leuce

Could you test this?

#cs -- one-off generator for a 20000-line, 10-field tab-delimited test file
$txt = ""
For $i = 1 To 20000
    $line = ""
    For $j = 1 To 10
        $line &= "field " & $i & "-" & $j & @TAB
    Next
    $txt &= StringTrimRight($line, 1) & @CRLF
Next
FileWrite("test.txt", $txt)
#ce

; read the whole file, then extract fields 5 and 7 of every line in one regex pass
$s = FileRead("test.txt")
$ex = StringRegExpReplace($s, '(?m)^(?:[^\t]+\t){4}([^\t]+)\t[^\t]+\t([^\t]+).*', "$1,$2")
FileWrite("result.txt", $ex)
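For comparison, the same single-pass idea can be sketched in Python; the sample line and field contents below are illustrative, and the character classes additionally exclude newlines so greedy matching cannot spill across lines:

```python
import re

# Illustrative line: fields 1-4, then field 5, field 6, field 7, and extras
sample = "a\tb\tc\td\tFIELD5\tf\tFIELD7\textra"

# Skip 4 tab-terminated fields, capture the 5th, skip the 6th, capture the 7th
pattern = re.compile(r'(?m)^(?:[^\t\n]+\t){4}([^\t\n]+)\t[^\t\n]+\t([^\t\n]+).*')
result = pattern.sub(r'\1,\2', sample)
print(result)  # -> FIELD5,FIELD7
```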

Edit: just tested with a 200k-line txt file, I get 1.35 sec on my antique PC  :)

Edited by mikell

Thanks, Mikell, for thinking outside the box like that.  Your "method" takes 0.15 seconds to extract columns 5 and 7 from my 20,000-line file.

I neglected to mention (for the sake of simplicity) that I do additional processing of the text before writing it: for example, I perform a word count of the extracted text and then decide, based on that count, whether to include it in the collection.  That is not directly possible with your method, but I'll certainly investigate how to use it to my advantage.  Thanks again.


Hello Mikell

Thank you for your effort, but the original idea with my script was basically to satisfy my curiosity about the average sentence length in the file.

My output file was tab-delimited too, with four columns: the text from the 5th column, a word count for that column, the text from the 7th column, and a word count for that column.  I performed the word count by counting spaces using StringReplace and then adding 1.  (A character count would also have been somewhat useful, if a word count weren't possible.)  Then the plan was to copy the results to Excel and do various things with the data 🙂
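The space-counting trick carries over directly to other languages; here is a minimal Python sketch of the same "count the separators and add one" idea (it shares the original's assumption that words are separated by single spaces):

```python
def word_count(text):
    # Mirror of the StringReplace/@extended trick in the AutoIt script:
    # count the spaces between words, then add 1 for the final word.
    return text.count(" ") + 1

print(word_count("the quick brown fox"))  # -> 4
```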

 Here is my original script (using StringSplit to find the text in each line):

$time = TimerInit()

$fileopen = FileOpen("foo.txt", 32)         ; 32 = read as UTF-16
$filewrite = FileOpen("foo_write.txt", 33)  ; 33 = append + UTF-16
$fileread = FileRead($fileopen)
$lines = StringSplit($fileread, @CRLF, 1)

For $i = 1 To $lines[0]
    $tabs = StringSplit($lines[$i], @TAB, 1)
    If $tabs[0] > 6 Then
        $sl = $tabs[5]
        $tl = $tabs[7]

        ; replace spaces with spaces purely to get the space count in @extended
        $slspacecount = StringReplace($sl, " ", " ")
        $slcount = @extended + 1
        $tlspacecount = StringReplace($tl, " ", " ")
        $tlcount = @extended + 1

        FileWrite($filewrite, $sl & @TAB & $slcount & @TAB & $tl & @TAB & $tlcount & @CRLF)
    EndIf
Next

MsgBox(0, "", TimerDiff($time), 0)

And I tried using StringMid instead of StringSplit to find the text, too, but found that the speed was about the same:

$time = TimerInit()

$fileopen = FileOpen("foo.txt", 32)         ; 32 = read as UTF-16
$filewrite = FileOpen("foo_write.txt", 33)  ; 33 = append + UTF-16
$fileread = FileRead($fileopen)
$lines = StringSplit($fileread, @CRLF, 1)

For $i = 1 To $lines[0]
    If StringInStr($lines[$i], @TAB, 0, 7) Then ; at least 7 tabs in the line
        $tab4 = StringInStr($lines[$i], @TAB, 0, 4)
        $tab5 = StringInStr($lines[$i], @TAB, 0, 5)
        $tab6 = StringInStr($lines[$i], @TAB, 0, 6)
        $tab7 = StringInStr($lines[$i], @TAB, 0, 7)

        $sl = StringMid($lines[$i], $tab4 + 1, $tab5 - $tab4 - 1)
        $tl = StringMid($lines[$i], $tab6 + 1, $tab7 - $tab6 - 1)

        ; replace spaces with spaces purely to get the space count in @extended
        $slspacecount = StringReplace($sl, " ", " ")
        $slcount = @extended + 1
        $tlspacecount = StringReplace($tl, " ", " ")
        $tlcount = @extended + 1

        FileWrite($filewrite, $sl & @TAB & $slcount & @TAB & $tl & @TAB & $tlcount & @CRLF)
    EndIf
Next

MsgBox(0, "", TimerDiff($time), 0)

I can do a character count fairly easily in Excel too, so I can use your method to reduce the text to the relevant columns and then use a =LEN formula in Excel.

Doing a word count (or even a character count) using just regex would be an accomplishment 🙂

Samuel

Edited by leuce
15 hours ago, leuce said:

I'm still baffled about why using file-write is one of the fastest ways of collecting data into a single collection.

It’s because, assuming you use a handle obtained with FileOpen(), each call only performs a buffered write, with the physical disk not being updated until:

1) you close the file with FileClose(),

2) you explicitly call FileFlush(), or

3) the program terminates, which closes the file.
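The effect is easy to demonstrate from any language that buffers writes. A small Python sketch (the file name is illustrative): data written through a still-open handle sits in the user-space buffer and does not appear in the file until a flush or close:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "buffered.txt")
f = open(path, "w")              # buffered handle, analogous to FileOpen()
f.write("x" * 10)                # lands in the user-space buffer first
size_before_flush = os.path.getsize(path)  # typically still 0
f.flush()                        # analogous to FileFlush(): hand buffer to the OS
size_after_flush = os.path.getsize(path)   # now 10
f.close()                        # analogous to FileClose(): flush and release
print(size_before_flush, size_after_flush)
```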

 

Code hard, but don’t hard code...

9 minutes ago, JockoDundee said:

It’s because, <snip> it’s only doing a buffered write every time, with the physical disk not being updated...

<brag mode on>

In 1992, under DOS 6.22, I wrote a Z80 interpreter so that I could interpret in-house software currently running on a multi-user Z80 operating system on imported hardware to run on an IBM compatible. (A lot of layers there. We imported a multi-user Z80 based system and wrote code for it. With the interpreter we could run the code on a PC.)

As much as we hated M$ I was amazed at the efficiency of the OS and the way that it buffered disk writes, delaying them only until local buffers needed to be flushed and performing intensive disk IO entirely within RAM wherever possible. I found there was no need for me to perform any disk buffering in my interpreter because DOS was taking care of everything. It was marvelous. I also hooked the clock interrupt and used this for the partition switching that the emulated OS required.

I was even more amazed later on when one of my colleagues reported that he had my interpreter running successfully under an early version of Windows, something which I thought impossible given that my assembly code was performing direct BIOS disk reads and writes and talking directly to the serial port ACIA hardware and timer.

In the years that followed I have discovered that the same basic functionality from DOS 6.22 has persisted through all the versions of Windoze which followed. I have probably mentioned the buffering in other posts.

<brag mode off>

 

 

Phil Seakins

3 hours ago, pseakins said:

I wrote a Z80 interpreter so that I could interpret in-house software currently running on a multi-user Z80 operating system on imported hardware to run on an IBM compatible.

So you wrote a Z80 assembly/machine code interpreter for the 8088/8086?

3 hours ago, pseakins said:

…my assembly code was performing direct BIOS disk reads and writes…

As in int 13h etc.? But I thought you were using DOS for the writes?

Fwiw, I still have my IBM PC BIOS hardcover reference manual.  Back in the day we worshipped it like it was a black art :)

Edited by JockoDundee


8 hours ago, JockoDundee said:

It’s because, assuming you use a handle obtained with Fileopen(), it’s only doing a buffered write every time, with the physical disk not being updated until...

Actually, I wasn't even thinking of the disk writing.  I just thought that the act of handing data over to the OS would be slower than processing the data in the script's own memory space, so to speak.

Anyway, you're right about the OS not writing stuff to the disk immediately -- I added a FileGetSize call inside the loop and appended the reported size to each line of the output file, and the file size increments only every few hundred iterations. (This experiment assumes, of course, that FileGetSize can keep up at this speed -- perhaps the size it reports is updated less frequently than the script cycles through lines.)

On 7/19/2021 at 2:31 PM, JockoDundee said:

So you wrote an Z80 assembly/machine code interpreter for the 8088/8086 ?

I wrote 80286 assembly language code which interpreted Z80 machine code.  There were 6166 lines of code. My bible was a book by Leo Scanlon, "80286 Assembly Language on MS-DOS Computers", and of course a Zilog Z80 data book, which is not where it should be on my bookshelf. Someone on fidonet had a comprehensive DOS interrupt manual which I constantly referred to. Can't remember the name, but it was well famous at the time. EDIT2: It was Ralf Brown.

 

On 7/19/2021 at 2:31 PM, JockoDundee said:

As in 13h etc? But I thought you were using DOS for the writes ?

Well caught. You are correct, I was not using int 13h; I used int 21h to access the disk. I used various other interrupts for the Wyse 50 terminal emulator, keyboard, physical ports and other stuff.

Of course the disks had to be virtual too, and physically they were just DOS reads/writes, i.e. int 21h function 3Fh (_read), to a preassigned and formatted file which contained the target operating system boot images and system files.

Edited by pseakins
change host to target. edit2: Added Ralf Brown's name.


17 hours ago, leuce said:

I just thought that the act of handing data over to the OS would be slower than processing the data the script's own memory space, so to speak.

I like the way you think - the OS is like government bureaucracy; the fastest way to deal with it is to have all your paperwork done, and leave nothing to chance ;)


