BlueScreen

Reading from files and writing to files algorithm

16 posts in this topic

#1 ·  Posted (edited)

Hi Guys,

First, thanks for your help.

I have written a function which receives 2 files as parameters and removes from the Source file the data which exists in the Data file.

It all works fine, but very slowly (5 minutes). Is my algorithm not efficient enough?

Here is what I did inside the function:

1) Read the Data file into an array using _FileReadToArray

2) Read the SRC file into another array using _FileReadToArray

3) Open a temp file

4) Run in a loop (as many times as there are lines in the SRC file) and check (for each SRC line), using StringInStr (in a While), whether the line in the SRC file contains strings from the Data file.

5) If all the lines in the Data file were read and there is no match, then the line can be written to the temp file.

6) All this runs until there are no more lines in the SRC file.

7) Close the temp file, delete the SRC, move the temp file to SRC

8) At this point, there are no strings from the Data file in the SRC file. Continue with the next Data file

Now, my SRC file contains around 8000 lines. I also have 6 Data files of 30 lines each. So, going over all the combinations (8000 x 6 x 30, around 1,440,000 comparisons) takes about 5 minutes.

Is there a way to do it better? Here is my code:

For $w=1 to $NumOfDataFiles
     RemoveDatafromSrc ($SRCfile,$Data[$w-1][0])
Next

Dim $Temp[1]
Dim $SrcValue
Dim $DataValue
Global $LineInData=1
Global $TempID=1
#include <file.au3>

Func RemoveDatafromSrc ($SrcFile,$DataFile)
    
    If Not _FileReadToArray($DataFile,$DataValue) Then Exit
        
    If Not _FileReadToArray($SrcFile,$SrcValue) Then Exit
        
    For $e=1 to $DataValue[0]
        $DataValue[$e]=StringLeft($DataValue[$e],4); I need only the 4 left chars
    Next
    
    $TmpFile = FileOpen ("temp.tmp",2)
    
    For $LineInSrc=1 to $SrcValue[0]; Lines in SRC
        $LineInData=1; restart the scan of the DATA lines for every SRC line
        While $LineInData <= $DataValue[0]; for each Src line, need to check all DATA line
            If StringinStr ($SrcValue[$LineInSrc], $DataValue[$LineInData] & ":") <> 0 then
   ;Data line found
                ExitLoop
            EndIf
            If $LineInData=$DataValue[0] Then
                FileWriteLine($TmpFile,$SrcValue[$LineInSrc] & @LF)
                $LineInData=1
                ExitLoop
            Else
                $LineInData=$LineInData+1
            EndIf
        WEnd
    Next
    FileClose ($TmpFile)
    FileDelete ($SrcFile)
    FileMove ("temp.tmp", $SrcFile,1)
EndFunc
Edited by BlueScreen




#2 ·  Posted (edited)

As I recall, the _Array* UDFs weren't the fastest things in the world back when I looked at them.

Try to get the files into memory and operate on them there.

I use (untested / no error checking pseudo pseudo code!)

$fh = FileOpen($fn,0)
$contents = FileRead($fh,FileGetSize($fn)); use handles instead of filenames
FileClose($fh)
;Depending on filesize, StringSplit() or StringLeft() may be faster; i.e.
;test in your environment and let us know what you find.

$line = StringSplit($contents,@LF)
For $i = 1 to $line[0]
;_DoStuff($line[$i])
Next

or :

While StringLen($contents) > 0
 $pos = StringInStr($contents,@LF)
 If $pos = 0 Then $pos = StringLen($contents); last line has no trailing @LF
 $line = StringLeft($contents,$pos)
 $contents = StringTrimLeft($contents,$pos)
 _DoStuff($line)
Wend

Edited by flyingboz

Reading the help file before you post... Not only will it make you look smarter, it will make you smarter.


As I recall, the _Array* UDFs weren't the fastest things in the world back when I looked at them.

You mean the "_FileReadToArray"? ;)

I don't get it... o:) Where to put the StringSplit? :lmao:


You mean the "_FileReadToArray"? ;)

mebbe i should have said _*array*(*) ...clearer?

I don't get it... o:) Where to put the StringSplit? :lmao:

In Example 1, the StringSplit is there - example shows using the builtin Function to create an array of a variable.

In example 2, StringInStr(), StringLeft() and StringTrimLeft() are used to get the data, so StringSplit is not required - another way of skinning the same cat, maybe it's faster, maybe it ain't.

If your data is fixed length (e.g. each line is 80 chars long), you could use something like this:

$line_length = 80
$file_size = StringLen($contents); $contents holds the whole file, as in the examples above
$file_pos = 1
While $file_pos <= $file_size
  $line = StringMid($contents,$file_pos,$line_length)
  _DoStuff($line)
  $file_pos = $file_pos + $line_length + 1; the +1 skips a single @LF line terminator
Wend

Reading the help file before you post... Not only will it make you look smarter, it will make you smarter.

Sorry, i've not had any negative experiences with the array udf's myself, and my solution uses them also; but should be considerably faster, and i'll explain why. First, here's my code to replace your function:

Dim $SrcValue
Dim $DataValue
#include <file.au3>
#include<array.au3>
Func RemoveDatafromSrc ($SrcFile,$DataFile)
    
    If Not _FileReadToArray($DataFile,$DataValue) Then Exit
        
    If Not _FileReadToArray($SrcFile,$SrcValue) Then Exit
    $StringOfSrc = _ArrayToString($SrcValue,"$",1,$SrcValue[0])
    For $e=1 to $DataValue[0]
    if StringInStr($StringOfSrc,StringLeft($DataValue[$e],4)) Then
        For $f = 1 To $SrcFile[0]
            If StringInStr($SrcValue[$f],StringLeft($DataValue[$e],4) & ":") Then _ArrayDelete($SrcValue,$f)
        Next
    EndIf
    Next
    FileDelete ($SrcFile)
_FileWriteFromArray($SrcFile,$SrcValue,1,UBound($SrcValue))
EndFunc

Now a few things that i did to speed it up.

First, i took out your For loop that was replacing the DataValue array elements with only the first 4 characters of each element. That makes a lot of unnecessary assignments, because we can search for just the substring that you want and ignore the rest of the line without making any assignments.

Then, you were searching each line of the data for each line of the source, which works out to Source(lines) x Data(lines) comparisons. To trim the fat on that one, i made a single string from all of the elements in the source array, and only did line by line comparisons if the value i'm searching for was already confirmed by StringInStr() to be somewhere in that string. So in the worst case scenario, with 8000 lines of source and 6000 lines of data, if there are NO values that are in both, yours is doing 48,000,000 evaluations, where mine does 6000 (one StringInStr() over the joined string per data line).

I also changed the way that the output is created, removing the need for a temp file. By deleting the lines that contain the data we don't want from the array, we end up with an array of good data that holds everything we want in the end file. So continuing the example above, if there were NO values present in both arrays with the sizes given, you'd be writing to the temp file 8000 times, then copying that file over the original source. The way that i've changed it, there is a single file write at the end, regardless of how many hits there were.

The changes i've made should be enough to see a good cut in execution time, but this is not the only way that you could achieve the same result.
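
As one illustration of another way to get the same result, here is a minimal sketch that makes a single pass over the source lines and does one lookup per line. It is not the code from this thread: the function name _RemoveByAddress and all variable names are made up for the example, and it assumes every line starts with a 4-character address followed by ":" (as in "1111: 0000 0000 ..."). If the address can appear elsewhere in the line, this sketch does not apply as written.

```autoit
#include <File.au3>

; Hypothetical helper, not from the thread: keeps only the source lines whose
; leading 4-character address does not appear in the data file.
; Assumption: every line starts with the address, e.g. "1111: 0000 0000 ...".
Func _RemoveByAddress($sSrcFile, $sDataFile)
    Local $aData, $aSrc
    If Not _FileReadToArray($sDataFile, $aData) Then Return SetError(1, 0, 0)
    If Not _FileReadToArray($sSrcFile, $aSrc) Then Return SetError(2, 0, 0)

    ; Join all addresses into one delimited string: "|1111|2222|...|"
    Local $sKeys = "|"
    For $i = 1 To $aData[0]
        If StringLen($aData[$i]) >= 4 Then $sKeys &= StringLeft($aData[$i], 4) & "|"
    Next

    ; One pass over the source: one lookup per source line
    Local $sOut = ""
    For $i = 1 To $aSrc[0]
        ; Skip the line when its address is followed by ":" and is in the key list
        If StringMid($aSrc[$i], 5, 1) = ":" And StringInStr($sKeys, "|" & StringLeft($aSrc[$i], 4) & "|") Then ContinueLoop
        $sOut &= $aSrc[$i] & @CRLF
    Next

    ; Single write at the end; mode 2 overwrites the existing file
    Local $hOut = FileOpen($sSrcFile, 2)
    If $hOut = -1 Then Return SetError(3, 0, 0)
    FileWrite($hOut, $sOut)
    FileClose($hOut)
    Return 1
EndFunc
```

It would be called once per data file, for example _RemoveByAddress("source.txt", "data.txt"). Whether it is actually faster on your files is something to measure rather than assume.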


1100111 00001011101111 00011101101111 00010111100100 00001111110100 00110111110010 00101101111001 0011100i didn't make up this form of encryption, but i like it.credit to the lvl 6 challenge on arcanum.co.nz


@cameronsdad,

Take a look at my post in this related (duplicate?) thread:

http://www.autoitscript.com/forum/index.ph...ndpost&p=140602

The first post in that other thread mentions a rather simple task:

File in:

This is line number 1

This is line number 2

This is line number 3

File out:

This is line number 1

This is line number 3

Task = remove all lines with 'This is line number 2' and the CR, LF or CRLF

For that task, I posted this possible solution:

http://www.autoitscript.com/forum/index.ph...ndpost&p=140987

I see no reason/need to use arrays here, just loop thru all of the StringReplace statements that you want and output the file once. Am I missing something? Is the file too big to put into one variable?



I actually thought of the same approach, but decided against it, as lines in the source will almost definitely vary in length (that is actually just an assumption on my part, that the lengths will vary), and because he wants to remove the whole line, that could work out to more work. That was the way i was thinking of going at first, but decided against it because i don't know what his data looks like, and wanted to make sure that my solution worked without much follow up. That's also why i wanted to make sure to explain to him that the method suggested wasn't the only way to do it, but could give him ideas to better tune his script to his specific data.

1100111 00001011101111 00011101101111 00010111100100 00001111110100 00110111110010 00101101111001 0011100i didn't make up this form of encryption, but i like it.credit to the lvl 6 challenge on arcanum.co.nz


#8 ·  Posted (edited)

I actually thought of the same approach, but...

...you were brighter than that.

Apparently I paid too much attention to the input/output in the first post of that other thread without noticing that the string being searched for could be (and probably is) a subset of one line in the input file and not the entire line itself... chased the wrong rabbit again!

Edited by herewasplato


Hi,

Once again, this is way, way, quicker with DOS if the files are large in particular.

You can get the second file lines into a long string, or tell DOS the file with the exclusion lines.

See "_DeleteFoundLineDOS" script in link from my signature, either dOsComs or my bookmarks.

Best, Randall


looking at your approach, i'm not sure how it would be faster to parse the 6000 4 character strings into exclude parameters then check each line in the new file for each of those parameters. could you write up an example using your UDF's to do as you're suggesting? I'm not disagreeing that your way may be faster, i just don't see a way to implement it for this situation that would be faster, and would be interested to see it in action.

1100111 00001011101111 00011101101111 00010111100100 00001111110100 00110111110010 00101101111001 0011100i didn't make up this form of encryption, but i like it.credit to the lvl 6 challenge on arcanum.co.nz


#11 ·  Posted (edited)

hey,

Did you try the Example I have linked to?

Alternately, you would still need to nominate your main file and your delete file;

try

"_DOSDeleteFoundLineEx2.au3"

Best, Randall

;_DOSDeleteFoundLineEx2.au3
;to Delete all file lines containing any of Data1, Data2 ...etc [separated in the string by spaces]; or..
; Else look into "findstr" in Dos and retrieve the strings to avoid from a file instead (/G:[Filename])
; 80Mb file in 40secs
#include<DOSComs.au3>
;$s_FileOpen1=FileOpenDialog("Choose file",@ScriptDir,"Images (*.jpg;*.bmp)", 1 + 4 )
$s_DeletemarkerStringsFile=@ScriptDir&"\DeleteLines.txt"
$s_DeleteFile=@ScriptDir&"\Table.txt"
_DeleteFound($s_DeleteFile,$s_DeletemarkerStringsFile)

func _DeleteFound($s_DeleteFile,$s_DeletemarkerStringsFile)
    Local $s_DelString=""; must exist before &= is used on it
    $s_FileOpen=FileOpen($s_DeletemarkerStringsFile,0)
    While 1
        $s_Line=FileReadLine($s_FileOpen)
        If @error = -1 Then ExitLoop; check @error straight after the read
        $s_DelString &= $s_Line&" "
    WEnd
    FileClose($s_FileOpen)
    ;$s_Exclude="Data1 Data2 Data3 Data4 Data5 Data6 Data7 Data8 Data9 Data10 Data11"
    _DeleteFoundLineDOS($s_DeleteFile,$s_DelString)
    if @error then
        MsgBox(0,"","Error, So FileName not Exists="&@error)
    Else
        RunWait("Notepad.exe " & @ScriptDir&"\Table1.txt",@ScriptDir,@SW_SHOW)
    EndIf
EndFunc ;==>_DeleteFound

Exit

fileLineDelete2.au3
Edited by randallc


Hi Cameronsdad,

Thanks Thanks Thanks for your support.

Regarding your code you have posted:

Shouldn't this line be
For $f = 1 To $SrcValue[0]
instead of

For $f = 1 To $SrcFile[0]

?

Also, it doesn't seem to work, and I cannot see why.

Attached are my code, my source file and my data file.

C:\parser.au3 (20) : ==> Array variable has incorrect number of subscripts or subscript dimension range exceeded.:

If StringInStr($SrcValue[$f],StringLeft($DataValue[$e],4) & ":") Then _ArrayDelete($SrcValue,$f)

If StringInStr(^ ERROR

Helllllppppppppp :lmao:

source.txt

data.txt


BlueScreen,

This is one line from your data file:

1111: 0000 0000 0000 0000 0000 0000 0000 0000

That one line is also in your source file:

1111: 0000 0000 0000 0000 0000 0000 0000 0000

For each complete line in the data file - you want to remove that entire line from the source file. Right?

Please let me know....



Exactly.

The issue is that, supposing

<<1111: 0000 0000 0000 0000 0000 0000 0000 0000>> is a line in the data file and

<<2222: 1111 0000 0000 0000 0000 0000 0000 0000>> is a line in the source file,

I will NOT want to remove it from the source file, since what interests me is the address (1111) and not all the stuff after it. This is why I have also added the ":" to the StringInStr.
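
A minimal sketch of why the trailing ":" matters, using the two sample lines above. The variable names are made up for illustration only.

```autoit
; Sample lines taken from the post above; variable names are illustrative only.
Local $sDataLine = "1111: 0000 0000 0000 0000 0000 0000 0000 0000"
Local $sSrcLine = "2222: 1111 0000 0000 0000 0000 0000 0000 0000"

; The address is the first 4 characters of the data line.
Local $sAddress = StringLeft($sDataLine, 4)

; Without the colon, "1111" is found inside the payload of the source line,
; so StringInStr returns a non-zero position and the line would be removed by mistake.
ConsoleWrite("Without colon: " & StringInStr($sSrcLine, $sAddress) & @CRLF)

; With the colon, only a line whose address is "1111" can match,
; so StringInStr returns 0 here and the line is kept.
ConsoleWrite("With colon: " & StringInStr($sSrcLine, $sAddress & ":") & @CRLF)
```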


#15 ·  Posted (edited)

Edit: The code below now works, but only for files that terminate each line of data with CRCRLF, as the source file from post 12 seems to do (as shown by SciTE with EOL turned on).

[Thanks Valik for your help on this, it was driving me crazy - I know, short trip.]

;read the entire contents of the source file into the variable
$SourceInfo = FileRead('c:\temp\source.txt', FileGetSize('c:\temp\source.txt'))

;open the data file
$DataFile = FileOpen("c:\temp\data.txt", 0)

;read in lines of the data file until the EOF is reached
While 1
    $Line = FileReadLine($DataFile)
    If @error = -1 Then ExitLoop
    $ReplaceIt = $Line & @CR & @CRLF; each line in the source ends with CR CR LF
 ;MsgBox(0,"To be replaced",$ReplaceIt)
    $SourceInfo = StringReplace($SourceInfo, $ReplaceIt, "")
    MsgBox(0, "Lines replaced", @extended)
WEnd

FileClose($DataFile)

;reopen the source file for writing (mode 2 erases the old contents)
$OutFile = FileOpen('c:\temp\source.txt', 2)
FileWrite($OutFile, $SourceInfo)
FileClose($OutFile)
It should be faster than using arrays. Not that I'm against using arrays, but the original post asked for faster code. This method should be much faster.

The code above is just to show the concept - you need to add more error checking and use filehandles where I used full paths.

Edited by herewasplato


you're right about $SrcValue instead of $SrcFile, sorry about that. What's going on is that as lines are removed, the UBound of the array changes, but the end value of the For loop doesn't. so once you remove even a single line, the last iteration of the For loop will fail. poor practice on my side there. what we should do is change it to a While loop instead of a For, like so:

Dim $SrcValue
Dim $DataValue
#include <file.au3>
#include<array.au3>
Func RemoveDatafromSrc ($SrcFile,$DataFile)
    
    If Not _FileReadToArray($DataFile,$DataValue) Then Exit
        
    If Not _FileReadToArray($SrcFile,$SrcValue) Then Exit
    $StringOfSrc = _ArrayToString($SrcValue,"$",1,$SrcValue[0])
    For $e=1 to $DataValue[0]
    if StringInStr($StringOfSrc,StringLeft($DataValue[$e],4)) Then
        $f = 1
        While $f < UBound($SrcValue); the last valid index is UBound - 1
            If StringInStr($SrcValue[$f],StringLeft($DataValue[$e],4) & ":") Then 
                _ArrayDelete($SrcValue,$f); next element slides into $f, so don't advance
            Else
                $f = $f + 1
            EndIf
        WEnd
    EndIf
    Next
    FileDelete ($SrcFile)
_FileWriteFromArray($SrcFile,$SrcValue,1,UBound($SrcValue)-1)
EndFunc
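
Another common way to avoid the moving upper bound is to walk the array from the end, so a deletion never shifts an element that has not been checked yet. The following is only a sketch with made-up sample data and variable names, not a drop-in replacement for the function above.

```autoit
#include <Array.au3>

; Illustrative data only; element 0 holds the count, as _FileReadToArray produces.
Local $aSrc[5] = [4, "1111: 0000", "2222: 1111", "1111: ffff", "3333: 0000"]
Local $sKey = "1111:"

; Walk from the last element down to 1. The start value is evaluated once,
; and deleting index $f only moves elements that were already checked.
For $f = UBound($aSrc) - 1 To 1 Step -1
    If StringInStr($aSrc[$f], $sKey) Then _ArrayDelete($aSrc, $f)
Next
$aSrc[0] = UBound($aSrc) - 1 ; keep the stored count in sync

; Show the surviving lines (elements 1 and up), one per line
ConsoleWrite(_ArrayToString($aSrc, @CRLF, 1) & @CRLF)
```

With this sample data, the two lines whose address is 1111 are removed, while "2222: 1111" stays because its "1111" is not followed by a colon.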

1100111 00001011101111 00011101101111 00010111100100 00001111110100 00110111110010 00101101111001 0011100i didn't make up this form of encryption, but i like it.credit to the lvl 6 challenge on arcanum.co.nz
