ter-pierre

How to merge files?

17 posts in this topic

Hi guys.

I posted a topic about this before, but I don't think I expressed myself correctly...

I have 2 files.

the first file contains ID;USER (4,400 lines)

the second file contains USER;GROUPMEMBER (more than 65,000 lines)

I need to merge these two files into one file with ID;USER;GROUPMEMBER

I tried reading file 1 line by line and searching file 2 for the USER field. It works, but the job takes too long (more than 40 hours). I used the code below:

$file2 = FileOpen("C:\tmp1-2.txt", 0)
While 1
   $ID_USER = FileReadLine($file2)
   If @error = -1 Then ExitLoop
   $SPLIT = StringSplit($ID_USER, ",")   ; each line is USER,ID
   $ID = $SPLIT[2]
   $USER = $SPLIT[1]
   GRUPO($USER)
WEnd
FileClose($file2)
Exit

Func GRUPO($USER)
   ; rescans the entire group file for every user
   $file1 = FileOpen("C:\tmp1-4.txt", 0)
   While 1
      $USER_GRP = FileReadLine($file1)
      If @error = -1 Then ExitLoop
      $SPLIT2 = StringSplit($USER_GRP, ";")   ; each line is USER;GROUP
      $USER1 = $SPLIT2[1]
      $GROUP = $SPLIT2[2]
      If $USER = $USER1 Then FileWriteLine("C:\TMP3-1.TXT", $ID & "|" & $USER & "|" & $GROUP)
   WEnd
   FileClose($file1)
EndFunc
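The nested loops above are the bottleneck: file 2 is rescanned from the top for every line of file 1, roughly 4,400 × 65,000 ≈ 286 million comparisons. A minimal Python sketch of the same pattern (tiny invented sample data, not the real files) shows the shape of the problem:

```python
# Nested scan: for every (ID, USER) pair from file 1, rescan all of file 2.
# With 4,400 x 65,000 real lines this is roughly 286 million comparisons.
file1 = [("1001", "alice"), ("1002", "bob")]              # ID, USER pairs
file2 = [("alice", "admins"), ("alice", "staff"),
         ("bob", "staff")]                                # USER, GROUP pairs

merged = []
for uid, user in file1:
    for user2, group in file2:                            # full rescan every time
        if user == user2:
            merged.append((uid, user, group))

print(merged)
```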

Does anyone have a better idea?

Thanks

Hi,

I have not tried this myself, but have a look at the _FileReadToArray function.

Andre


What about Windows without using AutoIt? It would be the same as driving a car without a steering wheel!


Is _FileReadToArray a UDF?

Where can I find it?


Hi,

#include <file.au3>

_FileReadToArray($sFilePath, ByRef $aArray)

It is included in the latest AutoIt version.

Andre




Thanks Andre, but using this function takes just as long (or longer).

I tried this code:

#include <File.au3>
Dim $FILE
$file1 = FileOpen("C:\tmp1-2.txt", 0)
_FileReadToArray("C:\tmp1-4.txt", $FILE)
While 1
   $USER_ID = FileReadLine($file1)
   If @error = -1 Then ExitLoop
   $SPLIT2 = StringSplit($USER_ID, ",")
   $USER = $SPLIT2[1]
   For $n = 1 To $FILE[0]
      $FILE[$n] = StringReplace($FILE[$n], $USER, $USER_ID)
      FileWrite("C:\tmp6.txt", $FILE[$n])
      If $n < $FILE[0] Then FileWrite("C:\tmp6.txt", @CRLF)
   Next
   If $USER = "aabreu" Then ExitLoop
WEnd
FileClose($file1)
Exit

Is that the right way to do it?

Is there another way to use the function?

Thanks


Hi,

As far as I can see, this is done correctly.

Perhaps someone else on the Forum has experience with such large files.

Andre




#7 ·  Posted (edited)

In fact, you're "reading" more and more of file 2 for each file 1 record, which is inefficient.

I think you're looking for an algorithm! Am I right?

Are your files sorted?

If not, the very first step is TO SORT the two files on the same "key". Here the master key will be USER: sort both files on that key.

When that is done (or if they are already sorted), what you need is a very classical algorithm to "merge" two sequential files.

- you can proceed using FileRead or _FileReadToArray

- file1 will be the "master file"

- I assumed there is exactly one record for each "ID/USER" in file1 (if not, you need to review the USER_file1 > USER_file2 case)

pseudocode for the algorithm:

open "result file" or create "new result array"
initialize EOF flags
initialize "other processing" (log file, counters...)

access the first record of each file/array; either read (or set $ifile1=1, $ifile2=1)

while not (end of both files/arrays)
    Select
      case USER_file1 = USER_file2   ; "normal case"
             write correct data (aggregated from both files) to fileresult (or array)
             "other processing"   ; eg increment counter for USER_file1 record#
             access new record in file2 (or increment $ifile2)
             if no more records in file 2 then set "EOF file2" flag
      case USER_file1 < USER_file2   ; no (more) data to process for USER_file1
             some "other processing"   ; (eg log number of fileresult records for USER_file1 key)
             access new record in file1 (or increment $ifile1)
             if no more records in file 1 then set "EOF file1" flag
             else initialize "other processing" for the new USER key from file1
      case USER_file1 > USER_file2   ; should normally not occur
             if error, issue error message and exit, or continue (ignore; see note 1)
                        write msg to log: USER_file2 key with "no record in file 1" situation
             access new record in file2 (or increment $ifile2)
             if no more records in file 2 then set "EOF file2" flag
    End select
Wend
write final info (eg counters...) to log etc.
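The pseudocode above is the classic sorted-merge join. For illustration only (the thread's scripts are AutoIt), here is a compact Python sketch with invented sample data; it assumes file1 has exactly one record per USER and simply stops when either side runs out, rather than tracking explicit EOF flags:

```python
def merge_join(file1, file2):
    """Merge two lists sorted by user: file1 holds (user, id) master
    records, file2 holds (user, group) detail records."""
    result, i, j = [], 0, 0
    while i < len(file1) and j < len(file2):
        u1, uid = file1[i]
        u2, group = file2[j]
        if u1 == u2:          # normal case: emit one joined record
            result.append((uid, u1, group))
            j += 1            # advance the detail side only
        elif u1 < u2:         # no (more) detail records for this master key
            i += 1
        else:                 # detail user missing from the master file:
            j += 1            # log-and-skip it (see note 1)
    return result

masters = [("alice", "1001"), ("bob", "1002")]
details = [("alice", "admins"), ("alice", "staff"), ("bob", "staff")]
print(merge_join(masters, details))
```

Each list index only ever moves forward, so every record on both sides is read exactly once.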

EDIT> * note 1: depending on the "quality" of your files, I suggest you proceed with the next records; this will give you the opportunity to browse both files entirely and find potential "structure errors" in them <EDIT

hope this helps

Edited by lupusbalo


Yes, the files are sorted.

Attached is a file with parts of the two files. The first 5 lines are the first five lines of tmp1-2.txt, and the following lines are the corresponding portion of tmp1-4.txt.

thanks lupusbalo

tmp.txt


My problem is that my tmp1-2.txt file has just one line for each USER, and my tmp1-4.txt file has more than one line for each USER.


#10 ·  Posted (edited)

my problem is that my tmp1-2.txt file has just one line for each USER, and my tmp1-4.txt file has more than one line for each USER.


OK, so the algorithm is fine.

file1 (or array) should be your TMP1-2

file2 (or array) should be your TMP1-4

Let me explain a little more why your algorithm is so loooooooooooooong :)

In your algorithm, for each record (1, then 2, then ...) in TMP1-2, you read 1, then 2, then 3, ..., then 59,998, then 59,999, then 60,000 records in TMP1-4 just to skip to the right record,

which means, at the end of the day, roughly 1,800 MILLION READS/ARRAY ACCESSES EDIT> which is just the sum of the well-known series of the first 60,000 integers: SIGMA(i, i=1 to 60000) <EDIT

You can probably see why it takes so long!

The algorithm I gave you reads each record of file2 only once. :)
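That 1,800 million figure is the triangular-number sum n(n+1)/2; a two-line check:

```python
# Sum of the first 60,000 integers: worst-case record accesses when
# file 2 is rescanned up to record i for each of its ~60,000 records.
n = 60000
accesses = n * (n + 1) // 2
print(accesses)  # 1800030000, i.e. roughly 1.8 billion
```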

Edited by lupusbalo


lupusbalo, thank you very much, but I don't know about EOF flags, I don't know about arrays, I don't know about strings...

I really have no experience with this kind of code...

If you can help me with this code, I'll buy you a beer when you come over...

please


I'll buy you a beer when you come over...

please


Just one??

Func _Appendfile($Source, $Target)
    ; append the entire contents of $Source to the end of $Target
    FileWriteLine($Target, FileRead($Source, FileGetSize($Source)))
EndFunc   ;==>_Appendfile
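That one-liner just appends the entire contents of one file onto another. For comparison, a Python equivalent (the function name is invented; binary mode avoids newline translation):

```python
def append_file(source, target):
    """Append the full contents of source onto the end of target."""
    with open(source, "rb") as src, open(target, "ab") as dst:
        dst.write(src.read())
```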



#13 ·  Posted (edited)

So here is the deal:

I figured that since it took 40 hours to run your program, you wouldn't mind doing a little extra work (learning something new)

to get your merge to run in 10 seconds (for 200,000 records on a 500 MHz computer).

Download GAWK (it is a UNIX tool ported to Windows, a.k.a. AWK).

Type this program and save it as "match101.awk".

MAKE SURE YOU CHANGE THE INPUT LINE FROM "C:/TMP1.TXT" to your file containing "a.ribeiro,1001020558" etc.

MAKE SURE YOU CHANGE THE OUTPUT LINE FROM "C:/TMP3.TXT" (forward slash).

Run the program like this (from a DOS box, if you installed AWK in c:\gnu, your program was saved in c:\, and the input file is c:/tmp2.txt):

"c:\gnu\gawk -fc:/match101.awk c:/tmp2.txt"

# code starts here

BEGIN { 
       print "BEGIN TIME.."strftime("%H:%M:%S") # TIME FOR YOUR BENCHMARK
       #This will load file 1 into memory 20,000 names in ram shouldn't be a big deal
       while (( getline loadrec < "C:/TMP1.TXT") > 0 )
          {recnum++
          split(loadrec,temparr,",") #I am using "," to split the record
          table[ temparr[1]] = temparr[2]
          }
      }
{
numfield = split($0,temparr,";") #using ";" to split the record change as you need it
if (temparr[1] in table )
   {
   line= table[temparr[1]]";"temparr[1]";"temparr[2]
   print line > "C:/TMP3.TXT"
   }
else
   {badname++
   print $0
   }

} 
END {
     print "temp1.txt records " recnum
     print "temp2.txt records " NR
     print "names not found   " badname+0
     print "END TIME......... "strftime("%H:%M:%S")# TIME FOR YOUR BENCHMARK
     }

# code ends here

If you are willing to do this (it should take about an hour), you will be doing your match in seconds.
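What makes the AWK script fast is that it is a hash join: the small file is loaded into an associative array once, and the big file is streamed through in a single pass, so each line is touched only once. The same idea sketched in Python (sample lines invented, matching the record formats quoted above):

```python
def hash_join(small_lines, big_lines):
    """small_lines are 'USER,ID' records; big_lines are 'USER;GROUP'.
    Build a dict from the small file, then one pass over the big one."""
    table = {}
    for line in small_lines:              # like awk's table[user] = id
        user, uid = line.split(",")
        table[user] = uid
    out, not_found = [], 0
    for line in big_lines:
        user, group = line.split(";")
        if user in table:
            out.append(table[user] + ";" + user + ";" + group)
        else:
            not_found += 1                # like awk's badname counter
    return out, not_found

rows, bad = hash_join(["a.ribeiro,1001020558"],
                      ["a.ribeiro;admins", "unknown;staff"])
print(rows, bad)  # ['1001020558;a.ribeiro;admins'] 1
```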

I LOVE AUTOIT3, BUT THERE ARE TOOLS THAT HANDLE FILES BETTER.

edit: used code tags to make it pretty

Edited by normeus

Share this post


Link to post
Share on other sites

Thank you all:

Andre

Lupusbalo

JdeB

normeus

I solved my problem with your help.

Below is the code that I used (surely not the best, but...)

#include <File.au3>
Dim $ARRAY12, $ARRAY14
$nn = 1
_FileReadToArray("C:\tmp1-2.txt", $ARRAY12)   ; master file: one USER;ID line per user
_FileReadToArray("C:\tmp1-4.txt", $ARRAY14)   ; detail file: several USER;GROUP lines per user
$file3 = FileOpen("C:\tmp6.txt", 2)
For $n = 1 To $ARRAY12[0]
   If $nn > $ARRAY14[0] Then ExitLoop          ; no detail records left
   $SPLIT_FILE12 = StringSplit($ARRAY12[$n], ";")
   $USER_FILE12 = $SPLIT_FILE12[1]
   $SPLIT_FILE14 = StringSplit($ARRAY14[$nn], ";")
   $USER_FILE14 = $SPLIT_FILE14[1]
   Select
      Case $USER_FILE12 = $USER_FILE14
         ; write one output line per matching detail record
         While $nn <= $ARRAY14[0]
            FileWriteLine($file3, $SPLIT_FILE12[2] & ";" & $SPLIT_FILE12[1] & ";" & $SPLIT_FILE14[2] & ";")
            $nn = $nn + 1
            If $nn > $ARRAY14[0] Then ExitLoop
            $SPLIT_FILE14 = StringSplit($ARRAY14[$nn], ";")
            $USER_FILE14 = $SPLIT_FILE14[1]
            If $USER_FILE12 <> $USER_FILE14 Then ExitLoop
         WEnd
      Case $USER_FILE12 < $USER_FILE14
         ; no detail records for this user: just move on to the next master record
      Case $USER_FILE12 > $USER_FILE14
         MsgBox(0, "test", "ERRO!!!")          ; detail user missing from the master file
   EndSelect
Next
FileClose($file3)

When you guys come over, I'll buy all the beers!!!


#15 ·  Posted (edited)

merge_stuff.zip

@ter-pierre

EDIT> I didn't see your last post before "sending" mine!

1- congratulations

2- the following may help you anyway

3- I forgot the beer!! (there should be many!!!!) :)

4- disregard the immediately following line; as you seem to actually understand something about programming, I'm sorry :)

<EDIT

It would be rather difficult if you didn't know anything about programming

So, it would be unusual, but I'll provide the solution:

the final merge script, a script to generate test files, the final result, and the associated logfile.

The test files were 5,000 and 75,000 records respectively (approximately YOUR figures).

The final merge runs in a bit more than a minute on a P4 2.6 GHz!!!! Far from your 40 hours.

Actually, the file generation takes 4 times longer, because of a lot of "random" calls (it could have been made better, but... that's it for now!)

EDIT> I used file processing (vs. arrays) because it works even with HUGE files (10^6 or more records), where array processing will probably run out of memory (the algorithms are identical; only how the data is accessed changes) <EDIT

attached

- test files generation script

- extract of test file file1 (out of 5000 records)

- extract of test file file2 (out of 75000 records)

- merge script

- extract of resulting file

- extract of the logfile

Edited by lupusbalo


#16 ·  Posted (edited)

Not a "common day-to-day" need, but just for fun, a test merge on big files:

start logfile>

2005-02-15 00:05:16 : for key:AAADIUSKAU records processed: 12

2005-02-15 00:05:16 : for key:AAAFUMZOGSQJFZ records processed: 26

.................... rest of logfile

2005-02-15 00:39:04 : for key:ZZZNYLUSFDUNZ records processed: 29

2005-02-15 00:39:04 : for key:ZZZOBGOEUOIWVAK records processed: 4

2005-02-15 00:39:04 : for key:ZZZORBMJCONR records processed: 22

2005-02-15 00:39:04 : for key:ZZZSBUSKP records processed: 3

2005-02-15 00:39:05 : C:\Mes documents\Auto_IT scripts\TMP1-2.txt: 149 999 records processed (*)

2005-02-15 00:39:15 : C:\Mes documents\Auto_IT scripts\TMP1-4.txt: 2 249 048 records processed (*)

2005-02-15 00:39:15 : merge duration: 2028 SEC. (*)

<End Logfile

(*) "real file"; only some editing (red, bold, thousands separators) was done to improve clarity

2.4 million records in ~34 MIN

Who says AutoIt is slow??

Edited by lupusbalo


Every (big) script can be sped up by coding it efficiently.

