ter-pierre

How to merge files?

17 posts in this topic

Hi guys.

I posted a topic about this before, but I don't think I expressed myself correctly...

I have 2 files.

the first file contains ID;USER (4,400 lines)

the second file contains USER;GROUPMEMBER (more than 65,000 lines)

I need to merge these two files into one file with ID;USER;GROUPMEMBER

I tried reading file 1 line by line and searching file 2 for the USER field. It works, but the job takes too long (more than 40 hours). I used the code below:

$file2 = FileOpen("C:\tmp1-2.txt", 0)
While 1
   $ID_USER = FileReadLine($file2)
   If @error = -1 Then ExitLoop
   $SPLIT = StringSplit($ID_USER, ",")   ; each line is USER,ID
   $ID = $SPLIT[2]
   $USER = $SPLIT[1]
   GRUPO($USER)
WEnd
FileClose($file2)
Exit

Func GRUPO($USER)
   ; rescans the entire group file for every user
   $file1 = FileOpen("C:\tmp1-4.txt", 0)
   While 1
      $USER_GRP = FileReadLine($file1)
      If @error = -1 Then ExitLoop
      $SPLIT2 = StringSplit($USER_GRP, ";")   ; each line is USER;GROUP
      $USER1 = $SPLIT2[1]
      $GROUP = $SPLIT2[2]
      If $USER = $USER1 Then FileWriteLine("C:\TMP3-1.TXT", $ID & "|" & $USER & "|" & $GROUP)
   WEnd
   FileClose($file1)
EndFunc
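The nested loops above are the bottleneck: file 2 is rescanned from the top for every line of file 1, roughly 4,400 × 65,000 ≈ 286 million comparisons. A minimal Python sketch of the same pattern (tiny invented sample data, not the real files) shows the shape of the problem:

```python
# Nested scan: for every (ID, USER) pair from file 1, rescan all of file 2.
# With 4,400 x 65,000 real lines this is roughly 286 million comparisons.
file1 = [("1001", "alice"), ("1002", "bob")]              # ID, USER pairs
file2 = [("alice", "admins"), ("alice", "staff"),
         ("bob", "staff")]                                # USER, GROUP pairs

merged = []
for uid, user in file1:
    for user2, group in file2:                            # full rescan every time
        if user == user2:
            merged.append((uid, user, group))

print(merged)
```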

Does anyone have a better idea?

Thanks

Hi,

I have not tried this myself, but have a look at the _FileReadToArray function.

Andre


What about Windows without using AutoIt? It would be the same as driving a car without a steering wheel!


Is _FileReadToArray a UDF?

Where can I find it?


Hi,

#include <file.au3>

_FileReadToArray($sFilePath, ByRef $aArray)

It is included in the latest AutoIt version.

Andre




Thanks Andre, but using this function takes just as long (or longer).

I tried this code:

#include <File.au3>
Dim $FILE
$file1 = FileOpen("C:\tmp1-2.txt", 0)
_FileReadToArray("C:\tmp1-4.txt", $FILE)
While 1
   $USER_ID = FileReadLine($file1)
   If @error = -1 Then ExitLoop
   $SPLIT2 = StringSplit($USER_ID, ",")
   $USER = $SPLIT2[1]
   For $n = 1 To $FILE[0]
      $FILE[$n] = StringReplace($FILE[$n], $USER, $USER_ID)
      FileWrite("C:\tmp6.txt", $FILE[$n])
      If $n < $FILE[0] Then FileWrite("C:\tmp6.txt", @CRLF)
   Next
   If $USER = "aabreu" Then ExitLoop
WEnd
FileClose($file1)
Exit

Is that the right way to do it?

Is there another way to use the function?

Thanks


Hi,

As far as I can see, this is done correctly.

Perhaps someone else on the Forum has experience with such large files.

Andre




#7 ·  Posted (edited)

In fact, you're "reading" more and more of file 2 for each file 1 record, which is inefficient.

I think you're looking for an algorithm! Am I right?

Are your files sorted?

If not, the very first step is TO SORT the two files on the same "key". Here the master key will be USER: sort both files on that key.

When that is done (or if they are already sorted), what you need is a very classical algorithm to "merge" two sequential files.

- you can proceed using FileRead or _FileReadToArray

- file1 will be the "master file"

- I assumed there is exactly one record for each "ID/USER" in file1 (if not, you need to review the USER_file1 > USER_file2 case)

pseudocode for the algorithm:

open "result file" or create "new result array"
initialize EOF flags
initialize "other processing" (log file, counters...)

access the first record of each file/array; either read (or set $ifile1=1, $ifile2=1)

while not (end of both files/arrays)
    Select
      case USER_file1 = USER_file2   ; "normal case"
             write correct data (aggregated from both files) to fileresult (or array)
             "other processing"   ; eg increment counter for USER_file1 record#
             access new record in file2 (or increment $ifile2)
             if no more records in file 2 then set "EOF file2" flag
      case USER_file1 < USER_file2   ; no (more) data to process for USER_file1
             some "other processing"   ; (eg log number of fileresult records for USER_file1 key)
             access new record in file1 (or increment $ifile1)
             if no more records in file 1 then set "EOF file1" flag
             else initialize "other processing" for the new USER key from file1
      case USER_file1 > USER_file2   ; should normally not occur
             if error, issue error message and exit, or continue (ignore; see note 1)
                        write msg to log: USER_file2 key with "no record in file 1" situation
             access new record in file2 (or increment $ifile2)
             if no more records in file 2 then set "EOF file2" flag
    End select
Wend
write final info (eg counters...) to log etc.
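The pseudocode above is the classic sorted-merge join. For illustration only (the thread's scripts are AutoIt), here is a compact Python sketch with invented sample data; it assumes file1 has exactly one record per USER and simply stops when either side runs out, rather than tracking explicit EOF flags:

```python
def merge_join(file1, file2):
    """Merge two lists sorted by user: file1 holds (user, id) master
    records, file2 holds (user, group) detail records."""
    result, i, j = [], 0, 0
    while i < len(file1) and j < len(file2):
        u1, uid = file1[i]
        u2, group = file2[j]
        if u1 == u2:          # normal case: emit one joined record
            result.append((uid, u1, group))
            j += 1            # advance the detail side only
        elif u1 < u2:         # no (more) detail records for this master key
            i += 1
        else:                 # detail user missing from the master file:
            j += 1            # log-and-skip it (see note 1)
    return result

masters = [("alice", "1001"), ("bob", "1002")]
details = [("alice", "admins"), ("alice", "staff"), ("bob", "staff")]
print(merge_join(masters, details))
```

Each list index only ever moves forward, so every record on both sides is read exactly once.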

EDIT> * note 1: depending on the "quality" of your files, I suggest you proceed with the next records; this will give you the opportunity to browse both files entirely and find potential "structure errors" in them <EDIT

hope this helps

Edited by lupusbalo


Yes, the files are sorted.

Attached is a file with parts of the two files. The first 5 lines are the first five lines of tmp1-2.txt, and the following lines are the corresponding portion of tmp1-4.txt.

thanks lupusbalo

tmp.txt


My problem is that my tmp1-2.txt file has just one line for each USER, and my tmp1-4.txt file has more than one line for each USER.


#10 ·  Posted (edited)

my problem is that my tmp1-2.txt file has just one line for each USER, and my tmp1-4.txt file has more than one line for each USER.


OK, so the algorithm is fine.

file1 (or array) should be your TMP1-2

file2 (or array) should be your TMP1-4

Let me explain a little more why your algorithm is so loooooooooooooong :)

In your algorithm, for each record (1, then 2, then ...) in TMP1-2, you read 1, then 2, then 3, ..., then 59,998, then 59,999, then 60,000 records in TMP1-4 just to skip to the right record,

which means, at the end of the day, roughly 1,800 MILLION READS/ARRAY ACCESSES EDIT> which is just the sum of the well-known series of the first 60,000 integers: SIGMA(i, i=1 to 60000) <EDIT

You can probably see why it takes so long!

The algorithm I gave you reads each record of file2 only once. :)
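That 1,800 million figure is the triangular-number sum n(n+1)/2; a two-line check:

```python
# Sum of the first 60,000 integers: worst-case record accesses when
# file 2 is rescanned up to record i for each of its ~60,000 records.
n = 60000
accesses = n * (n + 1) // 2
print(accesses)  # 1800030000, i.e. roughly 1.8 billion
```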

Edited by lupusbalo


lupusbalo, thank you very much, but I don't know about EOF flags, I don't know about arrays, I don't know about strings...

I really have no experience with this kind of code...

If you can help me with this code, I'll buy you a beer when you come over...

please


I'll buy you a beer when you come over...

please


Just one??

Func _Appendfile($Source, $Target)
    ; append the entire contents of $Source to the end of $Target
    FileWriteLine($Target, FileRead($Source, FileGetSize($Source)))
EndFunc   ;==>_Appendfile
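That one-liner just appends the entire contents of one file onto another. For comparison, a Python equivalent (the function name is invented; binary mode avoids newline translation):

```python
def append_file(source, target):
    """Append the full contents of source onto the end of target."""
    with open(source, "rb") as src, open(target, "ab") as dst:
        dst.write(src.read())
```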



#13 ·  Posted (edited)

So here is the deal:

I figured that since it took 40 hours to run your program, you wouldn't mind doing a little extra work (learning something new)

to get your merge to run in 10 seconds (for 200,000 records on a 500 MHz computer).

Download GAWK (it is a UNIX tool ported to Windows, a.k.a. AWK).

Type this program and save it as "match101.awk".

MAKE SURE YOU CHANGE THE INPUT LINE FROM "C:/TMP1.TXT" to your file containing "a.ribeiro,1001020558" etc.

MAKE SURE YOU CHANGE THE OUTPUT LINE FROM "C:/TMP3.TXT" (forward slash).

Run the program like this (from a DOS box, if you installed AWK in c:\gnu, your program was saved in c:\, and the input file is c:/tmp2.txt):

"c:\gnu\gawk -fc:/match101.awk c:/tmp2.txt"

# code starts here

BEGIN { 
       print "BEGIN TIME.."strftime("%H:%M:%S") # TIME FOR YOUR BENCHMARK
       #This will load file 1 into memory 20,000 names in ram shouldn't be a big deal
       while (( getline loadrec < "C:/TMP1.TXT") > 0 )
          {recnum++
          split(loadrec,temparr,",") #I am using "," to split the record
          table[ temparr[1]] = temparr[2]
          }
      }
{
numfield = split($0,temparr,";") #using ";" to split the record change as you need it
if (temparr[1] in table )
   {
   line= table[temparr[1]]";"temparr[1]";"temparr[2]
   print line > "C:/TMP3.TXT"
   }
else
   {badname++
   print $0
   }

} 
END {
     print "temp1.txt records " recnum
     print "temp2.txt records " NR
     print "names not found   " badname+0
     print "END TIME......... "strftime("%H:%M:%S")# TIME FOR YOUR BENCHMARK
     }

# code ends here

If you are willing to do this (it should take about an hour), you will be doing your match in seconds.
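What makes the AWK script fast is that it is a hash join: the small file is loaded into an associative array once, and the big file is streamed through in a single pass, so each line is touched only once. The same idea sketched in Python (sample lines invented, matching the record formats quoted above):

```python
def hash_join(small_lines, big_lines):
    """small_lines are 'USER,ID' records; big_lines are 'USER;GROUP'.
    Build a dict from the small file, then one pass over the big one."""
    table = {}
    for line in small_lines:              # like awk's table[user] = id
        user, uid = line.split(",")
        table[user] = uid
    out, not_found = [], 0
    for line in big_lines:
        user, group = line.split(";")
        if user in table:
            out.append(table[user] + ";" + user + ";" + group)
        else:
            not_found += 1                # like awk's badname counter
    return out, not_found

rows, bad = hash_join(["a.ribeiro,1001020558"],
                      ["a.ribeiro;admins", "unknown;staff"])
print(rows, bad)  # ['1001020558;a.ribeiro;admins'] 1
```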

I LOVE AUTOIT3, BUT THERE ARE TOOLS THAT HANDLE FILES BETTER.

edit: used code tags to make it pretty

Edited by normeus

Share this post


Link to post
Share on other sites

Thank you all:

Andre

Lupusbalo

JdeB

normeus

I solved my problem with your help.

Below is the code that I used (surely not the best, but...)

#include <File.au3>
Dim $ARRAY12, $ARRAY14
$nn = 1
_FileReadToArray("C:\tmp1-2.txt", $ARRAY12)   ; master file: one USER;ID line per user
_FileReadToArray("C:\tmp1-4.txt", $ARRAY14)   ; detail file: several USER;GROUP lines per user
$file3 = FileOpen("C:\tmp6.txt", 2)
For $n = 1 To $ARRAY12[0]
   If $nn > $ARRAY14[0] Then ExitLoop          ; no detail records left
   $SPLIT_FILE12 = StringSplit($ARRAY12[$n], ";")
   $USER_FILE12 = $SPLIT_FILE12[1]
   $SPLIT_FILE14 = StringSplit($ARRAY14[$nn], ";")
   $USER_FILE14 = $SPLIT_FILE14[1]
   Select
      Case $USER_FILE12 = $USER_FILE14
         ; write one output line per matching detail record
         While $nn <= $ARRAY14[0]
            FileWriteLine($file3, $SPLIT_FILE12[2] & ";" & $SPLIT_FILE12[1] & ";" & $SPLIT_FILE14[2] & ";")
            $nn = $nn + 1
            If $nn > $ARRAY14[0] Then ExitLoop
            $SPLIT_FILE14 = StringSplit($ARRAY14[$nn], ";")
            $USER_FILE14 = $SPLIT_FILE14[1]
            If $USER_FILE12 <> $USER_FILE14 Then ExitLoop
         WEnd
      Case $USER_FILE12 < $USER_FILE14
         ; no detail records for this user: just move on to the next master record
      Case $USER_FILE12 > $USER_FILE14
         MsgBox(0, "test", "ERRO!!!")          ; detail user missing from the master file
   EndSelect
Next
FileClose($file3)

When you guys come over, I'll buy all the beers!!!


#15 ·  Posted (edited)

merge_stuff.zip

@ter-pierre

EDIT> I didn't see your last post before "sending" mine!

1- congratulations

2- the following may help you anyway

3- I forgot the beer!! (there should be many!!!!) :)

4- disregard the immediately following line; as you seem to actually understand something about programming, I'm sorry :)

<EDIT

It would be rather difficult if you didn't know anything about programming

So, it would be unusual, but I'll provide the solution:

the final merge script, a script to generate test files, the final result, and the associated logfile.

The test files were 5,000 and 75,000 records respectively (approximately YOUR figures).

The final merge runs in a bit more than a minute on a P4 2.6 GHz!!!! Far from your 40 hours.

Actually, the file generation takes 4 times longer, because of a lot of "random" calls (it could have been made better, but... that's it for now!)

EDIT> I used file processing (vs. arrays) because it works even with HUGE files (10^6 or more records), where array processing will probably run out of memory (the algorithms are identical; only how the data is accessed changes) <EDIT

attached

- test files generation script

- extract of test file file1 (out of 5000 records)

- extract of test file file2 (out of 75000 records)

- merge script

- extract of resulting file

- extract of the logfile

Edited by lupusbalo


#16 ·  Posted (edited)

Not a "common day-to-day" need, but just for fun, a test merge on big files:

start logfile>

2005-02-15 00:05:16 : for key:AAADIUSKAU records processed: 12

2005-02-15 00:05:16 : for key:AAAFUMZOGSQJFZ records processed: 26

.................... rest of logfile

2005-02-15 00:39:04 : for key:ZZZNYLUSFDUNZ records processed: 29

2005-02-15 00:39:04 : for key:ZZZOBGOEUOIWVAK records processed: 4

2005-02-15 00:39:04 : for key:ZZZORBMJCONR records processed: 22

2005-02-15 00:39:04 : for key:ZZZSBUSKP records processed: 3

2005-02-15 00:39:05 : C:\Mes documents\Auto_IT scripts\TMP1-2.txt: 149 999 records processed (*)

2005-02-15 00:39:15 : C:\Mes documents\Auto_IT scripts\TMP1-4.txt: 2 249 048 records processed (*)

2005-02-15 00:39:15 : merge duration: 2028 SEC. (*)

<End Logfile

(*) "real file"; only some editing (red, bold, thousands separators) was done to improve clarity

2.4 million records in ~34 MIN

Who says AutoIt is slow??

Edited by lupusbalo


Every (big) script can be sped up by coding it efficiently.

