custos

HELP sorting Large CSV files

22 posts in this topic

Firstly, awesome software. I'm an old-school (rusty) C developer and had been looking for over a year for a rapid application development platform to drive Windows apps; last week I finally found AutoIt.

Thank god, because I always found the MS Visual Express stuff too complex. Thanks for your great software! I'm actually accomplishing a lot already.

PROBLEM-

I have large 200 MB CSV files that could grow to 1-2 GB in the future, so using Excel can become very painful (and I would rather not use it unless it's a last resort). I also suck at SQL, so I don't want to deal with Access either.

I have 13+ files, each with a "Date Time" column containing data like: 7/22/2011 15:10

This data is to be used as an index to match up all 13+ files and produce a "Master" file, if you will, with only one "Date Time" column.

The issue is that all the files have some "Date Time" data missing, so 10 of them might have "7/22/2011 15:03" but the others won't. That data needs to be ignored.

Only data whose "Date Time" cell values appear in ALL files must be used to produce the master file.

Can somebody please help with a function or program showing how to do this natively in AutoIt?

Any and all help will be appreciated.

PS - I have already tried using _ExcelReadArray etc. to do this, but it is extremely slow, and the array routines also clobbered the date format when reading it.




You should show us snippets of the files, so we can see what other columns they contain.




If your files are 200 MB at this stage and will grow even more (you said 1-2 GB), you can forget about processing them with AutoIt.

They are simply too big; being mainly text files, if you try to read them into an array you will quickly hit the limit for arrays.

It might work at this point, but it will get slower and slower over time, because you can't read only part of a file.

If you use FileReadLine with a line number, it reads the whole file and then goes to that specific line - it doesn't matter that it returns only one line of text, it reads the whole file every time.

I remember trying to read some 500 MB files and running into memory problems.

The best you can hope for is to get the file content into one big array - one line per element - and do your processing on that. That way you read the file only once.

About using Excel to deal with them: if I'm not wrong, you can have a maximum of about 65k rows in an Excel sheet (in versions before Excel 2007) - any more than that and it might not work.

I am not an expert and I might be wrong, but all my attempts to deal with files this large have failed.





All the other columns contain only integers or floating-point numbers.



That's what I have been doing, i.e. pulling one row at a time and comparing, but I was doing it in Excel.

Do you have any code that does it all, yanking one line at a time?

Also, there are Linux/Unix/GNU commands, also available for Windows, that do this type of thing very efficiently - probably a combination of "sort", "grep" and "comm". I might have to resort to that, unless someone here can help :)


FileRead, then StringSplit on @CRLF, and you end up with one big array in memory holding all the lines.

See if it works this way.
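A minimal sketch of that approach (the file name and the position of the "Date Time" column are assumptions; adapt to your layout):

```autoit
; Read the whole file once, then split it into lines in memory.
Local $sText = FileRead("test.csv")            ; one disk read for the entire file
Local $aLines = StringSplit($sText, @CRLF, 1)  ; flag 1: treat @CRLF as one delimiter; $aLines[0] = line count

For $i = 1 To $aLines[0]
    Local $aFields = StringSplit($aLines[$i], ",")
    If $aFields[0] < 1 Or $aFields[1] = "" Then ContinueLoop ; skip rows with no "Date Time" value
    ; ... compare $aFields[1] (assumed to be the "Date Time" column) against the other files here
Next
```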




Any similar code examples, perchance?


Hi custos,

Welcome to the forum :)

Post a couple of lines of the CSV (maybe 50 or so) and a sample of what you would like the master file to look like, as well as your attempted code, so that we can better understand your situation and provide more relevant help and examples.

If you run into difficulty while hacking together your code, post it here ;) We'll help you debug.

Good luck,

-smartee


Custos, you are working with some extremely large files, so you are going to have performance issues, especially when you get into multi-gigabyte file sizes. Anyway, I played around with a test file and came up with something basic to get you started. I put the date in column B. The easiest way I could think of to sort by date was to convert the date to a "general" number and sort on that. At the end I display the final array, but you will probably want to save the Excel file instead. After sorting, you could convert the "general" numbers back to dates. Anyway, it's bedtime, so I'm off to sleep. Good luck!

#include <Excel.au3>
#include <File.au3>
#include <Array.au3>

$file = "test.csv"
$loopCount = _FileCountLines($file)
Local $oExcel = _ExcelBookNew()

For $i = 1 To $loopCount
    $lineRead = FileReadLine($file, $i)
    $aArray = StringSplit($lineRead, ",", 2)
    _ExcelWriteArray($oExcel, $i, 1, $aArray)
Next

; Repeat above loop to add lines from other files

_ExcelNumberFormat($oExcel, "General", 1, 2, $loopCount, 2) ; my test file has the date in the 2nd column

$bigArray = _ExcelReadSheetToArray($oExcel)

_ArraySort($bigArray, 0, 0, 0, 2) ; sort array from oldest date to newest

_ArrayDisplay($bigArray)


#11 ·  Posted (edited)

@sleepydvdr

Best regards for providing the code, but your code fell into the trap I was telling him to avoid.

For every line, you are reading the whole file again and again and again - that accounts for roughly a 90% waste of time.

You need to read it once then work with the content from memory.

@custos

I have given you the idea and the functions to work with. You said you are an "old school rusty C developer", but that somehow doesn't match asking twice for code ...

I've told you which functions to use - a simple look in the help file (for examples and syntax) and a simple For/Next (if I'm not wrong, C has that too) would have been enough for you to put together the code you were asking for; it wouldn't have taken more than 30 minutes of your time. I am 99% for "teaching a man to fish", and this case definitely isn't in the remaining 1%.

Edit: I have just realised that all you did was ask for code - not even a little bit of consideration for those stepping in to help you. You should do better than that.

Edited by enaiman




@enaiman

Thanks for your underhanded insults, always appreciated by new forum users.

If I weren't "rusty" I wouldn't have requested help. I figured somebody might have solved a very similar problem and could give me a code example, which would help me quickly. I'm sorry you answered my request.



@sleepydvdr

Thanks, very helpful, making changes to get it going.


With sizes in this ballpark your best bet is SQLite.




There is another possibility: using SQLite.

For using SQLite look here


OK folks, I'm trying out SQLite (although I don't know crap about SQL statements), because the algorithm is getting too complex and not worth the effort. I will ask for help with SQL statements once I get the tables working. Thanks.


Don't fear SQL that much. Once you get used to basic design/statements, it's fairly easy to reach more sophisticated usage and queries.

From what you posted, I can only recommend you use ISO string format for your dates/times (YYYY-MM-DD HH:MM:SS). Not only does this format sort/index naturally, it's also much more user-friendly than Unix-style timestamps.
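For example, a "7/22/2011 15:10" value like the OP's could be converted to ISO form before insertion - a sketch, with a made-up helper name:

```autoit
; Convert "M/D/YYYY H:MM" (US style, as in the OP's files) to ISO "YYYY-MM-DD HH:MM:SS".
Func _ToISO($sUS)
    Local $a = StringRegExp($sUS, "(\d+)/(\d+)/(\d+) (\d+):(\d+)", 1) ; month, day, year, hour, minute
    If @error Then Return SetError(1, 0, "")
    Return StringFormat("%04i-%02i-%02i %02i:%02i:00", $a[2], $a[0], $a[1], $a[3], $a[4])
EndFunc

; _ToISO("7/22/2011 15:10") returns "2011-07-22 15:10:00"
```

Converting once, at import time, means every later sort and comparison works on plain strings.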

Select a reasonable default for records (aka SQL "rows") where date/time is missing.

SQLite has the "nice & evil" feature of letting you store _any_ data type in any column, but force yourself to use SQLite types consistently as far as possible: it will ease processing and streamline your code.

Rush to download a good, recent, supported SQLite DB manager - it will save you up to 80% of your design/setup/administration time.

I warmly, strongly, recommend SQLite Expert. Don't hesitate to invest in the Pro (payware but cheap) version, as it allows import/export of CSV data among various formats. Pro features pay for themselves in days, if not in hours.

SQLite has a lively and friendly mailing list where you can obtain expert advice on specific questions. Anyway, take the time to browse some of the SQLite docs before starting your design.

I can't tell you more right now, and I may be off the Internet for an unbounded time shortly, but I'll try to guide you if you ask questions here, provided I can get access.

Best of luck!




...............
; group 1

_SQLite_Exec(-1, "CREATE TABLE Symbol1 (DateTime TEXT PRIMARY KEY, Open TEXT, Vol TEXT, SynthVIX TEXT, VWAP TEXT);")
If @error Then ConsoleWriteError("SQLERROR=" & @extended)

For $i = 1 To $loopCount1
    $lineRead1 = FileReadLine($infile1, $i)
    $Array1 = StringSplit($lineRead1, ",")

    _ArrayDisplay($Array1) ; debug display - remove for large runs

    $sql_cmd = "INSERT INTO Symbol1 (DateTime, Open, Vol, SynthVIX, VWAP) VALUES ('" & $Array1[1] & "', '" & $Array1[2] & "', '" & $Array1[3] & "', '" & $Array1[4] & "', '" & $Array1[5] & "');"
    _SQLite_Exec(-1, $sql_cmd)
    $sql_cmd = ''
Next

; group 2

_SQLite_Exec(-1, "CREATE TABLE Symbol2 (DateTime TEXT PRIMARY KEY, Open TEXT, Vol TEXT, SynthVIX TEXT, VWAP TEXT);")
If @error Then ConsoleWriteError("SQLERROR=" & @extended)

For $i = 1 To $loopCount2
    $lineRead2 = FileReadLine($infile2, $i)
    $Array2 = StringSplit($lineRead2, ",")

    _ArrayDisplay($Array2) ; debug display - remove for large runs

    $sql_cmd = "INSERT INTO Symbol2 (DateTime, Open, Vol, SynthVIX, VWAP) VALUES ('" & $Array2[1] & "', '" & $Array2[2] & "', '" & $Array2[3] & "', '" & $Array2[4] & "', '" & $Array2[5] & "');"
    _SQLite_Exec(-1, $sql_cmd)
    $sql_cmd = ''
Next
...............

OK, thanks a lot - it wasn't too bad, although I coded all day like a kid just starting out :)

The tables are working; the code above was used and double-checked with SQLite Expert Personal.

So, hopefully the last question for this project -

What SQL statement would give me the INTERSECTION of the two tables based on their "DateTime" columns (i.e. DateTime1 = DateTime2), please?

PS - I am appending ":00" to the DateTime column entries as I type this.


The following seems to work; now I'm trying to figure out how to print all the matching columns.

SELECT DateTime FROM Symbol1

INTERSECT

SELECT DateTime FROM Symbol2


#20 ·  Posted (edited)

Congrats, you got the main part of it.

Some remarks:

You definitely should use transactions for bulk inserts; it will speed up this phase dramatically.

With separate INSERTs like you're currently doing, SQLite treats every single insert as its own transaction (look up autocommit mode in the SQLite documentation), so your statements are equivalent to the following:

For loop
    BEGIN;    (implicit transaction around every statement in autocommit mode)
    INSERT ...;
    COMMIT;
Next

What you should do:

BEGIN;    (explicit transaction around 1000 to 100 000 inserts)
For loop
    INSERT ...;
Next
COMMIT;    (explicit commit)
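Applied to the insert loop from the excerpt above, that would look something like this (a sketch reusing $infile1 and $loopCount1 from that code):

```autoit
_SQLite_Exec(-1, "BEGIN;") ; one explicit transaction around the whole batch

For $i = 1 To $loopCount1
    $lineRead1 = FileReadLine($infile1, $i)
    $Array1 = StringSplit($lineRead1, ",")
    _SQLite_Exec(-1, "INSERT INTO Symbol1 (DateTime, Open, Vol, SynthVIX, VWAP) VALUES ('" & _
            $Array1[1] & "', '" & $Array1[2] & "', '" & $Array1[3] & "', '" & _
            $Array1[4] & "', '" & $Array1[5] & "');")
Next

_SQLite_Exec(-1, "COMMIT;") ; all rows are written to disk in one go
```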

Now a word about your design. Look at your tables: they are identical! A good rule in SQL design is that all similar "objects" should live in a single table.

I would merge your tables into one:

CREATE TABLE [symbols] (
    [id] INTEGER PRIMARY KEY AUTOINCREMENT,
    [DateTime] TEXT,
    [source] INTEGER NOT NULL,
    [Open] TEXT,
    [Vol] TEXT,
    [SynthVIX] TEXT,
    [VWAP] TEXT,
    CONSTRAINT [ukDateSrce] UNIQUE([DateTime], [source]) ON CONFLICT IGNORE
);

Use source = 1, 2, ... N depending on the source file. You may want to build a "SourceFile" table to keep track of which is which (a foreign key would be appropriate).

I added an explicit integer Id (an explicit column for rowid). In this special role, INTEGER is not the same as INT.

Then you add a compound unique constraint on (DateTime, source) with the option ON CONFLICT IGNORE, in order to ignore errors if there is ever more than one row with datetime = "" in the same source file.

BTW, your insert loop should skip records where the date is missing - probably something you're already doing. Alternatively, you could write an SQL trigger to take care of that, but triggers slow things down, and I guess a test and ContinueLoop will be faster in your case.

Now your query to find the intersection of source files 1 and (say) 5 (with those explicit values) could become (this is not the best way to do it in the general case):

select datetime from symbols S1 where source = 1 and exists (select 1 from symbols S2 where s1.datetime = s2.datetime and s2.source = 5);

More generally, finding the datetimes common to all N files entered in the DB (the unique (DateTime, source) constraint guarantees each source contributes at most one row per datetime), with N being the number of source files:

select datetime from symbols group by datetime having count(*) = N;

Note that having count(*) > 1 would instead give datetimes present in at least two files, not in all of them.

Edited by jchd


