revonatu

Finding and reading a specific line in big CSV

18 posts in this topic

Hi,

I have a big CSV file with 1,864,653 lines and the following variables:

Date (UTC),Time (UTC),TimeFromStart (s),PosLat (deg),PosLon (deg),PosAlt (m), ...

28/01/2015,01:36:44.010,0.000,-36.78730154,158.63387961,19.397, ...

...

I want to find the line whose UTC Time is closest to a given one (stored in a variable) in the same format (hh:mm:ss.sss). Once the line is found, I want to extract the Lat, Lon and Altitude values.

How can I do this while avoiding reading the whole file, which is big and would cost a lot of processing time?

Thanks for hints and ideas.

You can read the file line by line and compare the time. If it is near or equal to the searched time, you can save that line.

You have to define what "near" means.

Still, reading a file with 1,864,653 lines will take some time. By the way, what is the file size of this CSV?
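The suggested line-by-line scan can be sketched like this (Python used here for illustration; in AutoIt the same loop would use FileReadLine). The sample rows are invented from the header in the post:

```python
import csv, os, tempfile
from datetime import datetime

# Hypothetical sample data in the thread's column layout.
SAMPLE = """Date (UTC),Time (UTC),TimeFromStart (s),PosLat (deg),PosLon (deg),PosAlt (m)
28/01/2015,01:36:44.010,0.000,-36.78730154,158.63387961,19.397
28/01/2015,01:36:44.014,0.004,-36.78730155,158.63387962,19.398
28/01/2015,01:36:44.018,0.008,-36.78730156,158.63387963,19.399
"""

def closest_row(path, target):
    """Scan line by line, keeping the row whose 'Time (UTC)' is closest to target."""
    want = datetime.strptime(target, "%H:%M:%S.%f")
    best_row, best_diff = None, None
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            t = datetime.strptime(row["Time (UTC)"], "%H:%M:%S.%f")
            diff = abs((t - want).total_seconds())
            if best_diff is None or diff < best_diff:
                best_row, best_diff = row, diff
    return best_row

path = os.path.join(tempfile.mkdtemp(), "flight.csv")
with open(path, "w") as f:
    f.write(SAMPLE)
row = closest_row(path, "01:36:44.015")
print(row["PosLat (deg)"], row["PosAlt (m)"])
```

This visits every line, so on 1.8 million rows it pays the full read cost the OP wants to avoid; the later posts address that.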





Well, I don't know how to guess a file position for data I haven't found yet. You'll have to read somehow, a line at a time or in chunks of x bytes, but how can you find it without finding it? Sorry about the file size. Why don't you split the file up by date, if that is the search criterion? It would make subsequent searches faster.


If this is a routine operation, I strongly suggest SQLite, as Zedna just said.

You didn't give enough information, so this is not a ready-to-use road map, but you can try the following:

Download the latest sqlite3.exe utility from the official SQLite site http://www.sqlite.org/download.html (select the precompiled command-line shell for Windows).

I tried with the following dummy file which I named testimport.csv:

Date (UTC),Time (UTC),TimeFromStart (s),PosLat (deg),PosLon (deg),PosAlt (m),Other data
28/01/2015,01:36:44.010,0.000,-36.78730154,158.63387961,19.397,abc def ghi jkl
28/01/2015,01:37:44.010,0.000,-36.78730154,158.63387961,19.397,abc def ghi jkl
28/01/2015,01:38:44.010,0.000,-36.78730154,158.63387961,19.397,abc def ghi jkl
28/01/2014,01:36:44.010,0.000,-36.78730154,158.63387961,19.397,abc def ghi jkl
28/01/2013,01:36:44.010,0.000,-36.78730154,158.63387961,19.397,abc def ghi jkl
28/01/2001,01:36:44.010,0.000,-36.78730154,158.63387961,19.397,abc def ghi jkl
28/02/2014,01:36:44.010,0.000,-36.78730154,158.63387961,19.397,abc def ghi jkl
28/03/2014,01:36:44.010,0.000,-36.78730154,158.63387961,19.397,abc def ghi jkl
28/04/2014,01:36:44.010,0.000,-36.78730154,158.63387961,19.397,abc def ghi jkl
26/01/2014,01:36:44.010,0.000,-36.78730154,158.63387961,19.397,abc def ghi jkl

using this in a prompt shell:

C:\Users\jc\Documents\AutoMAT\tmp>sqlite3 data.sq3
SQLite version 3.8.8.3 2015-02-25 13:29:11
Enter ".help" for usage hints.
sqlite> .mode csv
sqlite> .import testimport.csv
Usage: .import FILE TABLE
sqlite> .import testimport.csv TestData
sqlite> .quit

Now file data.sq3 is created with the data in table TestData.

Go download SQLite Expert personal Edition (free) here http://www.sqliteexpert.com/

Open database data.sq3 and hit the Data tab: you can see the sample data.

Because the Date and Time columns are separate and not in a standard format, I advise you to issue the following commands in a new SQL tab:

alter table testdata add column Timestamp;
update testdata
       set timestamp = 
       substr("Date (UTC)", 7, 4) || '-' ||
       substr("Date (UTC)", 4, 2) || '-' ||
       substr("Date (UTC)", 1, 2) ||
       ' ' ||
       "Time (UTC)";
CREATE INDEX idxTS ON TestData (Timestamp);

Now, within minutes, you're ready to query the database efficiently. If you have a time window like '2014-01-27 10:28:00' +/- 23 hours, you can get the result by querying:

select
      timestamp,
      "poslat (deg)" Lat,
      "poslon (deg)" Lon,
      "posalt (m)" Alt
from testdata
where
     timestamp between datetime('2014-01-27 10:28:00', '-23 hours')
               and datetime('2014-01-27 10:28:00', '+23 hours');

giving you:

Timestamp    Lat    Lon    Alt
2014-01-28 01:36:44.010    -36.78730154    158.63387961    19.397

(I was too lazy to change the coordinates!)

If you actually want the row whose timestamp is closest to a given point, the query is a bit more involved, but it's definitely possible.

There are MANY optimizations left open here; this is just a sketch.
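One workable pattern for the closest-row case (sketched here through Python's built-in sqlite3 module, with made-up sample rows; the UNION trick itself is plain SQLite) is to fetch the nearest candidate on each side of the target, then keep whichever is closer:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE testdata (Timestamp TEXT, Lat REAL, Lon REAL, Alt REAL)")
conn.executemany("INSERT INTO testdata VALUES (?, ?, ?, ?)", [
    ("2015-01-28 01:36:44.010", -36.78730154, 158.63387961, 19.397),
    ("2015-01-28 01:37:44.010", -36.78730154, 158.63387961, 19.397),
    ("2015-01-28 01:38:44.010", -36.78730154, 158.63387961, 19.397),
])
conn.execute("CREATE INDEX idxTS ON testdata (Timestamp)")

target = "2015-01-28 01:37:10.000"
# One candidate at/after the target and one before it (each inner query
# can walk the index), then sort the two by distance in time.
row = conn.execute("""
    SELECT Timestamp, Lat, Lon, Alt FROM (
        SELECT * FROM (SELECT * FROM testdata
                       WHERE Timestamp >= :t ORDER BY Timestamp ASC  LIMIT 1)
        UNION ALL
        SELECT * FROM (SELECT * FROM testdata
                       WHERE Timestamp <  :t ORDER BY Timestamp DESC LIMIT 1)
    )
    ORDER BY abs(julianday(Timestamp) - julianday(:t))
    LIMIT 1
""", {"t": target}).fetchone()
print(row[0])
```

Here the target is 26 seconds after the 01:36:44 row and 34 seconds before the 01:37:44 row, so the earlier row wins.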





A few basic questions:

  1. Is this information used as read only or read mostly? If so, it may be reasonable to pre-process the data into a more easily searchable format whenever the file is updated.
  2. Are the CSV records in chronological order (that is, monotonic with respect to UTC)? If so, you can perform a binary search on the UTC fields.

If the answer to #1 is Yes, then the pre-processing can include sorting the records (or producing a separate sorted index file). If the answer to #1 is No, but #2 is Yes, you can perform a binary search based on the File offset. Each iteration of the search will likely end up in the middle of a record, so some logic to sync up to the next end-of-line will be needed.
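The offset-based binary search with end-of-line resync can be sketched as follows (Python for testability; file contents are hypothetical values in the thread's column layout). Because the times are zero-padded hh:mm:ss.sss strings, they compare correctly as plain text:

```python
import os, tempfile

def next_line(f, offset):
    """Seek to an arbitrary byte offset, then resync to the start of the
    next full line (each probe may land mid-record, as noted above)."""
    f.seek(offset)
    f.readline()                      # discard the partial line we landed in
    return f.readline().decode()

def first_at_or_after(path, target):
    """Binary-search a chronologically sorted CSV by byte offset and return
    the first data line whose 'Time (UTC)' field is >= target.
    Returns '' if the target lies past the last line."""
    lo, hi = 0, os.path.getsize(path)
    with open(path, "rb") as f:
        while lo < hi:
            mid = (lo + hi) // 2
            line = next_line(f, mid)
            if line and line.split(",")[1] < target:
                lo = mid + 1          # everything up to here is too early
            else:
                hi = mid              # this line (or EOF) is at/after target
        return next_line(f, lo).strip()

# Demo on a small generated file: one record every half second.
path = os.path.join(tempfile.mkdtemp(), "flight.csv")
with open(path, "w") as f:
    f.write("Date (UTC),Time (UTC),TimeFromStart (s),PosLat (deg),PosLon (deg),PosAlt (m)\n")
    for i in range(100):
        f.write("28/01/2015,01:36:%06.3f,%.3f,-36.787,158.633,19.397\n" % (i * 0.5, i * 0.5))

hit = first_at_or_after(path, "01:36:07.200")
print(hit)
```

Each probe reads only two short lines, so the whole search touches O(log n) lines instead of 1.8 million; for the truly closest row, compare this hit with the line just before it.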

Hope this helps.


#7 ·  Posted (edited)

Thanks for the suggestions so far. I hoped it would be easier. Here are some more details on my data:

The 167 MB table contains the coordinates of a whole flight day. Within this day I need the mean value of a single flight line. I calculated the mean time from the start and end times of that stripe (saved in another file), and now I need the coordinates at that time. Read only.

Date (UTC),Time (UTC),TimeFromStart (s),PosLat (deg),PosLon (deg),PosAlt (m),

28/01/2015,01:36:44.010,0.000,-36.78730154,158.63387961,19.397
28/01/2015,01:36:44.014,0.004,-36.78730154,158.63387961,19.397

28/01/2015,01:36:44.018,0.008,-36.78730154,158.63387961,19.397

....

As you can see, the changes in time are slight and continuously increasing. The date remains the same. As an example, the given time would be 03:03:23.071.

As this task is only a preparatory piece of the puzzle within a long AutoIt process, I would prefer to solve it within AutoIt.

Edited by revonatu


#8 ·  Posted (edited)

So roughly 25 entries per second? You could read the first line of the file, see how many seconds that is away from your desired time, multiply that value by 25, and that's roughly where your target is. Grab the 50 entries before that and the 50 entries after, then find your time in that small chunk of 100. You could probably expand that out to many thousands and still finish in under a minute.

Edited by boththose



@boththose

Nice idea! :) But to select the 50 lines before and after the desired time, I would need to know their line numbers. Then we come full circle.

Using AutoIt:

For now I opened the CSV file in Notepad and used Find (Ctrl+F) to locate the desired time. Then I selected and copied that line. Now I try to pick my variables out of this line via StringRegExp. Not elegant, but at least it more or less works and doesn't take too long.
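The regex extraction step itself is simple in any language. A sketch (Python; the sample line is hypothetical but follows the thread's column order):

```python
import re

# One record in the thread's column order (values here are invented).
line = "28/01/2015,03:03:23.071,5199.061,-36.78730154,158.63387961,19.397"

# Skip Date, capture Time, skip TimeFromStart, capture Lat/Lon/Alt.
m = re.match(r"[^,]+,([^,]+),[^,]+,([-\d.]+),([-\d.]+),([-\d.]+)", line)
time_utc, lat, lon, alt = m.groups()
print(time_utc, lat, lon, alt)
```

For a clean CSV, a plain split on "," does the same job with less ceremony; the regex only earns its keep when the line needs validating at the same time.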

If someone has more elegant ideas, please share them with me.


#10 ·  Posted (edited)

You do know the line number, at least as a rough guess: it is 25 x $n, where $n is the number of seconds between the time on line 1 and your target time.

With that result in $x, just read lines $x - 50 to $x + 50 and search for your target. And if you have to expand that window because some seconds have more or fewer entries, I don't think you will see an unbearable performance hit until you get into the thousands.
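This estimate-then-window idea can be sketched like so (Python; the 25-entries-per-second rate and the +/-50-line pad are the values from the post, and the demo data is invented):

```python
from datetime import datetime

def to_seconds(hms):
    t = datetime.strptime(hms, "%H:%M:%S.%f")
    return t.hour * 3600 + t.minute * 60 + t.second + t.microsecond / 1e6

def window_search(lines, target, rate=25, pad=50):
    """Guess the target's index from the sample rate, then search only a
    small window around that guess instead of the whole list."""
    first = to_seconds(lines[0].split(",")[1])
    guess = int((to_seconds(target) - first) * rate)
    window = lines[max(0, guess - pad):guess + pad]
    # Within the window, keep the line whose time is closest to the target.
    return min(window,
               key=lambda ln: abs(to_seconds(ln.split(",")[1]) - to_seconds(target)))

# Demo: 1000 hypothetical records at exactly 25 entries per second.
lines = ["28/01/2015,01:00:%06.3f,0.0,-36.787,158.633,19.4" % (i * 0.04)
         for i in range(1000)]
hit = window_search(lines, "01:00:10.000")
print(hit.split(",")[1])
```

If the rate drifts, the window can be widened (or the guess re-anchored from the line actually found at the guessed index) without changing the idea.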

Edited by boththose



#11 ·  Posted (edited)

@boththose

BTW, it has to be 250 x $n, as one second consists of roughly 250 lines.

So far so good, but to get the difference between the two times I thought I would have to reformat them to hh-mm-ss, which costs time again; otherwise only the digits in front of the first ':' would be taken into account for the subtraction, which doesn't help.

Oh, I was wrong. No need to reformat. Fine. :)

Next question: I only know and found the functions FileRead and FileReadLine. Neither works well for reading several lines from the middle of a file. How do I read my $X +/- 1000 lines?

Edited by revonatu


revonatu,

Let me suggest that you revisit jchd's comments in post #5. What he has outlined is an example of what SQLite can do, AND it can all be done inline in your program relatively simply.

If this data is going to be read once, then the use of SQLite might be debatable. However, if you are going to access this data multiple times, then SQLite is the only storage mechanism that makes sense.

kylomas




I only need to read it once in this AutoIt session.


I see. If you can read the file into a string and you know the date/time you are looking for, then an SRE (StringRegExp) solution might be appropriate. (I know, a lot of "if"-ing.)




revonatu,

Just for grins I put together a test using SRE. 

The commented code generates a test file matching the sample data that you posted. Each log entry is incremented by 1 millisecond. I generated 1.8 million entries (114.25 MB).

The code at the end uses a regexp to find the last entry in the test file. This is the run log:

>"C:\Program Files (x86)\AutoIt3\SciTE\AutoIt3Wrapper\AutoIt3Wrapper.exe" /run /prod /ErrorStdOut /in "C:\Users\ADMIN010\Documents\help large csv.au3" /UserParams    
+>02:54:49 Starting AutoIt3Wrapper v.14.801.2025.0 SciTE v.3.4.4.0   Keyboard:00000409  OS:WIN_7/Service Pack 1  CPU:X64 OS:X64    Environment(Language:0409)
+>         SciTEDir => C:\Program Files (x86)\AutoIt3\SciTE   UserDir => C:\Users\ADMIN010\AppData\Local\AutoIt v3\SciTE\AutoIt3Wrapper   SCITE_USERHOME => C:\Users\ADMIN010\AppData\Local\AutoIt v3\SciTE 
>Running AU3Check (3.3.13.19)  from:C:\Program Files (x86)\AutoIt3  input:C:\Users\ADMIN010\Documents\help large csv.au3
+>02:54:49 AU3Check ended.rc:0
>Running:(3.3.12.0):C:\Program Files (x86)\AutoIt3\autoit3.exe "C:\Users\ADMIN010\Documents\help large csv.au3"    
--> Press Ctrl+Alt+Break to Restart or Ctrl+Break to Stop
Time to find entry = 4.77974345543552
Entry = 15/04/2015, 03:15:02.000,0.000,-36.78730154,158.63387961,19.397
+>02:54:54 AutoIt3.exe ended.rc:0
+>02:54:54 AutoIt3Wrapper Finished.
>Exit code: 0    Time: 5.505

The time is in seconds.

The code I used...

; 28/01/2015,01:36:44.010,0.000,-36.78730154,158.63387961,19.397

#cs - code used to generate test file of incrementing milliseconds

#include <array.au3>
#include <date.au3>
#include <file.au3>

local $st = timerinit()
local $a20[1800001]
local $MS = 000, $tail = '0.000,-36.78730154,158.63387961,19.397', $dte = _NowCalc()

for $1 = 0 to 1800000

    if mod($1,1000) = 0 then _add_second()
    $a20[$1] = stringregexpreplace($dte,'(\d\d\d\d)/(\d\d)/(\d\d)(.*)','$3/$2/$1,$4') & '.' & stringformat('%03i',$MS) & ',' & $tail
    $MS += 1

Next

ConsoleWrite('Test array generated in ' & timerdiff($st)/1000 & ' seconds' & @CRLF)

$st = timerinit()

_FileWriteFromArray(@scriptdir & '\testcsv.csv',$a20)

ConsoleWrite('Test file  generated in ' & timerdiff($st)/1000 & ' seconds' & @CRLF)

func _add_second()

    $MS = 000
    $dte = _dateadd('s',1,$dte)

endfunc

#ce

local $TimeToFind = '15/04/2015, 03:15:02.000'

local $st = timerinit()

local $aRET = stringregexp(fileread(@scriptdir & '\testcsv.csv'),$TimeToFind & '.*',3)

ConsoleWrite('Time to find entry = ' & timerdiff($st)/1000 & @LF & 'Entry = ' & $aRET[0] & @CRLF)

kylomas




#16 ·  Posted (edited)

With such a big CSV file (167 MB), I recommend reading only part of the file using FileOpen + FileSetPos + FileRead, as in this example (the example reads characters from the end of a file).

In your case, determine the approximate position of your time in the file, based on the approximate length of each line and the number of expected lines.

Then read a few KB before and after that point into a variable. You can then do a simple StringSplit() by @CRLF and go through each line in a For/Next loop; because there will be only a "few" lines, speed will no longer be a big problem.

So the idea is to avoid reading (and parsing) the WHOLE file into a variable.

EDIT:

Here is an example of my idea:

MsgBox(0, '10KB from position 3MB', _FileReadMyPos('test.csv', 3000000))

Func _FileReadMyPos($FileName, $pos, $count = 10000) ; read 10 KB by default
    $hFile = FileOpen($FileName, 0) ; open for reading
    FileSetPos($hFile, $pos, 0) ; offset measured from the beginning of the file
    $sOut = FileRead($hFile, $count)
    FileClose($hFile)
    Return $sOut
EndFunc

Edited by Zedna


#17 ·  Posted (edited)

So you can read a few lines from the beginning of the file and a few lines from the end to analyze the starting/ending date/time and the average line length.

After this you can read (and parse) a small part of the file from the desired position, as I wrote in my previous post.
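The head-and-tail calibration can be sketched as follows (Python for testability; the helper names and the generated demo file are assumptions, not code from the thread). It reads a few KB from each end, interpolates linearly between the first and last timestamps, then probes near the estimate:

```python
import os, tempfile
from datetime import datetime

def to_s(hms):
    t = datetime.strptime(hms, "%H:%M:%S.%f")
    return t.hour * 3600 + t.minute * 60 + t.second + t.microsecond / 1e6

def estimate_offset(path, target_s, probe=4096):
    """Estimate the byte offset of a target time (seconds of day) by reading
    only the head and tail of the file and interpolating linearly between
    the first and last timestamps."""
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        head = f.read(probe).decode().splitlines()
        f.seek(max(0, size - probe))
        tail = f.read(probe).decode().splitlines()
    first = to_s(head[1].split(",")[1])   # head[0] is the header row
    last = to_s(tail[-1].split(",")[1])   # last complete line of the file
    return int(size * (target_s - first) / (last - first))

# Build a 1000-line sample file at one record per second (invented values).
path = os.path.join(tempfile.mkdtemp(), "test.csv")
with open(path, "w") as f:
    f.write("Date (UTC),Time (UTC),TimeFromStart (s),PosLat (deg),PosLon (deg),PosAlt (m)\n")
    for i in range(1000):
        mm, ss = divmod(i, 60)
        f.write("28/01/2015,01:%02d:%02d.000,%d.0,-36.787,158.633,19.4\n" % (mm, ss, i))

est = estimate_offset(path, to_s("01:08:20.000"))
with open(path, "rb") as f:        # read a small chunk around the estimate
    f.seek(est)
    f.readline()                   # resync to the next full line
    hit = f.readline().decode().split(",")[1]
print(hit)
```

The estimate lands within a line or two of the target here because the lines are nearly equal in length; with more variable lines, read a bigger window around the estimate, as Zedna suggests.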

Edited by Zedna


#18 ·  Posted (edited)

I reduced the resolution to one line per second, which is enough for my needs, and now I can read the file into an array and pick my variables.

Here is my working code (bloated by error checks):

If Not _FileReadToArray($posPath, $Pos, 4, " ") Then
    MsgBox(16, "Error", "Failed to read file. Error: " & @error)
    Exit
EndIf

Local $LineX = _ArrayBinarySearch($Pos, Round($MeanTime, 0), 0)
If Not @error Then
    ;MsgBox(64, "Found", "LineX=    " & $LineX, 2) ; e.g. row 5200
Else
    MsgBox(16, "Error", "Error: " & @error)
    Exit
EndIf

Local $ArrayX = _ArrayExtract($Pos, $LineX, $LineX) ; extract the desired row to a new array
If Not @error Then
    ;_ArrayDisplay($ArrayX, "ArrayX", "", 96, Default, "TimeOfDay (UTC)|PosLat (deg)|PosLon (deg)|PosHeight (m)|AngleRoll (deg)|AnglePitch (deg)|Heading (deg)", 2)
Else
    MsgBox(16, "Error", "Error: " & @error)
    Exit
EndIf

$Lat  = _ArrayUnique($ArrayX, 1) ; $ArrayX[1]
If Not @error Then
    ;_ArrayDisplay($Lat)
Else
    MsgBox(16, "Error", "Failed to extract. Error: " & @error)
EndIf
$Lon  = _ArrayUnique($ArrayX, 2) ; $ArrayX[2]
;$Elev = _ArrayUnique($ArrayX, x) ; $ArrayX[x]
$Alt  = _ArrayUnique($ArrayX, 3) ; $ArrayX[3]
$Head = _ArrayUnique($ArrayX, 6) ; $ArrayX[6]
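For reference, the final approach (read into an array, binary-search the time column, pull out the fields) maps onto other languages the same way. A Python sketch with hypothetical sample rows:

```python
import bisect

# One row per second after reducing the resolution; values are invented.
# Columns mirror the AutoIt array: [TimeOfDay_s, Lat, Lon, Alt, Heading]
rows = [
    [11003, -36.78730154, 158.63387961, 19.397, 87.3],
    [11004, -36.78730200, 158.63388000, 19.401, 87.4],
    [11005, -36.78730250, 158.63388040, 19.405, 87.5],
]

mean_time = 11004.4
times = [r[0] for r in rows]
i = bisect.bisect_left(times, round(mean_time))  # plays the _ArrayBinarySearch role
lat, lon, alt, heading = rows[i][1:5]
print(lat, lon, alt, heading)
```

At one row per second, rounding the mean time and binary-searching is both fast and accurate enough, which is exactly the trade the final post makes.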

Thanks for your help.

Edited by revonatu

