revonatu

Finding and reading a specific line in big CSV

18 posts in this topic

Hi,

I have a big CSV file with 1,864,653 lines and the following variables:

Date (UTC),Time (UTC),TimeFromStart (s),PosLat (deg),PosLon (deg),PosAlt (m), ...

28/01/2015,01:36:44.010,0.000,-36.78730154,158.63387961,19.397, ...

...

I want to find the line whose UTC Time is closest to a given one (stored in a variable) in the same format (hh:mm:ss.sss). Once the line is found, I want to extract the Lat, Lon and Altitude values.

How can I do this while avoiding reading the whole file, which is big and would cost a lot of processing time?

Thanks for hints and ideas.

You can read the file line by line and compare the time. If it is near or equal to the searched time, you can save that line.

You have to define what "near" means.

Still, reading a file with 1,864,653 lines will take some time. By the way, what is the file size of this CSV?
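The suggested line-by-line scan can be sketched like this (Python used here for illustration; in AutoIt the same loop would use FileReadLine). The sample rows are invented from the header in the post:

```python
import csv, os, tempfile
from datetime import datetime

# Hypothetical sample data in the thread's column layout.
SAMPLE = """Date (UTC),Time (UTC),TimeFromStart (s),PosLat (deg),PosLon (deg),PosAlt (m)
28/01/2015,01:36:44.010,0.000,-36.78730154,158.63387961,19.397
28/01/2015,01:36:44.014,0.004,-36.78730155,158.63387962,19.398
28/01/2015,01:36:44.018,0.008,-36.78730156,158.63387963,19.399
"""

def closest_row(path, target):
    """Scan line by line, keeping the row whose 'Time (UTC)' is closest to target."""
    want = datetime.strptime(target, "%H:%M:%S.%f")
    best_row, best_diff = None, None
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            t = datetime.strptime(row["Time (UTC)"], "%H:%M:%S.%f")
            diff = abs((t - want).total_seconds())
            if best_diff is None or diff < best_diff:
                best_row, best_diff = row, diff
    return best_row

path = os.path.join(tempfile.mkdtemp(), "flight.csv")
with open(path, "w") as f:
    f.write(SAMPLE)
row = closest_row(path, "01:36:44.015")
print(row["PosLat (deg)"], row["PosAlt (m)"])
```

This visits every line, so on 1.8 million rows it pays the full read cost the OP wants to avoid; the later posts address that.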





Well, I don't know how to guess a file position for data I haven't found yet. You'll have to read somehow, a line at a time or in chunks of x bytes, but how can you find it without finding it? Sorry about the file size. Why don't you split the file up by date, if that is the search criterion? It would make subsequent searches faster.


If this is a routine operation, I strongly suggest SQLite, as Zedna just said.

You didn't give enough information, so this is not a ready-to-use road map, but you can try the following:

Download the latest sqlite3.exe utility from the official SQLite site http://www.sqlite.org/download.html (select the precompiled command-line shell for Windows).

I tried with the following dummy file which I named testimport.csv:

Date (UTC),Time (UTC),TimeFromStart (s),PosLat (deg),PosLon (deg),PosAlt (m),Other data
28/01/2015,01:36:44.010,0.000,-36.78730154,158.63387961,19.397,abc def ghi jkl
28/01/2015,01:37:44.010,0.000,-36.78730154,158.63387961,19.397,abc def ghi jkl
28/01/2015,01:38:44.010,0.000,-36.78730154,158.63387961,19.397,abc def ghi jkl
28/01/2014,01:36:44.010,0.000,-36.78730154,158.63387961,19.397,abc def ghi jkl
28/01/2013,01:36:44.010,0.000,-36.78730154,158.63387961,19.397,abc def ghi jkl
28/01/2001,01:36:44.010,0.000,-36.78730154,158.63387961,19.397,abc def ghi jkl
28/02/2014,01:36:44.010,0.000,-36.78730154,158.63387961,19.397,abc def ghi jkl
28/03/2014,01:36:44.010,0.000,-36.78730154,158.63387961,19.397,abc def ghi jkl
28/04/2014,01:36:44.010,0.000,-36.78730154,158.63387961,19.397,abc def ghi jkl
26/01/2014,01:36:44.010,0.000,-36.78730154,158.63387961,19.397,abc def ghi jkl

using this in a prompt shell:

C:\Users\jc\Documents\AutoMAT\tmp>sqlite3 data.sq3
SQLite version 3.8.8.3 2015-02-25 13:29:11
Enter ".help" for usage hints.
sqlite> .mode csv
sqlite> .import testimport.csv
Usage: .import FILE TABLE
sqlite> .import testimport.csv TestData
sqlite> .quit

Now file data.sq3 is created with the data in table TestData.

Go download SQLite Expert personal Edition (free) here http://www.sqliteexpert.com/

Open database data.sq3 and hit the Data tab: you can see the sample data.

Because the Date and Time columns are separate and not in a standard format, I advise you to issue the following commands in a new SQL tab:

alter table testdata add column Timestamp;
update testdata
       set timestamp = 
       substr("Date (UTC)", 7, 4) || '-' ||
       substr("Date (UTC)", 4, 2) || '-' ||
       substr("Date (UTC)", 1, 2) ||
       ' ' ||
       "Time (UTC)";
CREATE INDEX idxTS ON TestData (Timestamp);

Now, within minutes, you're ready to query the database efficiently. If you have a time window like '2014-01-27 10:28:00' +/- 23 hours, you can get the result by querying:

select
      timestamp,
      "poslat (deg)" Lat,
      "poslon (deg)" Lon,
      "posalt (m)" Alt
from testdata
where
     timestamp between datetime('2014-01-27 10:28:00', '-23 hours')
               and datetime('2014-01-27 10:28:00', '+23 hours');

giving you:

Timestamp    Lat    Lon    Alt
2014-01-28 01:36:44.010    -36.78730154    158.63387961    19.397

(I was too lazy to change the coordinates!)

If you actually want the row whose timestamp is closest to a given point, the query is a bit more involved, but it's definitely possible.

There are MANY optimizations left open here; this is just a sketch.
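One workable pattern for the closest-row case (sketched here through Python's built-in sqlite3 module, with made-up sample rows; the UNION trick itself is plain SQLite) is to fetch the nearest candidate on each side of the target, then keep whichever is closer:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE testdata (Timestamp TEXT, Lat REAL, Lon REAL, Alt REAL)")
conn.executemany("INSERT INTO testdata VALUES (?, ?, ?, ?)", [
    ("2015-01-28 01:36:44.010", -36.78730154, 158.63387961, 19.397),
    ("2015-01-28 01:37:44.010", -36.78730154, 158.63387961, 19.397),
    ("2015-01-28 01:38:44.010", -36.78730154, 158.63387961, 19.397),
])
conn.execute("CREATE INDEX idxTS ON testdata (Timestamp)")

target = "2015-01-28 01:37:10.000"
# One candidate at/after the target and one before it (each inner query
# can walk the index), then sort the two by distance in time.
row = conn.execute("""
    SELECT Timestamp, Lat, Lon, Alt FROM (
        SELECT * FROM (SELECT * FROM testdata
                       WHERE Timestamp >= :t ORDER BY Timestamp ASC  LIMIT 1)
        UNION ALL
        SELECT * FROM (SELECT * FROM testdata
                       WHERE Timestamp <  :t ORDER BY Timestamp DESC LIMIT 1)
    )
    ORDER BY abs(julianday(Timestamp) - julianday(:t))
    LIMIT 1
""", {"t": target}).fetchone()
print(row[0])
```

Here the target is 26 seconds after the 01:36:44 row and 34 seconds before the 01:37:44 row, so the earlier row wins.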





A few basic questions:

  1. Is this information used as read only or read mostly? If so, it may be reasonable to pre-process the data into a more easily searchable format whenever the file is updated.
  2. Are the CSV records in chronological order (that is, monotonic with respect to UTC)? If so, you can perform a binary search on the UTC fields.

If the answer to #1 is Yes, then the pre-processing can include sorting the records (or producing a separate sorted index file). If the answer to #1 is No, but #2 is Yes, you can perform a binary search based on the File offset. Each iteration of the search will likely end up in the middle of a record, so some logic to sync up to the next end-of-line will be needed.
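The offset-based binary search with end-of-line resync can be sketched as follows (Python for testability; file contents are hypothetical values in the thread's column layout). Because the times are zero-padded hh:mm:ss.sss strings, they compare correctly as plain text:

```python
import os, tempfile

def next_line(f, offset):
    """Seek to an arbitrary byte offset, then resync to the start of the
    next full line (each probe may land mid-record, as noted above)."""
    f.seek(offset)
    f.readline()                      # discard the partial line we landed in
    return f.readline().decode()

def first_at_or_after(path, target):
    """Binary-search a chronologically sorted CSV by byte offset and return
    the first data line whose 'Time (UTC)' field is >= target.
    Returns '' if the target lies past the last line."""
    lo, hi = 0, os.path.getsize(path)
    with open(path, "rb") as f:
        while lo < hi:
            mid = (lo + hi) // 2
            line = next_line(f, mid)
            if line and line.split(",")[1] < target:
                lo = mid + 1          # everything up to here is too early
            else:
                hi = mid              # this line (or EOF) is at/after target
        return next_line(f, lo).strip()

# Demo on a small generated file: one record every half second.
path = os.path.join(tempfile.mkdtemp(), "flight.csv")
with open(path, "w") as f:
    f.write("Date (UTC),Time (UTC),TimeFromStart (s),PosLat (deg),PosLon (deg),PosAlt (m)\n")
    for i in range(100):
        f.write("28/01/2015,01:36:%06.3f,%.3f,-36.787,158.633,19.397\n" % (i * 0.5, i * 0.5))

hit = first_at_or_after(path, "01:36:07.200")
print(hit)
```

Each probe reads only two short lines, so the whole search touches O(log n) lines instead of 1.8 million; for the truly closest row, compare this hit with the line just before it.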

Hope this helps.


#7 ·  Posted (edited)

Thanks for the suggestions so far. I hoped it would be easier. Here are some more details on my data:

The 167 MB table contains the coordinates of a whole flight day. Within this day I need the mean value of a single flight line. I calculated the mean time from the start and end times of that stripe (saved in another file), and now I need the coordinates at that time. Read only.

Date (UTC),Time (UTC),TimeFromStart (s),PosLat (deg),PosLon (deg),PosAlt (m),

28/01/2015,01:36:44.010,0.000,-36.78730154,158.63387961,19.397
28/01/2015,01:36:44.014,0.004,-36.78730154,158.63387961,19.397

28/01/2015,01:36:44.018,0.008,-36.78730154,158.63387961,19.397

....

As you can see, the changes in time are slight and continuously increasing. The date remains the same. As an example, the given time would be 03:03:23.071.

As this task is only a preparatory piece of the puzzle within a long AutoIt process, I would prefer to solve it within AutoIt.

Edited by revonatu


#8 ·  Posted (edited)

So roughly 25 entries per second? You could read the first line of the file, see how many seconds that is away from your desired time, multiply that value by 25, and that's roughly where your target is. Grab the 50 entries before that and the 50 entries after, then find your time in that small chunk of 100. You could probably expand that out to many thousands and still finish in under a minute.

Edited by boththose



@boththose

Nice idea! :) But to select the 50 lines before and after the desired time, I would need to know their line numbers. Then we come full circle.

Using AutoIt:

For now I opened the CSV file in Notepad and used Find (Ctrl+F) to locate the desired time. Then I selected and copied that line. Now I try to pick my variables out of this line via StringRegExp. Not elegant, but at least it more or less works and doesn't take too long.
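The regex extraction step itself is simple in any language. A sketch (Python; the sample line is hypothetical but follows the thread's column order):

```python
import re

# One record in the thread's column order (values here are invented).
line = "28/01/2015,03:03:23.071,5199.061,-36.78730154,158.63387961,19.397"

# Skip Date, capture Time, skip TimeFromStart, capture Lat/Lon/Alt.
m = re.match(r"[^,]+,([^,]+),[^,]+,([-\d.]+),([-\d.]+),([-\d.]+)", line)
time_utc, lat, lon, alt = m.groups()
print(time_utc, lat, lon, alt)
```

For a clean CSV, a plain split on "," does the same job with less ceremony; the regex only earns its keep when the line needs validating at the same time.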

If someone has more elegant ideas, please share them with me.


#10 ·  Posted (edited)

You do know the line number, at least as a rough guess: it is 25 x $n, where $n is the number of seconds between the time on line 1 and your target time.

With that result in $x, just read lines $x - 50 to $x + 50 and search for your target. And if you have to expand that window because some seconds have more or fewer entries, I don't think you will see an unbearable performance hit until you get into the thousands.
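This estimate-then-window idea can be sketched like so (Python; the 25-entries-per-second rate and the +/-50-line pad are the values from the post, and the demo data is invented):

```python
from datetime import datetime

def to_seconds(hms):
    t = datetime.strptime(hms, "%H:%M:%S.%f")
    return t.hour * 3600 + t.minute * 60 + t.second + t.microsecond / 1e6

def window_search(lines, target, rate=25, pad=50):
    """Guess the target's index from the sample rate, then search only a
    small window around that guess instead of the whole list."""
    first = to_seconds(lines[0].split(",")[1])
    guess = int((to_seconds(target) - first) * rate)
    window = lines[max(0, guess - pad):guess + pad]
    # Within the window, keep the line whose time is closest to the target.
    return min(window,
               key=lambda ln: abs(to_seconds(ln.split(",")[1]) - to_seconds(target)))

# Demo: 1000 hypothetical records at exactly 25 entries per second.
lines = ["28/01/2015,01:00:%06.3f,0.0,-36.787,158.633,19.4" % (i * 0.04)
         for i in range(1000)]
hit = window_search(lines, "01:00:10.000")
print(hit.split(",")[1])
```

If the rate drifts, the window can be widened (or the guess re-anchored from the line actually found at the guessed index) without changing the idea.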

Edited by boththose



#11 ·  Posted (edited)

@boththose

BTW, it has to be 250 x $n, as one second consists of roughly 250 lines.

So far so good, but to get the difference between the two times I thought I would have to reformat them to hh-mm-ss, which costs time again; otherwise only the digits in front of the first ':' would be taken into account for the subtraction, which doesn't help.

Oh, I was wrong. No need to reformat. Fine. :)

Next question: I only know and found the functions FileRead and FileReadLine. Neither works well for reading several lines from the middle of a file. How do I read my $X +/- 1000 lines?

Edited by revonatu


revonatu,

Let me suggest that you revisit jchd's comments in post #5. What he has outlined is an example of what SQLite can do, AND it can all be done inline in your program relatively simply.

If this data is going to be read once, then the use of SQLite might be debatable. However, if you are going to access this data multiple times, then SQLite is the only storage mechanism that makes sense.

kylomas




I only need to read it once in this AutoIt session.


I see. If you can read the file into a string and you know the date/time you are looking for, then an SRE (StringRegExp) solution might be appropriate. (I know, a lot of "if"-ing.)




revonatu,

Just for grins I put together a test using SRE. 

The commented code generates a test file matching the sample data that you posted. Each log entry is incremented by 1 millisecond. I generated 1.8 million entries (114.25 MB).

The code at the end uses a regexp to find the last entry in the test file. This is the run log:

>"C:\Program Files (x86)\AutoIt3\SciTE\AutoIt3Wrapper\AutoIt3Wrapper.exe" /run /prod /ErrorStdOut /in "C:\Users\ADMIN010\Documents\help large csv.au3" /UserParams    
+>02:54:49 Starting AutoIt3Wrapper v.14.801.2025.0 SciTE v.3.4.4.0   Keyboard:00000409  OS:WIN_7/Service Pack 1  CPU:X64 OS:X64    Environment(Language:0409)
+>         SciTEDir => C:\Program Files (x86)\AutoIt3\SciTE   UserDir => C:\Users\ADMIN010\AppData\Local\AutoIt v3\SciTE\AutoIt3Wrapper   SCITE_USERHOME => C:\Users\ADMIN010\AppData\Local\AutoIt v3\SciTE 
>Running AU3Check (3.3.13.19)  from:C:\Program Files (x86)\AutoIt3  input:C:\Users\ADMIN010\Documents\help large csv.au3
+>02:54:49 AU3Check ended.rc:0
>Running:(3.3.12.0):C:\Program Files (x86)\AutoIt3\autoit3.exe "C:\Users\ADMIN010\Documents\help large csv.au3"    
--> Press Ctrl+Alt+Break to Restart or Ctrl+Break to Stop
Time to find entry = 4.77974345543552
Entry = 15/04/2015, 03:15:02.000,0.000,-36.78730154,158.63387961,19.397
+>02:54:54 AutoIt3.exe ended.rc:0
+>02:54:54 AutoIt3Wrapper Finished.
>Exit code: 0    Time: 5.505

The time is in seconds.

The code I used...

; 28/01/2015,01:36:44.010,0.000,-36.78730154,158.63387961,19.397

#cs - code used to generate test file of incrementing milliseconds

#include <array.au3>
#include <date.au3>
#include <file.au3>

local $st = timerinit()
local $a20[1800001]
local $MS = 000, $tail = '0.000,-36.78730154,158.63387961,19.397', $dte = _NowCalc()

for $1 = 0 to 1800000

    if mod($1,1000) = 0 then _add_second()
    $a20[$1] = stringregexpreplace($dte,'(\d\d\d\d)/(\d\d)/(\d\d)(.*)','$3/$2/$1,$4') & '.' & stringformat('%03i',$MS) & ',' & $tail
    $MS += 1

Next

ConsoleWrite('Test array generated in ' & timerdiff($st)/1000 & ' seconds' & @CRLF)

$st = timerinit()

_FileWriteFromArray(@scriptdir & '\testcsv.csv',$a20)

ConsoleWrite('Test file  generated in ' & timerdiff($st)/1000 & ' seconds' & @CRLF)

func _add_second()

    $MS = 000
    $dte = _dateadd('s',1,$dte)

endfunc

#ce

local $TimeToFind = '15/04/2015, 03:15:02.000'

local $st = timerinit()

local $aRET = stringregexp(fileread(@scriptdir & '\testcsv.csv'),$TimeToFind & '.*',3)

ConsoleWrite('Time to find entry = ' & timerdiff($st)/1000 & @LF & 'Entry = ' & $aRET[0] & @CRLF)

kylomas




#16 ·  Posted (edited)

With such a big CSV file (167 MB), I recommend reading only part of the file using FileOpen + FileSetPos + FileRead, as in this example (the example reads characters from the end of a file).

In your case, determine the approximate position of your time in the file, based on the approximate length of each line and the number of expected lines.

Then read a few KB before and after that point into a variable. You can then do a simple StringSplit() by @CRLF and go through each line in a For/Next loop; because there will be only a "few" lines, speed will no longer be a big problem.

So the idea is to avoid reading (and parsing) the WHOLE file into a variable.

EDIT:

Here is an example of my idea:

MsgBox(0, '10KB from position 3MB', _FileReadMyPos('test.csv', 3000000))

Func _FileReadMyPos($FileName, $pos, $count = 10000) ; read 10 KB by default
    $hFile = FileOpen($FileName, 0) ; open for reading
    FileSetPos($hFile, $pos, 0) ; offset measured from the beginning of the file
    $sOut = FileRead($hFile, $count)
    FileClose($hFile)
    Return $sOut
EndFunc

Edited by Zedna


#17 ·  Posted (edited)

So you can read a few lines from the beginning of the file and a few lines from the end to analyze the starting/ending date/time and the average line length.

After this you can read (and parse) a small part of the file from the desired position, as I wrote in my previous post.
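The head-and-tail calibration can be sketched as follows (Python for testability; the helper names and the generated demo file are assumptions, not code from the thread). It reads a few KB from each end, interpolates linearly between the first and last timestamps, then probes near the estimate:

```python
import os, tempfile
from datetime import datetime

def to_s(hms):
    t = datetime.strptime(hms, "%H:%M:%S.%f")
    return t.hour * 3600 + t.minute * 60 + t.second + t.microsecond / 1e6

def estimate_offset(path, target_s, probe=4096):
    """Estimate the byte offset of a target time (seconds of day) by reading
    only the head and tail of the file and interpolating linearly between
    the first and last timestamps."""
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        head = f.read(probe).decode().splitlines()
        f.seek(max(0, size - probe))
        tail = f.read(probe).decode().splitlines()
    first = to_s(head[1].split(",")[1])   # head[0] is the header row
    last = to_s(tail[-1].split(",")[1])   # last complete line of the file
    return int(size * (target_s - first) / (last - first))

# Build a 1000-line sample file at one record per second (invented values).
path = os.path.join(tempfile.mkdtemp(), "test.csv")
with open(path, "w") as f:
    f.write("Date (UTC),Time (UTC),TimeFromStart (s),PosLat (deg),PosLon (deg),PosAlt (m)\n")
    for i in range(1000):
        mm, ss = divmod(i, 60)
        f.write("28/01/2015,01:%02d:%02d.000,%d.0,-36.787,158.633,19.4\n" % (mm, ss, i))

est = estimate_offset(path, to_s("01:08:20.000"))
with open(path, "rb") as f:        # read a small chunk around the estimate
    f.seek(est)
    f.readline()                   # resync to the next full line
    hit = f.readline().decode().split(",")[1]
print(hit)
```

The estimate lands within a line or two of the target here because the lines are nearly equal in length; with more variable lines, read a bigger window around the estimate, as Zedna suggests.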

Edited by Zedna


#18 ·  Posted (edited)

I reduced the resolution to one line per second, which is enough for my needs, and now I can read the file into an array and pick my variables.

Here is my working code (bloated by error checks):

If Not _FileReadToArray($posPath, $Pos, 4, " ") Then
    MsgBox(16, "Error", "Failed to read file. Error: " & @error)
    Exit
EndIf

Local $LineX = _ArrayBinarySearch($Pos, Round($MeanTime, 0), 0)
If Not @error Then
    ;MsgBox(64, "Found", "LineX=    " & $LineX, 2) ; e.g. row 5200
Else
    MsgBox(16, "Error", "Error: " & @error)
    Exit
EndIf

Local $ArrayX = _ArrayExtract($Pos, $LineX, $LineX) ; extract the desired row to a new array
If Not @error Then
    ;_ArrayDisplay($ArrayX, "ArrayX", "", 96, Default, "TimeOfDay (UTC)|PosLat (deg)|PosLon (deg)|PosHeight (m)|AngleRoll (deg)|AnglePitch (deg)|Heading (deg)", 2)
Else
    MsgBox(16, "Error", "Error: " & @error)
    Exit
EndIf

$Lat  = _ArrayUnique($ArrayX, 1) ; $ArrayX[1]
If Not @error Then
    ;_ArrayDisplay($Lat)
Else
    MsgBox(16, "Error", "Failed to extract. Error: " & @error)
EndIf
$Lon  = _ArrayUnique($ArrayX, 2) ; $ArrayX[2]
;$Elev = _ArrayUnique($ArrayX, x) ; $ArrayX[x]
$Alt  = _ArrayUnique($ArrayX, 3) ; $ArrayX[3]
$Head = _ArrayUnique($ArrayX, 6) ; $ArrayX[6]
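For reference, the final approach (read into an array, binary-search the time column, pull out the fields) maps onto other languages the same way. A Python sketch with hypothetical sample rows:

```python
import bisect

# One row per second after reducing the resolution; values are invented.
# Columns mirror the AutoIt array: [TimeOfDay_s, Lat, Lon, Alt, Heading]
rows = [
    [11003, -36.78730154, 158.63387961, 19.397, 87.3],
    [11004, -36.78730200, 158.63388000, 19.401, 87.4],
    [11005, -36.78730250, 158.63388040, 19.405, 87.5],
]

mean_time = 11004.4
times = [r[0] for r in rows]
i = bisect.bisect_left(times, round(mean_time))  # plays the _ArrayBinarySearch role
lat, lon, alt, heading = rows[i][1:5]
print(lat, lon, alt, heading)
```

At one row per second, rounding the mean time and binary-searching is both fast and accurate enough, which is exactly the trade the final post makes.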

Thanks for your help.

Edited by revonatu

