manipulate large csv-file

dejhost · January 15, 2021

Hello!
I am dealing with image annotations for machine-learning purposes. My input-file is a .csv-file, which I need to edit. So output should still be .csv, but the little content that is actually required from the original file should be presented in a different structure. That means, that a lot of content from the original file can be deleted.

My challange is that the input file is quite large: Right now, I am just working with a snipped, consisting of a 8 rows (that's 1 image per row) and 822 columns. The final file will consist of more than 20.000 lines and possibly 2-3000 columns. Even a file with 500 rows is taking a lot of time and I fear that the final file will just fail.

I checked the time-consumption after each manipulation and wrote it as a comment behind the MsgBoxes (e.g. line 21).

Could you please take a look at my code below and the screenshots as well as a small input-file attached?

In the orig. file, 1 column lists 1 image, followed by all annotations. In line 64-78 in the code below, I change the format: After this loop, 1 coloumn contains only one annotation. Do you have an idea on how to speed this up?
Deleting columns (line 125 - 131) fails when $i = 514. I do not understand why.
For some reasons, the array seems to remain quite large, even after I deleted many columns. See the screenshot 03 attached. Why is that?

I'd appreciate any help to mature the code. Thank you very much.

#include <Array.au3>
#include <File.au3>
#include <MsgBoxConstants.au3>
#include <String.au3>
;~ Start timer
Local $hTimer = TimerInit() ; Begin the timer and store the handle in a variable.


Local $sFilePath = "C:\Users\Jost\Documents\RnD\rePho\Readcsv\result_short.csv"
Local $aCSV
Local $i, $j = 0, $cols, $rows
Global $sSingleFill

;Read the csv-file to Array and check timer
_FileReadToArray($sFilePath, $aCSV, Default,',')
$rows = $aCSV[0][0]
$cols = $aCSV[0][1]
consolewrite("$rows = " & $rows & "    $cols = " & $cols & @CRLF)
; Timer
Local $fDiff = TimerDiff($hTimer)
MsgBox($MB_SYSTEMMODAL, "21 Time Difference ", $fDiff)      ;14 ms
$hTimer = TimerInit()


;~ Remove the first 2 columns of the array since they do not contain any valuable information
_ArrayColDelete($aCSV,0)
_ArrayColDelete($aCSV,0)

$fDiff = TimerDiff($hTimer)
MsgBox($MB_SYSTEMMODAL, "30 Time Difference", $fDiff)       ;28 ms
$hTimer = TimerInit()

; update array dimensions
$aCSV[0][0] = $aCSV[0][0]-2
$rows = $aCSV[0][0]
consolewrite("$rows = " & $rows & "    $cols = " & $cols & @CRLF)

;~ Remove columns with unwanted information (repetative)
While $j < $cols - 6
    _ArrayColDelete($aCSV,$j + 1)
    _ArrayColDelete($aCSV,$j + 1)
    _ArrayColDelete($aCSV,$j + 1)
    _ArrayColDelete($aCSV,$j + 2)
    _ArrayColDelete($aCSV,$j + 6)
    $cols = $cols - 5
    $j = $j + 5
Wend

$fDiff = TimerDiff($hTimer)
MsgBox($MB_SYSTEMMODAL, "50 Time Difference", $fDiff)       ;2.8 s
$hTimer = TimerInit()

; update array dimensions
$aCSV[0][0] = Ubound ($aCSV, 1)
$aCSV[0][1] = Ubound ($aCSV, 2)
$rows = Ubound ($aCSV, 1)
$cols = Ubound ($aCSV, 2)
consolewrite("$rows = " & $rows & "    $cols = " & $cols & @CRLF)

;Each row is supposed to descibe only 1 annotation. But right now it contains all annotations, which leads to many columns. So:
;duplicate the field containing the image name ($aCSV[$i][0]), and create a new row for each annotation (5 coloumns).


For $i = 2 to $aCSV[0][0]-1                                         ; circle through all rows in array
    $j = 6                                                          ; in column 6, the second annotation starts
    For $j = 6 To $aCSV[0][1]-1 Step 5                              ; circle through all cols...
        If $aCSV[$i][$j] <> "" Then                                 ; as long as there are annotations.
            $sSingleFill = $aCSV[$i][0]                             ; pick the image name
            For $k = 0 To 4                                         ; add the next 4 coloumns
                $sSingleFill = $sSingleFill & "|" & $aCSV[$i][$j+$k]
                $aCSV[$i][$j+$k] = ""
            NEXT
        EndIF
        _ArrayAdd($aCSV, $sSingleFill)                              ;and shift them to a new row.
        $sSingleFill = ""
        $aCSV[0][0] = $aCSV[0][0] + 1                               ;update field containing number of rows
    Next
Next

$fDiff = TimerDiff($hTimer)
MsgBox($MB_SYSTEMMODAL, "81 Time Difference", $fDiff)       ;20.3 s


; update array dimensions
$aCSV[0][0] = Ubound ($aCSV, 1)
$aCSV[0][1] = Ubound ($aCSV, 2)
$rows = Ubound ($aCSV, 1)
$cols = Ubound ($aCSV, 2)
consolewrite("$rows = " & $rows & "    $cols = " & $cols & @CRLF)

_ArrayDisplay($aCSV, "finnished transformation line 91", "5")
$hTimer = TimerInit()

;~ Clean up the array
_ArrayDelete($aCSV, 1)                                              ;Delete the row containing the header
$aCSV[0][0] = $aCSV[0][0] - 1                                       ;Update field containing number of rows
$fDiff = TimerDiff($hTimer)
MsgBox($MB_SYSTEMMODAL, "98 Time Difference", $fDiff)       ;0.4 s

_ArrayDisplay($aCSV, "deleted header line 100", "5")

; update array dimensions
$aCSV[0][0] = Ubound ($aCSV, 1)
$aCSV[0][1] = Ubound ($aCSV, 2)
$hTimer = TimerInit()

;Delete empty columns
For $i = 7 to $aCSV[0][1]
    _ArrayColDelete($aCSV, $i)
    $aCSV[0][1] = $aCSV[0][1] - 1
Next

$fDiff = TimerDiff($hTimer)
MsgBox($MB_SYSTEMMODAL, "114 Time Difference", $fDiff)          ; 45 s

_ArrayDisplay($aCSV, "finnished Delete empty columns line 117")
$hTimer = TimerInit()

; update array dimensions
$aCSV[0][0] = Ubound ($aCSV, 1)
$aCSV[0][1] = Ubound ($aCSV, 2)

;~ _DeleteEmptyRows($aCSV)

For $i = 1 to Ubound ($aCSV, 1)-1
    ConsoleWrite ("$i = " & $i & @CRLF)
    If $aCSV[$i][1] = "" Then
        _ArrayDelete($aCSV, $i)
        $aCSV[0][0] = $aCSV[0][0]-1
    EndIf
Next

$aCSV[0][0] = Ubound ($aCSV, 1)
$aCSV[0][1] = Ubound ($aCSV, 2)

$fDiff = TimerDiff($hTimer)
MsgBox($MB_SYSTEMMODAL, "137 Time Difference", $fDiff)              ;0.3s
_ArrayDisplay($aCSV, "Array cleaned up line 136")
$hTimer = TimerInit()

;~ Extract image-name from field $aCSV[$i][0]
For $i = 1 to $aCSV[0][0]-1                                         ;Circle through all rows

    $string = $aCSV[$i][0]                                          ;Pick field containtn image name

    Local $rpos = StringInStr($string, "?")                         ;Search for delimiter
        $string = StringLeft ($string, $rpos-1)

    Local $lpos = StringInStr($string, "-", 0 , -1)                 ;Search for delimiter
        $string = StringTrimLeft ($string, $lpos )
    consolewrite("$string = " & $string & @CRLF)
    $aCSV[$i][0] = $string                                          ;Insert image-name into $aCSV[$i][0] (overwrite)
Next

$fDiff = TimerDiff($hTimer)
MsgBox($MB_SYSTEMMODAL, "154 Time Difference", $fDiff)              ; 10 s

_ArrayDisplay($aCSV, "All done line 156")
Exit



Func _DeleteEmptyRows($aCSV)
    Local $Rows = Ubound($aCSV,1)
    Local $Cols = Ubound($aCSV,2)
    Local $aTemp[$Rows][$Cols]
    Local $not_empty
    Local $Count = 0

    ;Loop through rows
    For $Y = 0 to $Rows - 1
        $not_empty = 0

        ;Loop through columns
        For $X = 0 to $Cols - 1

            ;Copy all columns to temp array even if they are all empty
            $aTemp[$Count][$X] = $aCSV[$Y][$X]

            ;If even one column contains data, make sure it doesn't get deleted
            If $aCSV[$Y][$X] <> "" Then $not_empty = BitOr($not_empty, 1)
        Next

        ;If the row has any data, increment, else keep overwriting last row until it contains something
        If $not_empty Then $Count += 1
    Next

    Redim $aTemp[$Count][$Cols]
    Return $aTemp
EndFunc

result_short.csv

FrancescoDiMuro · January 15, 2021

@dejhost
I think that it could be more easily for us to know what's your goals instead of trying to understand your script.
Just make a list of "to do's" and we'll try our best to help you

Nine · January 15, 2021

A few general comments :

1- Using _Array* repeatedly to add or delete 1 row/col each time is terribly slow with a large array, consider using ReDim instead

2- Multiple steps (like you did) simplify the programmer conceptional effort but generate extra machine execution work, consider reducing it to a single step (maybe 2 at most)

3- On Win10, array manipulation is quite slow with the current AutoIt version, consider using beta version

orbs · January 15, 2021

i fully agree with the two posts above, except that ReDim is also a poor choice for many successive operations.

i'd go about it this way: read the input file line by line, and for each line in turn:

1) split the columns into a 1-D array of values.

2) build the target row, as a string, from the desired values.

3) write the new line into the target file.

this way you avoid loading the RAM too much, and you do not need to perform array manipulation at all.

Edited January 15, 2021 by orbs

Nine · January 15, 2021

1 minute ago, orbs said:

except that ReDim is also a poor choice for many successive operations

I was not clear enough, instead of deleting last X columns with _ArrayColDelete one at a time and deleting last Y rows with _ArrayDelete one at a time, you can ReDim the array in a single statement step (which is quite fast). For example :

#include <Constants.au3>
#include <Array.au3>

Local $arr[200][1000]

For $i = 0 to UBound($arr, 1) - 1
  For $j = 0 to UBound($arr, 2) - 1
    $arr[$i][$j] = $i & "-" & $j
  Next
Next

; _ArrayDisplay($arr)

Local $hTime = TimerInit()
ReDim $arr[10][10]
MsgBox ($MB_SYSTEMMODAL,"",TimerDiff($hTime))

_ArrayDisplay($arr)

dejhost · January 15, 2021

7 hours ago, FrancescoDiMuro said:

@dejhost
I think that it could be more easily for us to know what's your goals instead of trying to understand your script.
Just make a list of "to do's" and we'll try our best to help you

Good point, thank you.

Since I received quite valuable comments already, let me improve my skript and come back to you afterwards concerning a to-do list.

However: this is the goal: a .csv, looking like this:

image.png.7aa6b06980f8425e1b05744bd615cc75.png

---

Thanks everybody: I had no idea that the array-operations are relatively slow. Reading and processing one line at the time makes sense to me.

FrancescoDiMuro · January 16, 2021

@dejhost
As I was saying, knowing the final result is a must in order to build a script that works for your needs, but this may help you to continue your development:

#include <Array.au3>
#include <File.au3>
#include <StringConstants.au3>

Test()

Func Test()

    Local $strFileName = @ScriptDir & "\result_short.csv", _
          $arrFileContent, _
          $arrImages[1][8] = [["Filename", "Width", "Height", "X_Min", "Y_Min", "X_Max", "Y_Max"]], _
          $arrImageFileName, _
          $hTimer

    ; Start the timer
    $hTimer = TimerInit()

    _FileReadToArray($strFileName, $arrFileContent, $FRTA_COUNT, ",")
    If @error Then Return ConsoleWrite("_FileReadToArray ERR: " & @error & @CRLF)

    ; Loop through the array content
    For $i = 1 To $arrFileContent[0][0] Step 1

        ; If the image has no width or height, then just continue the loop
        If ($arrFileContent[$i][10] = "") Or ($arrFileContent[$i][11] = "") Then ContinueLoop

        ; Obtaining the filename of the image
        $arrImageFileName = StringRegExp($arrFileContent[$i][2], '(?<=%)([^.]+\.jpg)', $STR_REGEXPARRAYMATCH)

        ; If the name of the image has been obtained correctly
        If IsArray($arrImageFileName) Then

            ; Adding the elements to the Images array
            _ArrayAdd($arrImages, $arrImageFileName[0] & "|" & _
                                  $arrFileContent[$i][11] & "|" & _
                                  $arrFileContent[$i][10] & "|" & _
                                  "|" & _
                                  "|" & _
                                  "|" & _
                                  "|" & _
                                  "")
            If @error Then Return ConsoleWrite("_ArrayAdd ERR: " & @error & @CRLF)

        EndIf

    Next

    ; Time for the script execution
    ConsoleWrite("Script executed in " & TimerDiff($hTimer)/1000 & " sec." & @CRLF)

    ; Display the Images array
    _ArrayDisplay($arrImages)

EndFunc

And here's the execution time of the script:

Script executed in 0.0331623 sec.

Edited January 16, 2021 by FrancescoDiMuro

junkew · January 18, 2021

Maybe first read it into an sqlite database and then query it first to an easier to understand 13 column table

Another example with first transforming it in powershell (same you could do with SQLite database)

Powershell is powerfull in dealing with csv files and querying them directly with named columns instead of numbers
Your testfile misses 1 column at the end
You have to create the calculations yourself.
Below outputs it into file transformed.csv which you then can deal with from Autoit

clear-host
<#
ID                           : ckj9snx1r00003a699vfsn650
DataRow ID                   : ckj9odjg47bie0rh60mq7edin
Labeled Data                 : https://storage.labelbox.com/ckj9o1nvf6pqk0716vn24noz7%2Fa7fdb985-6331-2931-da04-ad7fa0e1
                               48c1-n101_0058.jpg?Expires=1611474714883&KeyName=labelbox-assets-key-1&Signature=uviaZ-Av
                               HVLSuDCabq3Fw9CU13E
Label/objects/0/featureId    : ckj9qsu9707br0yaf04i5bzxd
Label/objects/0/schemaId     : ckj9q1xtm00ko0yae812f3asg
Label/objects/0/title        : Ocluded Ucoded Target
Label/objects/0/value        : ocluded_ucoded_target
Label/objects/0/color        : #008941
Label/objects/0/bbox/top     : 803
Label/objects/0/bbox/left    : 0
Label/objects/0/bbox/height  : 24
Label/objects/0/bbox/width   : 20
Label/objects/0/instanceURI  : https://api.labelbox.com/masks/feature/ckj9qsu9707br0yaf04i5bzxd?token=eyJhbGciOiJIUzI1Ni
                               IsInR5cCI6IkpXVCJ9.eyJ1c2VySWQiOiJja2o5cTNjeTg3anRvMDcxNmU2Zzd0ejEzIiwib3JnYW5pemF0aW9uSW
                               QiOiJja2o5bzFudmY2cHFrMDcxNnZuMjRub3o3IiwiaWF0IjoxNjEwMjY1MTEzLCJleHAiOjE2MTI4NTcxMTN9.nI
                               Ov1e49Ci1etM4YxN57hujTKB0kQpueX26YULxFrsI
...
Label/objects/81/bbox/width
#>
$rawData = Import-Csv -Path "result_short.csv"
$rowcount=$rawdata.count
$columnCount=$rawdata[0].psobject.properties.name.count
#10 is default number of columns
$repeatNNTimes = ($columncount-2) / 10


#0-81 repeats for all blocks
for ($i=0; $i -le $repeatNNTimes-1; $i++) {
    $prefix="Label/objects/$i"
    $global:rowId=0
    $transformedData=$rawdata | select-object @{name="rowid" ; expression={$global:rowId=$global:rowId+1; return $global:rowId}},
                         @{name="colid" ; expression={return $i}}, 
                         @{name="imgname"; expression={$_."Labeled Data"-match "[^-]*.jpg" | out-null; return $matches[0]}}, 
                         @{name="width"  ; expression={$_."$prefix/bbox/width"}},
                         @{name="height" ; expression={$_."$prefix/bbox/height"}},
                         @{name="class"  ; expression={"tag"}},
                         @{name="xmin"   ; expression={"xmin"}},
                         @{name="ymin"   ; expression={"ymin"}},
                         @{name="xmax"   ; expression={"xmax"}},
                         @{name="ymax"   ; expression={"ymax"}},
                         @{name="instanceURI"   ; expression={$_."$prefix/instanceURI"}} | where {$_.instanceURI -ne ""}
                         
  if ($i -eq 0) {($transformedData | select -First 1 | convertto-csv -NoTypeInformation)[0]  | out-file transformed.csv -encoding utf8 -force }
  $transformedData | convertto-csv -NoTypeInformation | Select-Object -Skip 1 | out-file transformed.csv -Append -encoding utf8
  #$transformedData | format-table -AutoSize -HideTableHeaders 
    }

dejhost · January 20, 2021

Thanks @everybody here for your input.

I re-wrote my script: It

reads 1 line at the time,
builds the output-line in form of an array and finally
appends the line to an output file.

It's not superfast (I'm not exactly what you would call an expert on Regexp or SQL), but it's acceptable.

Processing the input-file, the script bumps into troubles after 120 annotations of success. Reasons: The csv-input-file breakes its rutines (in column 1130 or whatever). Fixing this manually just leads to more troubles a little later. So either the csv-input-file was inconsistent when I revceived it, or the conversion I performed (prior reading it with autoit) screwed it up.

However: it seems to me that working with the original files is the better way. Disadvantage is, that they are pretty hard to read for me. Please take a look at them (identical content, diffrerent format). Could you please advise me what would be a good approach? Should I work with the csv or json?

Thank you very much for giving me a hand.

csv.zip json.zip

dejhost · January 20, 2021

Below I add the strucutre of the output-file as it is supposed to be (the first few Annotations). The values in the second and third columns are retrieved from another source.

Filename	Width	Height	Class	X_Min	Y_Min	X_Max	Y_Max
n101_0058.jpg	2848	4288	ocluded_ucoded_target	803	0	24	20
n101_0058.jpg	2848	4288	ocluded_coded_target	845	0	57	53
n101_0058.jpg	2848	4288	uncoded_target	1007	66	32	38
n101_0058.jpg	2848	4288	coded_target	1051	72	58	65
n101_0058.jpg	2848	4288	uncoded_target	1213	145	30	38
n101_0058.jpg	2848	4288	coded_target	1257	155	56	62
n101_0058.jpg	2848	4288	uncoded_target	1417	225	32	37
n101_0058.jpg	2848	4288	ocluded_coded_target	645	0	48	26
n101_0058.jpg	2848	4288	uncoded_target	795	48	31	33
n101_0058.jpg	2848	4288	coded_target	844	63	48	46
n101_0058.jpg	2848	4288	uncoded_target	999	130	32	32
n101_0058.jpg	2848	4288	uncoded_target	1205	212	30	29
n101_0058.jpg	2848	4288	coded_target	1251	226	52	44

Edited January 20, 2021 by dejhost
text in table had white color...

Nine · January 20, 2021

I am supposing that the json file is the original and the csv is the result of the conversion of the json, right ?

Each line in the json file is an annotation as you are calling it ?

dejhost · January 20, 2021

Both attached files are exports from a software called label-box: https://labelbox.com

Quote

Each line in the json file is an annotation as you are calling it ?

If you look at the json-file in notepad++ (See screenshot below), each line contains many annotations. Starting with column 3 ("Labeled Data"), every 10th column starts a new annotation. So column 3, 13, 23, 33.... is called ("Labeled Data") and contains the image name (the image name ist hidden in the link). But using this numbering failed in coloumn 11200 (or whatever). However: each line describes one image (so about 50 images):

---

If you look at the json in a webbrowser (See screenshot below)... well: the first "0" relates to the first image. The second "0" (below "objects") is the first annotation. The values I am looking for are red encircled, and are part of the line in the table posted an hour ago:

Nine · January 20, 2021

Could you test that, seems to be working :

#include <Constants.au3>
#include <Array.au3>
#include <File.au3>

Local $aFile = FileReadToArray("export-2021-01-10T07_51_53.836Z.json")
Local $hFile = FileOpen("Output.txt", $FO_OVERWRITE)
Local $sContent, $sLine, $aContent

Local $hTime = TimerInit()
For $i = 1 to UBound($aFile)-1
  $sContent = StringRegExp(StringRegExp($aFile[$i],'"Labeled Data":"([^"]+)',1)[0],'(?i)([^-]*?\.jpg)',1)[0]
  ConsoleWrite ($sContent & @CRLF)
  $aContent = StringRegExp($aFile[$i],'"value":"([^"]+).+?:\{"top":([^,]+),"left":([^,]+),"height":([^,]+),"width":([^}]+)', 3)
  $sLine = ""
  For $j = 0 To UBound($aContent) - 1 Step 5
    $sLine &= $sContent
    For $k = $j To $j+4
      $sLine &= @TAB & $aContent[$k]
    Next
    $sLine &= @CRLF
  Next
  FileWrite($hFile, $sLine)
Next
FileClose($hFile)
MsgBox ($MB_SYSTEMMODAL,"",Round(TimerDiff($hTime)/1000,2))

I optimized the code. Went from ~5secs to ~0.6sec

Edited January 20, 2021 by Nine

junkew · January 20, 2021

As said by @FrancescoDiMuro it helps to get the overall target picture clear first.

If it starts with JSON you should manipulate that as the earlier in the process you can structure your data the faster your full process will be.

Intall Notepad++ with plugin JSON viewer. Browser is fine but more specific JSON viewer tools will help sometimes
I do not know a good JSON library in Autoit that can easily flatten your filebut maybe others can give input for that
See below how you can easily get your array from the JSON in powershell directly as csv
- you have to do the calculations yourself

$jsonFile= "c:\temp\export-2021-01-10T07_51_53.836Z.json"

$rawData=Get-Content -Raw -Path $jsonFile | ConvertFrom-Json

$result=foreach($row in $rawdata) {
    foreach($o in $row.label.objects) {
       # write-host $row.'Labeled Data' $o.featureid $o.schemaid $ o.title $o.value $o.color $o.bbox.top $o.bbox.left $o.bbox.height $o.bbox.width
       # write-host $row."External ID" $o.featureid $o.schemaid $o.title $o.value $o.color $o.bbox.top $o.bbox.left $o.bbox.height $o.bbox.width
       [pscustomobject]@{ 
          id       = $row."External ID" ;
          featurid = $o.featureid ;
          schemaid = $o.schemaid ;
          title    = $o.title ;
          value    = $o.value ;
          color    = $o.color ;
          top      = $o.bbox.top ;
          left     = $o.bbox.left ;
          height   = $o.bbox.height ;
          width    = $o.bbox.width
       }
    }
}

$result | convertto-csv -NoTypeInformation | out-file result.csv -encoding utf8

seadoggie01 · January 20, 2021

I'd second junkew's suggestion to get the data from JSON. This UDF is used for the Danp2's WebDriver UDF, so I think it's pretty stable:

Here's a brief example:
(Note: your JSON file is huge, it may take a long time to load into AutoIT. I gave up after 20 seconds, but I use a relic)

#include <json.au3>

Func Main()
    ; ... 
    Local $sFileContents = FileRead("export-2021-01-10T07_51_53.836Z.json")
    ; Load the JSON into AutoIt
    Local $oJson = Json_Decode($sFileContents)
    If @error Then Exit ConsoleWrite("Json_Decode Error: " & @error & @CRLF)

    Local $sValue
    For $i=0 To 5
        ; Get the value of the ID at item $i
        $sValue = Json_Get($oJson, "[" & $i & "][ID]")
        If @error Then Exit ConsoleWrite("Json_Get[" & $i & "] Error: " & @error & @CRLF)
        ConsoleWrite("ID: " & $sValue & @CRLF)
    Next
    ; ... 
    
EndFunc

TheXman · January 21, 2021

Here's an example that uses the jq UDF to generate CSV data from the JSON file attached in one of the previous posts. It generates the 52,405 CSV records in less than 2 seconds on my laptop. The CSV contains the data items that were circled in red in the post above. If you want TSV data instead of CSV data, just change the @csv to @tsv in $JQ_FILTER.

#cs
    This example uses the jq UDF.

    https://www.autoitscript.com/forum/files/file/502-jq-udf-a-powerful-flexible-json-processor/
#ce

#include <Constants.au3>
#include "jq.au3"       ;<== Modify as needed

_jqInit("jq-win64.exe") ;<== Modify as needed
If @error Then Exit MsgBox($MB_ICONERROR + $MB_TOPMOST, "ERROR", StringFormat("ERROR: Unable to initialize jq - @error = %s", @error))

example()

Func example()
    Const $JSON_FILE  = "export-2021-01-10T07_51_53.836Z.json" ;<== Modify as needed
    Const $JQ_FILTER  = '.[] | ' & _
                        '{file: ."External ID", type: .Label.objects[]?} | ' & _
                        '{file, value: .type.value, top: .type.bbox.top, left: .type.bbox.left, height: .type.bbox.height, width: .type.bbox.width} | ' & _
                        '[.[]] | @csv'

    Local $hTimer     = TimerInit()
    Local $sCmdOutput = _jqExecFile($JSON_FILE, $JQ_FILTER)
    Local $iTimerDiff = TimerDiff($hTimer)

    write_log_line(@CRLF & "================================")
    write_log_line($sCmdOutput)
    write_log_line(@CRLF & "Duration: " & StringFormat("%.3f seconds", $iTimerDiff / 1000))
EndFunc

Func write_log_line($sMsg = "")
    Const $TITLE_NOTEPAD = "[RegExpTitle:(?i)untitled - notepad]"
    Static $hWndNotepad = -1

    ;If we don't have a handle to notepad yet
    If $hWndNotepad = -1 Then
        ;If there isn't an existing instance of notepad running, launch one
        If Not WinExists($TITLE_NOTEPAD) Then Run("Notepad.exe")

        ;Get handle to notepad window
        $hWndNotepad = WinWait($TITLE_NOTEPAD, "", 3)
        If Not $hWndNotepad Then Exit MsgBox($MB_ICONERROR, "ERROR", "Unable to find Notepad window.")
    EndIf

    ;Paste msg to notepad text
    If WinExists($hWndNotepad) Then
        ControlCommand($hWndNotepad, "", "Edit1", "EditPaste", $sMsg & @CRLF)
    EndIf
EndFunc

Snippet of the output:

"n101_0058.jpg","ocluded_ucoded_target",803,0,24,20
"n101_0058.jpg","ocluded_coded_target",845,0,57,53
"n101_0058.jpg","uncoded_target",1007,66,32,38
"n101_0058.jpg","coded_target",1051,72,58,65
"n101_0058.jpg","uncoded_target",1213,145,30,38
.
.
.
"n101_0252.jpg","uncoded_target",2377,4194,40,54
"n101_0252.jpg","uncoded_target",2651,3126,40,48
"n101_0252.jpg","uncoded_target",2013,3421,31,47
"n101_0252.jpg","coded_target",2054,3224,54,75
"n101_0252.jpg","uncoded_target",2084,3170,33,42

Duration: 1.279 seconds

For anyone reading this later who's interested in understanding the jq filter that was used, here's the filter with comments. FYI, this filter can also be fed to the jqExec* functions as a file, instead of as a string literal, using the -f option. The "|" symbol basically tells the processor to feed the result of the previous part to the next part.

# For each top-level array entry
.[] | 

# Reduce to only interesting json objects/data,
# creating an object for each .Label.objects' array
# object
{
  file: ."External ID", 
  label: .Label.objects[]?
} | 

# Reduce result to final set of objects with data of interest
{
  file,
  value:  .label.value,
  top:    .label.bbox.top,
  left:   .label.bbox.left,
  height: .label.bbox.height,
  width:  .label.bbox.width
} |

# Convert result to an array for input to @csv
[.[]] |

# Convert array to CSV
@csv

Edited January 27, 2021 by TheXman
Added commented filter for those that may be interested

dejhost · January 21, 2021

I feel a bit overwhelemed by your enthusiasm! Thanks everybody!

@Nine: Works like a charm! Damn fast as well. I realised by now that the image-name is also listed as "External ID" (easy to see, once you kow about the json plugin for notepad++ ;-)) . So I retreive the image name directly, and skip the line

$sContent = StringRegExp(StringRegExp($aFile[$i],'"Labeled Data":"([^"]+)',1)[0],'(?i)([^-]*?\.jpg)',1)[0]

This brings it down to 0,5s. 👍

One comment/question remains: As mentioned, the input file is only a trial, dealing with 500 images. The final file will deal with more than 20.000 images, so 40 times larger. So I might go back to reading one line at the time: using "FileReadLine". Does that make sense?

@junkew: I agree on your "target picture"- comment. But it wasn't so easy to understand it myself: Only 1-2 days ago, I realised that one can look at json-files in a structured way (web browser) - giving a much better understanding of the file structure. The notepad-pluggin sure comes in handy.

Haven't used powershell yet, but reading about it on the web... possibly an even better tool than autoit (for this task). I also found https://pandas.pydata.org . But I will stick to autoit for now, since the task is basically done.

@seadoggie01: I too started using the json.au3 yesterday evening, but it seemed rather slow, yes.

@TheXman: Nice! Very fast, too!

Once again: Thank you everybody! Your skrips make it look so easy.... 😉

Nine · January 21, 2021

4 hours ago, dejhost said:

@Nine: Works like a charm! Damn fast as well.

Thank you, by writing once at every line of input makes it way faster...

4 hours ago, dejhost said:

As mentioned, the input file is only a trial, dealing with 500 images. The final file will deal with more than 20.000 images, so 40 times larger.

At the moment 500 images takes about 27.5 Mb, so 40 times brings it to 1.1Gb. Tad too big in my opinion to use FileReadToArray, it may slow down the whole process. I would tend to agree with you, that reading the json file line by line may be a better option. But nothing stops you to try both methods...Let us know how it goes.

Edited January 21, 2021 by Nine

TheXman · January 21, 2021

9 hours ago, dejhost said:

@TheXman: Nice! Very fast, too!

Thanks! I updated my previous post with a commented description of what the jq filter does just in case anyone else, that comes across this topic, may be interested in jq and how it can be used to process JSON.

Edited January 21, 2021 by TheXman

TheXman · January 22, 2021

@dejhost

FYI,

Because you're dealing with such large datasets, it is easy to miss some of the details & data if you aren't careful. For example, in the sample JSON file that you provided, there are 2 images that don't have any .Label object information. Therefore, those 2 records did not show up in the CSV data. If that's what was expected, then all is good. If you expected to see all images, whether they had object information or not, then those types of records would need to be taken into account. So if you noticed and were wondering why you were missing some of the input images in the CSV data, that's why.

The 2 JSON objects without label object information:

{
  "ID": "ckjayp5lc00003a69wakzxsxg",
  "DataRow ID": "ckj9odjg37bi20rh6ei442ug7",
  "Labeled Data": "https://storage.labelbox.com/ckj9o1nvf6pqk0716vn24noz7%2F0bacdc01-bee2-3b16-627e-671b8f6c2a1a-DSC04905.jpg?Expires=1611474714892&KeyName=labelbox-assets-key-1&Signature=dvsZdXT3dbUeRBvORvVxtMhhyMA",
  "Label": {},
  "Created By": "victor@oasisoutsourcing.co.ke",
  "Project Name": "SubSeaScanning",
  "Created At": "2020-12-30T05:11:30.000Z",
  "Updated At": "2020-12-30T05:23:05.000Z",
  "Seconds to Label": 32.094,
  "External ID": "DSC04905.jpg",
  "Agreement": -1,
  "Benchmark Agreement": -1,
  "Benchmark ID": null,
  "Dataset Name": "Trial",
  "Reviews": [
    {
      "score": 1,
      "id": "ckjc8j0k60gk50yaw07n69wiu",
      "createdAt": "2020-12-31T02:34:25.000Z",
      "createdBy": "victor@oasisoutsourcing.co.ke"
    }
  ],
  "View Label": "https://editor.labelbox.com?project=ckj9obfp954gq0718tasdrinc&label=ckjayp5lc00003a69wakzxsxg"
}
{
  "ID": "ckjmlilco000039686bfxpf8k",
  "DataRow ID": "ckjm8xorhfbe80rj53hi0bg36",
  "Labeled Data": "https://storage.labelbox.com/ckj9o1nvf6pqk0716vn24noz7%2F4644e43b-5e58-ff51-d1e3-4551ed722d6f-n101_0408.jpg?Expires=1611474715610&KeyName=labelbox-assets-key-1&Signature=erTXaMYyBO56kurONJv6AVGt8zU",
  "Label": {},
  "Created By": "evans@oasisoutsourcing.co.ke",
  "Project Name": "SubSeaScanning",
  "Created At": "2021-01-07T08:31:15.000Z",
  "Updated At": "2021-01-07T08:31:16.000Z",
  "Seconds to Label": 219.925,
  "External ID": "n101_0408.jpg",
  "Agreement": -1,
  "Benchmark Agreement": -1,
  "Benchmark ID": null,
  "Dataset Name": "Aassgard Spool - Batch 1",
  "Reviews": [
    {
      "score": 1,
      "id": "ckjn49ska0p010yd17ws9eyyh",
      "createdAt": "2021-01-07T17:20:44.000Z",
      "createdBy": "victor@oasisoutsourcing.co.ke"
    }
  ],
  "View Label": "https://editor.labelbox.com?project=ckj9obfp954gq0718tasdrinc&label=ckjmlilco000039686bfxpf8k"
}

Edited January 22, 2021 by TheXman

Block all input without UAC	Save/Retrieve Images to/from Text	Monitor Management (VCP commands)
Tool to search in text (au3) files	Date Range Picker	Virtual Desktop Manager
Sudoku Game 2020	Overlapped Named Pipe IPC	HotString 2.0 - Hot keys with string
x64 Bitwise Operations	Multi-keyboards HotKeySet	Recursive Array Display
Fast and simple WCD IPC	Multiple Folders Selector	Printer Manager
GIF Animation (cached)	Screen Scraping	Multi-Threading Made Easy

Block all input without UAC	Save/Retrieve Images to/from Text	Monitor Management (VCP commands)
Tool to search in text (au3) files	Date Range Picker	Virtual Desktop Manager
Sudoku Game 2020	Overlapped Named Pipe IPC	HotString 2.0 - Hot keys with string
x64 Bitwise Operations	Multi-keyboards HotKeySet	Recursive Array Display
Fast and simple WCD IPC	Multiple Folders Selector	Printer Manager
GIF Animation (cached)	Screen Scraping	Multi-Threading Made Easy

Block all input without UAC	Save/Retrieve Images to/from Text	Monitor Management (VCP commands)
Tool to search in text (au3) files	Date Range Picker	Virtual Desktop Manager
Sudoku Game 2020	Overlapped Named Pipe IPC	HotString 2.0 - Hot keys with string
x64 Bitwise Operations	Multi-keyboards HotKeySet	Recursive Array Display
Fast and simple WCD IPC	Multiple Folders Selector	Printer Manager
GIF Animation (cached)	Screen Scraping	Multi-Threading Made Easy

Block all input without UAC	Save/Retrieve Images to/from Text	Monitor Management (VCP commands)
Tool to search in text (au3) files	Date Range Picker	Virtual Desktop Manager
Sudoku Game 2020	Overlapped Named Pipe IPC	HotString 2.0 - Hot keys with string
x64 Bitwise Operations	Multi-keyboards HotKeySet	Recursive Array Display
Fast and simple WCD IPC	Multiple Folders Selector	Printer Manager
GIF Animation (cached)	Screen Scraping	Multi-Threading Made Easy

Block all input without UAC	Save/Retrieve Images to/from Text	Monitor Management (VCP commands)
Tool to search in text (au3) files	Date Range Picker	Virtual Desktop Manager
Sudoku Game 2020	Overlapped Named Pipe IPC	HotString 2.0 - Hot keys with string
x64 Bitwise Operations	Multi-keyboards HotKeySet	Recursive Array Display
Fast and simple WCD IPC	Multiple Folders Selector	Printer Manager
GIF Animation (cached)	Screen Scraping	Multi-Threading Made Easy

manipulate large csv-file

Recommended Posts

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Create an account or sign in to comment

Create an account

Sign in

Recently Browsing 0 members