Jump to content

manipulate large csv-file


dejhost
 Share

Recommended Posts

Hello!
I am dealing with image annotations for machine-learning purposes. My input-file is a .csv-file, which I need to edit. So output should still be .csv, but the little content that is actually required from the original file should be presented in a different structure. That means, that a lot of content from the original file can be deleted.

My challange is that the input file is quite large: Right now, I am just working with a snipped, consisting of  a 8 rows (that's 1 image per row) and 822 columns. The final file will consist of more than 20.000 lines and possibly 2-3000 columns. Even  a file with 500 rows is taking a lot of time and I fear that the final file will just fail.

I checked the time-consumption after each manipulation and wrote it as a comment behind the MsgBoxes (e.g. line 21).

 

Could you please take a look at my code below and the screenshots as well as a small input-file attached?

  1. In the orig. file, 1 column lists 1 image, followed by all  annotations. In line 64-78 in the code below, I change the format: After this loop, 1 coloumn contains only one annotation. Do you have an idea on how to speed this up?
  2. Deleting columns (line 125 - 131) fails when $i = 514. I do not understand why.
  3. For some reasons, the array seems to remain quite large, even after I deleted many columns. See the screenshot 03 attached. Why is that?

I'd appreciate any help to mature the code. Thank you very much.

#include <Array.au3>
#include <File.au3>
#include <MsgBoxConstants.au3>
#include <String.au3>
;~ Start timer
Local $hTimer = TimerInit() ; Begin the timer and store the handle in a variable.


Local $sFilePath = "C:\Users\Jost\Documents\RnD\rePho\Readcsv\result_short.csv"
Local $aCSV
Local $i, $j = 0, $cols, $rows
Global $sSingleFill

;Read the csv-file to Array and check timer
_FileReadToArray($sFilePath, $aCSV, Default,',')
$rows = $aCSV[0][0]
$cols = $aCSV[0][1]
consolewrite("$rows = " & $rows & "    $cols = " & $cols & @CRLF)
; Timer
Local $fDiff = TimerDiff($hTimer)
MsgBox($MB_SYSTEMMODAL, "21 Time Difference ", $fDiff)      ;14 ms
$hTimer = TimerInit()


;~ Remove the first 2 columns of the array since they do not contain any valuable information
_ArrayColDelete($aCSV,0)
_ArrayColDelete($aCSV,0)

$fDiff = TimerDiff($hTimer)
MsgBox($MB_SYSTEMMODAL, "30 Time Difference", $fDiff)       ;28 ms
$hTimer = TimerInit()

; update array dimensions
$aCSV[0][0] = $aCSV[0][0]-2
$rows = $aCSV[0][0]
consolewrite("$rows = " & $rows & "    $cols = " & $cols & @CRLF)

;~ Remove columns with unwanted information (repetative)
While $j < $cols - 6
    _ArrayColDelete($aCSV,$j + 1)
    _ArrayColDelete($aCSV,$j + 1)
    _ArrayColDelete($aCSV,$j + 1)
    _ArrayColDelete($aCSV,$j + 2)
    _ArrayColDelete($aCSV,$j + 6)
    $cols = $cols - 5
    $j = $j + 5
Wend

$fDiff = TimerDiff($hTimer)
MsgBox($MB_SYSTEMMODAL, "50 Time Difference", $fDiff)       ;2.8 s
$hTimer = TimerInit()

; update array dimensions
$aCSV[0][0] = Ubound ($aCSV, 1)
$aCSV[0][1] = Ubound ($aCSV, 2)
$rows = Ubound ($aCSV, 1)
$cols = Ubound ($aCSV, 2)
consolewrite("$rows = " & $rows & "    $cols = " & $cols & @CRLF)

;Each row is supposed to descibe only 1 annotation. But right now it contains all annotations, which leads to many columns. So:
;duplicate the field containing the image name ($aCSV[$i][0]), and create a new row for each annotation (5 coloumns).


For $i = 2 to $aCSV[0][0]-1                                         ; circle through all rows in array
    $j = 6                                                          ; in column 6, the second annotation starts
    For $j = 6 To $aCSV[0][1]-1 Step 5                              ; circle through all cols...
        If $aCSV[$i][$j] <> "" Then                                 ; as long as there are annotations.
            $sSingleFill = $aCSV[$i][0]                             ; pick the image name
            For $k = 0 To 4                                         ; add the next 4 coloumns
                $sSingleFill = $sSingleFill & "|" & $aCSV[$i][$j+$k]
                $aCSV[$i][$j+$k] = ""
            NEXT
        EndIF
        _ArrayAdd($aCSV, $sSingleFill)                              ;and shift them to a new row.
        $sSingleFill = ""
        $aCSV[0][0] = $aCSV[0][0] + 1                               ;update field containing number of rows
    Next
Next

$fDiff = TimerDiff($hTimer)
MsgBox($MB_SYSTEMMODAL, "81 Time Difference", $fDiff)       ;20.3 s


; update array dimensions
$aCSV[0][0] = Ubound ($aCSV, 1)
$aCSV[0][1] = Ubound ($aCSV, 2)
$rows = Ubound ($aCSV, 1)
$cols = Ubound ($aCSV, 2)
consolewrite("$rows = " & $rows & "    $cols = " & $cols & @CRLF)

_ArrayDisplay($aCSV, "finnished transformation line 91", "5")
$hTimer = TimerInit()

;~ Clean up the array
_ArrayDelete($aCSV, 1)                                              ;Delete the row containing the header
$aCSV[0][0] = $aCSV[0][0] - 1                                       ;Update field containing number of rows
$fDiff = TimerDiff($hTimer)
MsgBox($MB_SYSTEMMODAL, "98 Time Difference", $fDiff)       ;0.4 s

_ArrayDisplay($aCSV, "deleted header line 100", "5")

; update array dimensions
$aCSV[0][0] = Ubound ($aCSV, 1)
$aCSV[0][1] = Ubound ($aCSV, 2)
$hTimer = TimerInit()

;Delete empty columns
For $i = 7 to $aCSV[0][1]
    _ArrayColDelete($aCSV, $i)
    $aCSV[0][1] = $aCSV[0][1] - 1
Next

$fDiff = TimerDiff($hTimer)
MsgBox($MB_SYSTEMMODAL, "114 Time Difference", $fDiff)          ; 45 s

_ArrayDisplay($aCSV, "finnished Delete empty columns line 117")
$hTimer = TimerInit()

; update array dimensions
$aCSV[0][0] = Ubound ($aCSV, 1)
$aCSV[0][1] = Ubound ($aCSV, 2)

;~ _DeleteEmptyRows($aCSV)

For $i = 1 to Ubound ($aCSV, 1)-1
    ConsoleWrite ("$i = " & $i & @CRLF)
    If $aCSV[$i][1] = "" Then
        _ArrayDelete($aCSV, $i)
        $aCSV[0][0] = $aCSV[0][0]-1
    EndIf
Next

$aCSV[0][0] = Ubound ($aCSV, 1)
$aCSV[0][1] = Ubound ($aCSV, 2)

$fDiff = TimerDiff($hTimer)
MsgBox($MB_SYSTEMMODAL, "137 Time Difference", $fDiff)              ;0.3s
_ArrayDisplay($aCSV, "Array cleaned up line 136")
$hTimer = TimerInit()

;~ Extract image-name from field $aCSV[$i][0]
For $i = 1 to $aCSV[0][0]-1                                         ;Circle through all rows

    $string = $aCSV[$i][0]                                          ;Pick field containtn image name

    Local $rpos = StringInStr($string, "?")                         ;Search for delimiter
        $string = StringLeft ($string, $rpos-1)

    Local $lpos = StringInStr($string, "-", 0 , -1)                 ;Search for delimiter
        $string = StringTrimLeft ($string, $lpos )
    consolewrite("$string = " & $string & @CRLF)
    $aCSV[$i][0] = $string                                          ;Insert image-name into $aCSV[$i][0] (overwrite)
Next

$fDiff = TimerDiff($hTimer)
MsgBox($MB_SYSTEMMODAL, "154 Time Difference", $fDiff)              ; 10 s

_ArrayDisplay($aCSV, "All done line 156")
Exit



Func _DeleteEmptyRows($aCSV)
    Local $Rows = Ubound($aCSV,1)
    Local $Cols = Ubound($aCSV,2)
    Local $aTemp[$Rows][$Cols]
    Local $not_empty
    Local $Count = 0

    ;Loop through rows
    For $Y = 0 to $Rows - 1
        $not_empty = 0

        ;Loop through columns
        For $X = 0 to $Cols - 1

            ;Copy all columns to temp array even if they are all empty
            $aTemp[$Count][$X] = $aCSV[$Y][$X]

            ;If even one column contains data, make sure it doesn't get deleted
            If $aCSV[$Y][$X] <> "" Then $not_empty = BitOr($not_empty, 1)
        Next

        ;If the row has any data, increment, else keep overwriting last row until it contains something
        If $not_empty Then $Count += 1
    Next

    Redim $aTemp[$Count][$Cols]
    Return $aTemp
EndFunc

 

 

 

 

 

 

Screenshot 03.jpg

Screenshot 01.jpg

Screenshot 02.jpg

result_short.csv

Link to comment
Share on other sites

A few general comments :

1- Using _Array* repeatedly to add or delete 1 row/col each time is terribly slow with a large array, consider using ReDim instead

2- Multiple steps (like you did) simplify the programmer conceptional effort but generate extra machine execution work, consider reducing it to a single step (maybe 2 at most)

3- On Win10, array manipulation is quite slow with the current AutoIt version, consider using beta version

Link to comment
Share on other sites

i fully agree with the two posts above, except that ReDim is also a poor choice for many successive operations.

i'd go about it this way: read the input file line by line, and for each line in turn:

1) split the columns into a 1-D array of values.

2) build the target row, as a string, from the desired values.

3) write the new line into the target file.

this way you avoid loading the RAM too much, and you do not need to perform array manipulation at all.

Edited by orbs

Signature - my forum contributions:

Spoiler

UDF:

LFN - support for long file names (over 260 characters)

InputImpose - impose valid characters in an input control

TimeConvert - convert UTC to/from local time and/or reformat the string representation

AMF - accept multiple files from Windows Explorer context menu

DateDuration -  literal description of the difference between given dates

Apps:

Touch - set the "modified" timestamp of a file to current time

Show For Files - tray menu to show/hide files extensions, hidden & system files, and selection checkboxes

SPDiff - Single-Pane Text Diff

 

Link to comment
Share on other sites

1 minute ago, orbs said:

except that ReDim is also a poor choice for many successive operations

I was not clear enough, instead of deleting last X columns with _ArrayColDelete one at a time and deleting last Y rows with _ArrayDelete one at a time, you can ReDim the array in a single statement step (which is quite fast).  For example :

#include <Constants.au3>
#include <Array.au3>

Local $arr[200][1000]

For $i = 0 to UBound($arr, 1) - 1
  For $j = 0 to UBound($arr, 2) - 1
    $arr[$i][$j] = $i & "-" & $j
  Next
Next

; _ArrayDisplay($arr)

Local $hTime = TimerInit()
ReDim $arr[10][10]
MsgBox ($MB_SYSTEMMODAL,"",TimerDiff($hTime))

_ArrayDisplay($arr)

 

Link to comment
Share on other sites

7 hours ago, FrancescoDiMuro said:

@dejhost
I think that it could be more easily for us to know what's your goals instead of trying to understand your script.
Just make a list of "to do's" and we'll try our best to help you :)

Good point, thank you.

Since I received quite valuable comments already, let me improve my skript and come back to you afterwards concerning a to-do list. 

However: this is the goal: a .csv, looking like this:

image.png.7aa6b06980f8425e1b05744bd615cc75.png

---

Thanks everybody: I had no idea that the array-operations are relatively slow. Reading and processing one line at the time makes sense to me.

Link to comment
Share on other sites

@dejhost
As I was saying, knowing the final result is a must in order to build a script that works for your needs, but this may help you to continue your development:

#include <Array.au3>
#include <File.au3>
#include <StringConstants.au3>

Test()

Func Test()

    Local $strFileName = @ScriptDir & "\result_short.csv", _
          $arrFileContent, _
          $arrImages[1][8] = [["Filename", "Width", "Height", "X_Min", "Y_Min", "X_Max", "Y_Max"]], _
          $arrImageFileName, _
          $hTimer

    ; Start the timer
    $hTimer = TimerInit()

    _FileReadToArray($strFileName, $arrFileContent, $FRTA_COUNT, ",")
    If @error Then Return ConsoleWrite("_FileReadToArray ERR: " & @error & @CRLF)

    ; Loop through the array content
    For $i = 1 To $arrFileContent[0][0] Step 1

        ; If the image has no width or height, then just continue the loop
        If ($arrFileContent[$i][10] = "") Or ($arrFileContent[$i][11] = "") Then ContinueLoop

        ; Obtaining the filename of the image
        $arrImageFileName = StringRegExp($arrFileContent[$i][2], '(?<=%)([^.]+\.jpg)', $STR_REGEXPARRAYMATCH)

        ; If the name of the image has been obtained correctly
        If IsArray($arrImageFileName) Then

            ; Adding the elements to the Images array
            _ArrayAdd($arrImages, $arrImageFileName[0] & "|" & _
                                  $arrFileContent[$i][11] & "|" & _
                                  $arrFileContent[$i][10] & "|" & _
                                  "|" & _
                                  "|" & _
                                  "|" & _
                                  "|" & _
                                  "")
            If @error Then Return ConsoleWrite("_ArrayAdd ERR: " & @error & @CRLF)

        EndIf

    Next

    ; Time for the script execution
    ConsoleWrite("Script executed in " & TimerDiff($hTimer)/1000 & " sec." & @CRLF)

    ; Display the Images array
    _ArrayDisplay($arrImages)

EndFunc

And here's the execution time of the script:

Script executed in 0.0331623 sec.

:)

Edited by FrancescoDiMuro

Click here to see my signature:

Spoiler

ALWAYS GOOD TO READ:

 

Link to comment
Share on other sites

Maybe first read it into an sqlite database and then query it first to an easier to understand 13 column table

Another example with first transforming it in powershell (same you could do with SQLite database)

  • Powershell is powerfull in dealing with csv files and querying them directly with named columns instead of numbers
  • Your testfile misses 1 column at the end
  • You have to create the calculations yourself.
  • Below outputs it into file transformed.csv which you then can deal with from Autoit
clear-host
<#
ID                           : ckj9snx1r00003a699vfsn650
DataRow ID                   : ckj9odjg47bie0rh60mq7edin
Labeled Data                 : https://storage.labelbox.com/ckj9o1nvf6pqk0716vn24noz7%2Fa7fdb985-6331-2931-da04-ad7fa0e1
                               48c1-n101_0058.jpg?Expires=1611474714883&KeyName=labelbox-assets-key-1&Signature=uviaZ-Av
                               HVLSuDCabq3Fw9CU13E
Label/objects/0/featureId    : ckj9qsu9707br0yaf04i5bzxd
Label/objects/0/schemaId     : ckj9q1xtm00ko0yae812f3asg
Label/objects/0/title        : Ocluded Ucoded Target
Label/objects/0/value        : ocluded_ucoded_target
Label/objects/0/color        : #008941
Label/objects/0/bbox/top     : 803
Label/objects/0/bbox/left    : 0
Label/objects/0/bbox/height  : 24
Label/objects/0/bbox/width   : 20
Label/objects/0/instanceURI  : https://api.labelbox.com/masks/feature/ckj9qsu9707br0yaf04i5bzxd?token=eyJhbGciOiJIUzI1Ni
                               IsInR5cCI6IkpXVCJ9.eyJ1c2VySWQiOiJja2o5cTNjeTg3anRvMDcxNmU2Zzd0ejEzIiwib3JnYW5pemF0aW9uSW
                               QiOiJja2o5bzFudmY2cHFrMDcxNnZuMjRub3o3IiwiaWF0IjoxNjEwMjY1MTEzLCJleHAiOjE2MTI4NTcxMTN9.nI
                               Ov1e49Ci1etM4YxN57hujTKB0kQpueX26YULxFrsI
...
Label/objects/81/bbox/width
#>
$rawData = Import-Csv -Path "result_short.csv"
$rowcount=$rawdata.count
$columnCount=$rawdata[0].psobject.properties.name.count
#10 is default number of columns
$repeatNNTimes = ($columncount-2) / 10


#0-81 repeats for all blocks
for ($i=0; $i -le $repeatNNTimes-1; $i++) {
    $prefix="Label/objects/$i"
    $global:rowId=0
    $transformedData=$rawdata | select-object @{name="rowid" ; expression={$global:rowId=$global:rowId+1; return $global:rowId}},
                         @{name="colid" ; expression={return $i}}, 
                         @{name="imgname"; expression={$_."Labeled Data"-match "[^-]*.jpg" | out-null; return $matches[0]}}, 
                         @{name="width"  ; expression={$_."$prefix/bbox/width"}},
                         @{name="height" ; expression={$_."$prefix/bbox/height"}},
                         @{name="class"  ; expression={"tag"}},
                         @{name="xmin"   ; expression={"xmin"}},
                         @{name="ymin"   ; expression={"ymin"}},
                         @{name="xmax"   ; expression={"xmax"}},
                         @{name="ymax"   ; expression={"ymax"}},
                         @{name="instanceURI"   ; expression={$_."$prefix/instanceURI"}} | where {$_.instanceURI -ne ""}
                         
  if ($i -eq 0) {($transformedData | select -First 1 | convertto-csv -NoTypeInformation)[0]  | out-file transformed.csv -encoding utf8 -force }
  $transformedData | convertto-csv -NoTypeInformation | Select-Object -Skip 1 | out-file transformed.csv -Append -encoding utf8
  #$transformedData | format-table -AutoSize -HideTableHeaders 
    }

 

Link to comment
Share on other sites

Thanks @everybody here for your input.

I re-wrote my script: It

  • reads 1 line at the time,
  • builds the output-line in form of an array and finally
  • appends the line to an output file.

It's not superfast (I'm not exactly what you would call an expert on Regexp or SQL), but it's acceptable.

Processing the input-file, the script bumps into troubles after 120 annotations of success. Reasons: The csv-input-file breakes its rutines (in column 1130 or whatever). Fixing this manually just leads to more troubles a little later. So either the csv-input-file was inconsistent when I revceived it, or the conversion I performed (prior reading it with autoit) screwed it up.

However: it seems to me that working with the original files is the better way. Disadvantage is, that they are pretty hard to read for me. Please take a look at them (identical content, diffrerent format). Could you please advise me what would be a good approach? Should I work with the csv or json?

Thank you very much for giving me a hand.

csv.zip json.zip

Link to comment
Share on other sites

Below I add the strucutre of the output-file as it is supposed to be (the first few Annotations). The values in the second and third columns are retrieved from another source.

Filename Width Height Class X_Min Y_Min X_Max Y_Max
n101_0058.jpg 2848 4288 ocluded_ucoded_target 803 0 24 20
n101_0058.jpg 2848 4288 ocluded_coded_target 845 0 57 53
n101_0058.jpg 2848 4288 uncoded_target 1007 66 32 38
n101_0058.jpg 2848 4288 coded_target 1051 72 58 65
n101_0058.jpg 2848 4288 uncoded_target 1213 145 30 38
n101_0058.jpg 2848 4288 coded_target 1257 155 56 62
n101_0058.jpg 2848 4288 uncoded_target 1417 225 32 37
n101_0058.jpg 2848 4288 ocluded_coded_target 645 0 48 26
n101_0058.jpg 2848 4288 uncoded_target 795 48 31 33
n101_0058.jpg 2848 4288 coded_target 844 63 48 46
n101_0058.jpg 2848 4288 uncoded_target 999 130 32 32
n101_0058.jpg 2848 4288 uncoded_target 1205 212 30 29
n101_0058.jpg 2848 4288 coded_target 1251 226 52 44
Edited by dejhost
text in table had white color...
Link to comment
Share on other sites

I am supposing that the json file is the original and the csv is the result of the conversion of the json, right ?

Each line in the json file is an annotation as you are calling it ?

Link to comment
Share on other sites

Both attached files are exports from a software called label-box: https://labelbox.com   

Quote

Each line in the json file is an annotation as you are calling it ?

If you look at the json-file in notepad++ (See screenshot below), each line contains many annotations. Starting with column 3 ("Labeled Data"), every 10th column starts a new annotation. So column 3, 13, 23, 33.... is called ("Labeled Data") and contains the image name (the image name ist hidden in the link). But using this numbering failed in coloumn 11200 (or whatever).  However: each line describes one image (so about 50 images):

image.thumb.png.6605d4677985adad8f49d1f5effadf9d.png

---

If you look at the json in a webbrowser (See screenshot below)... well: the first "0" relates to the first image. The second "0" (below "objects") is the first annotation. The values I am looking for are red encircled, and are part of the line in the table posted an hour ago:

image.thumb.png.239623ff11c197ecf9b72e91dc36f447.png

 

 

 

Link to comment
Share on other sites

Could you test that, seems to be working :

#include <Constants.au3>
#include <Array.au3>
#include <File.au3>

Local $aFile = FileReadToArray("export-2021-01-10T07_51_53.836Z.json")
Local $hFile = FileOpen("Output.txt", $FO_OVERWRITE)
Local $sContent, $sLine, $aContent

Local $hTime = TimerInit()
For $i = 1 to UBound($aFile)-1
  $sContent = StringRegExp(StringRegExp($aFile[$i],'"Labeled Data":"([^"]+)',1)[0],'(?i)([^-]*?\.jpg)',1)[0]
  ConsoleWrite ($sContent & @CRLF)
  $aContent = StringRegExp($aFile[$i],'"value":"([^"]+).+?:\{"top":([^,]+),"left":([^,]+),"height":([^,]+),"width":([^}]+)', 3)
  $sLine = ""
  For $j = 0 To UBound($aContent) - 1 Step 5
    $sLine &= $sContent
    For $k = $j To $j+4
      $sLine &= @TAB & $aContent[$k]
    Next
    $sLine &= @CRLF
  Next
  FileWrite($hFile, $sLine)
Next
FileClose($hFile)
MsgBox ($MB_SYSTEMMODAL,"",Round(TimerDiff($hTime)/1000,2))

I optimized the code.  Went from ~5secs to ~0.6sec

Edited by Nine
Link to comment
Share on other sites

As said by @FrancescoDiMuro it helps to get the overall  target picture clear first.

If it starts with JSON you should manipulate that as the earlier in the process you can structure your data the faster your full process will be.

  1. Intall Notepad++ with plugin JSON viewer. Browser is fine but more specific JSON viewer tools will help sometimes
  2. I do not know a good JSON library in Autoit that can easily flatten your filebut maybe others can give input for that
  3. See below how you can easily get your array from the JSON in powershell directly as csv
    • you have to do the calculations yourself
$jsonFile= "c:\temp\export-2021-01-10T07_51_53.836Z.json"

$rawData=Get-Content -Raw -Path $jsonFile | ConvertFrom-Json

$result=foreach($row in $rawdata) {
    foreach($o in $row.label.objects) {
       # write-host $row.'Labeled Data' $o.featureid $o.schemaid $ o.title $o.value $o.color $o.bbox.top $o.bbox.left $o.bbox.height $o.bbox.width
       # write-host $row."External ID" $o.featureid $o.schemaid $o.title $o.value $o.color $o.bbox.top $o.bbox.left $o.bbox.height $o.bbox.width
       [pscustomobject]@{ 
          id       = $row."External ID" ;
          featurid = $o.featureid ;
          schemaid = $o.schemaid ;
          title    = $o.title ;
          value    = $o.value ;
          color    = $o.color ;
          top      = $o.bbox.top ;
          left     = $o.bbox.left ;
          height   = $o.bbox.height ;
          width    = $o.bbox.width
       }
    }
}

$result | convertto-csv -NoTypeInformation | out-file result.csv -encoding utf8

 

Link to comment
Share on other sites

I'd second junkew's suggestion to get the data from JSON. This UDF is used for the Danp2's WebDriver UDF, so I think it's pretty stable: 

Here's a brief example:
(Note: your JSON file is huge, it may take a long time to load into AutoIT. I gave up after 20 seconds, but I use a relic)

#include <json.au3>

Func Main()
    ; ... 
    Local $sFileContents = FileRead("export-2021-01-10T07_51_53.836Z.json")
    ; Load the JSON into AutoIt
    Local $oJson = Json_Decode($sFileContents)
    If @error Then Exit ConsoleWrite("Json_Decode Error: " & @error & @CRLF)

    Local $sValue
    For $i=0 To 5
        ; Get the value of the ID at item $i
        $sValue = Json_Get($oJson, "[" & $i & "][ID]")
        If @error Then Exit ConsoleWrite("Json_Get[" & $i & "] Error: " & @error & @CRLF)
        ConsoleWrite("ID: " & $sValue & @CRLF)
    Next
    ; ... 
    
EndFunc

All my code provided is Public Domain... but it may not work. ;) Use it, change it, break it, whatever you want.

Spoiler

My Humble Contributions:
Personal Function Documentation - A personal HelpFile for your functions
Acro.au3 UDF - Automating Acrobat Pro
ToDo Finder - Find #ToDo: lines in your scripts
UI-SimpleWrappers UDF - Use UI Automation more Simply-er
KeePass UDF - Automate KeePass, a password manager
InputBoxes - Simple Input boxes for various variable types

Link to comment
Share on other sites

Here's an example that uses the jq UDF to generate CSV data from the JSON file attached in one of the previous posts.  It generates the 52,405 CSV records in less than 2 seconds on my laptop.  The CSV contains the data items that were circled in red in the post above.  If you want TSV data instead of CSV data, just change the @csv to @tsv in $JQ_FILTER.

#cs
    This example uses the jq UDF.

    https://www.autoitscript.com/forum/files/file/502-jq-udf-a-powerful-flexible-json-processor/
#ce

#include <Constants.au3>
#include "jq.au3"       ;<== Modify as needed

_jqInit("jq-win64.exe") ;<== Modify as needed
If @error Then Exit MsgBox($MB_ICONERROR + $MB_TOPMOST, "ERROR", StringFormat("ERROR: Unable to initialize jq - @error = %s", @error))

example()

Func example()
    Const $JSON_FILE  = "export-2021-01-10T07_51_53.836Z.json" ;<== Modify as needed
    Const $JQ_FILTER  = '.[] | ' & _
                        '{file: ."External ID", type: .Label.objects[]?} | ' & _
                        '{file, value: .type.value, top: .type.bbox.top, left: .type.bbox.left, height: .type.bbox.height, width: .type.bbox.width} | ' & _
                        '[.[]] | @csv'

    Local $hTimer     = TimerInit()
    Local $sCmdOutput = _jqExecFile($JSON_FILE, $JQ_FILTER)
    Local $iTimerDiff = TimerDiff($hTimer)

    write_log_line(@CRLF & "================================")
    write_log_line($sCmdOutput)
    write_log_line(@CRLF & "Duration: " & StringFormat("%.3f seconds", $iTimerDiff / 1000))
EndFunc

Func write_log_line($sMsg = "")
    Const $TITLE_NOTEPAD = "[RegExpTitle:(?i)untitled - notepad]"
    Static $hWndNotepad = -1

    ;If we don't have a handle to notepad yet
    If $hWndNotepad = -1 Then
        ;If there isn't an existing instance of notepad running, launch one
        If Not WinExists($TITLE_NOTEPAD) Then Run("Notepad.exe")

        ;Get handle to notepad window
        $hWndNotepad = WinWait($TITLE_NOTEPAD, "", 3)
        If Not $hWndNotepad Then Exit MsgBox($MB_ICONERROR, "ERROR", "Unable to find Notepad window.")
    EndIf

    ;Paste msg to notepad text
    If WinExists($hWndNotepad) Then
        ControlCommand($hWndNotepad, "", "Edit1", "EditPaste", $sMsg & @CRLF)
    EndIf
EndFunc

Snippet of the output:

"n101_0058.jpg","ocluded_ucoded_target",803,0,24,20
"n101_0058.jpg","ocluded_coded_target",845,0,57,53
"n101_0058.jpg","uncoded_target",1007,66,32,38
"n101_0058.jpg","coded_target",1051,72,58,65
"n101_0058.jpg","uncoded_target",1213,145,30,38
.
.
.
"n101_0252.jpg","uncoded_target",2377,4194,40,54
"n101_0252.jpg","uncoded_target",2651,3126,40,48
"n101_0252.jpg","uncoded_target",2013,3421,31,47
"n101_0252.jpg","coded_target",2054,3224,54,75
"n101_0252.jpg","uncoded_target",2084,3170,33,42

Duration: 1.279 seconds

 

For anyone reading this later who's interested in understanding the jq filter that was used, here's the filter with comments.  FYI, this filter can also be fed to the jqExec* functions as a file, instead of as a string literal, using the -f option.  The "|" symbol basically tells the processor to feed the result of the previous part to the next part.

# For each top-level array entry
.[] | 

# Reduce to only interesting json objects/data,
# creating an object for each .Label.objects' array
# object
{
  file: ."External ID", 
  label: .Label.objects[]?
} | 

# Reduce result to final set of objects with data of interest
{
  file,
  value:  .label.value,
  top:    .label.bbox.top,
  left:   .label.bbox.left,
  height: .label.bbox.height,
  width:  .label.bbox.width
} |

# Convert result to an array for input to @csv
[.[]] |

# Convert array to CSV
@csv

 

Edited by TheXman
Added commented filter for those that may be interested
Link to comment
Share on other sites

I feel a bit overwhelemed by your enthusiasm! Thanks everybody!

 

@Nine: Works like a charm! Damn fast as well. I realised by now that the image-name is also listed as "External ID" (easy to see, once you kow about the json plugin for notepad++ ;-)) . So I retreive the image name directly, and skip the line

$sContent = StringRegExp(StringRegExp($aFile[$i],'"Labeled Data":"([^"]+)',1)[0],'(?i)([^-]*?\.jpg)',1)[0]

This brings it down to 0,5s. 👍

One comment/question remains: As mentioned, the input file is only a trial, dealing with 500 images. The final file will deal with more than 20.000 images, so 40 times larger. So I might go back to reading one line at the time: using "FileReadLine". Does that make sense?

@junkew: I agree on your "target picture"- comment. But it wasn't so easy to understand it myself: Only 1-2 days ago, I realised that one can look at json-files in a structured way (web browser) - giving a much better understanding of the file structure. The notepad-pluggin sure comes in handy.

Haven't used powershell yet, but reading about it on the web... possibly an even better tool than autoit (for this task). I also found https://pandas.pydata.org . But I will stick to autoit for now, since the task is basically done.

@seadoggie01: I too started using the json.au3 yesterday evening, but it seemed rather slow, yes.

@TheXman: Nice! Very fast, too!

 

Once again: Thank you everybody! Your skrips make it look so easy.... 😉

Link to comment
Share on other sites

4 hours ago, dejhost said:

@Nine: Works like a charm! Damn fast as well.

Thank you, by writing once at every line of input makes it way faster...

4 hours ago, dejhost said:

As mentioned, the input file is only a trial, dealing with 500 images. The final file will deal with more than 20.000 images, so 40 times larger.

At the moment 500 images takes about 27.5 Mb, so 40 times brings it to 1.1Gb.  Tad too big in my opinion to use FileReadToArray, it may slow down the whole process.  I would tend to agree with you, that reading the json file line by line may be a better option.  But nothing stops you to try both methods...Let us know how it goes.

Edited by Nine
Link to comment
Share on other sites

9 hours ago, dejhost said:

@TheXman: Nice! Very fast, too!

Thanks!  I updated my previous post with a commented description of what the jq filter does just in case anyone else, that comes across this topic, may be interested in jq and how it can be used to process JSON.

Edited by TheXman
Link to comment
Share on other sites

@dejhost

FYI,

Because you're dealing with such large datasets, it is easy to miss some of the details & data if you aren't careful.  For example, in the sample JSON file that you provided, there are 2 images that don't have any .Label object information.  Therefore, those 2 records did not show up in the CSV data.  If that's what was expected, then all is good.  If you expected to see all images, whether they had object information or not, then those types of records would need to be taken into account.  So if you noticed and were wondering why you were missing some of the input images in the CSV data, that's why.  :)

The 2 JSON objects without label object information:

{
  "ID": "ckjayp5lc00003a69wakzxsxg",
  "DataRow ID": "ckj9odjg37bi20rh6ei442ug7",
  "Labeled Data": "https://storage.labelbox.com/ckj9o1nvf6pqk0716vn24noz7%2F0bacdc01-bee2-3b16-627e-671b8f6c2a1a-DSC04905.jpg?Expires=1611474714892&KeyName=labelbox-assets-key-1&Signature=dvsZdXT3dbUeRBvORvVxtMhhyMA",
  "Label": {},
  "Created By": "victor@oasisoutsourcing.co.ke",
  "Project Name": "SubSeaScanning",
  "Created At": "2020-12-30T05:11:30.000Z",
  "Updated At": "2020-12-30T05:23:05.000Z",
  "Seconds to Label": 32.094,
  "External ID": "DSC04905.jpg",
  "Agreement": -1,
  "Benchmark Agreement": -1,
  "Benchmark ID": null,
  "Dataset Name": "Trial",
  "Reviews": [
    {
      "score": 1,
      "id": "ckjc8j0k60gk50yaw07n69wiu",
      "createdAt": "2020-12-31T02:34:25.000Z",
      "createdBy": "victor@oasisoutsourcing.co.ke"
    }
  ],
  "View Label": "https://editor.labelbox.com?project=ckj9obfp954gq0718tasdrinc&label=ckjayp5lc00003a69wakzxsxg"
}
{
  "ID": "ckjmlilco000039686bfxpf8k",
  "DataRow ID": "ckjm8xorhfbe80rj53hi0bg36",
  "Labeled Data": "https://storage.labelbox.com/ckj9o1nvf6pqk0716vn24noz7%2F4644e43b-5e58-ff51-d1e3-4551ed722d6f-n101_0408.jpg?Expires=1611474715610&KeyName=labelbox-assets-key-1&Signature=erTXaMYyBO56kurONJv6AVGt8zU",
  "Label": {},
  "Created By": "evans@oasisoutsourcing.co.ke",
  "Project Name": "SubSeaScanning",
  "Created At": "2021-01-07T08:31:15.000Z",
  "Updated At": "2021-01-07T08:31:16.000Z",
  "Seconds to Label": 219.925,
  "External ID": "n101_0408.jpg",
  "Agreement": -1,
  "Benchmark Agreement": -1,
  "Benchmark ID": null,
  "Dataset Name": "Aassgard Spool - Batch 1",
  "Reviews": [
    {
      "score": 1,
      "id": "ckjn49ska0p010yd17ws9eyyh",
      "createdAt": "2021-01-07T17:20:44.000Z",
      "createdBy": "victor@oasisoutsourcing.co.ke"
    }
  ],
  "View Label": "https://editor.labelbox.com?project=ckj9obfp954gq0718tasdrinc&label=ckjmlilco000039686bfxpf8k"
}

 

Edited by TheXman
Link to comment
Share on other sites

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now
 Share

  • Recently Browsing   0 members

    • No registered users viewing this page.
×
×
  • Create New...