Jump to content

Fixed Field Length Text Datafile Into 2d Array Using Regexp


Recommended Posts

I am trying to generate a 2d array out of an imported text file with fixed length fields on each line.

File format for each line is below.  The is no delimiter, just need to capture 10 chars then 35 chars then 4 chars etc etc into the 2d array.

10, 35,  4,  2,  8, 30,  1, 14, 10, 16, 50, 10,  8

The data file I'm importing is pretty big, 140MB or so and 810k lines.

This is how I'm doing it now.  It's a bit slow though.  Any suggestions for a better way much appreciated guys as my regexfu is weak :(

#include <array.au3>

;read in data file
$raw = FileRead(@Scriptdir & "\datafile.dat")

;create 1d array of all values in file (would love to just have a regex to made a 2d array in one hit... sigh.. i suck)
$array1d = StringRegExp($raw,"(.{10})(.{35})(.{4})(.{2})(.{8})(.{1})(.{15})(.{10})(.{16})(.{50})(.{10})(.{8})",3)


$rows = UBound($array1d)/12     ;12 fields per line in data file & regex makes 1d array so this is the line count of file
$cols = 12                      ;12 fields per line
local $array2d[$rows][$cols]    ;2d array to hold results
local $counter = 0              ;like it says


for $x = 0 to $rows-1
    for $y = 0 to $cols-1
        $array2d[$x][$y] = $array1d[$counter] ;insert 12 elements per row from 1d array into 2d array
        $counter = $counter + 1
    Next
Next

_ArrayDisplay($array2d,"Results")

edit: sorry, inserted broken code, it works now

 

Edited by gruntydatsun
Link to comment
Share on other sites

#include <Array.au3>

$s_File = FileRead(@Scriptdir & "\datafile.dat")

$aA_RegExReturn = StringRegExp($s_File, "(?m)^(.{10})(.{35})(.{4})(.{2})(.{8})(.{1})(.{15})(.{10})(.{16})(.{50})(.{10})(.{8})", 4)
$a_Result = _ArrayAinATo2d($aA_RegExReturn)
_ArrayDisplay($a_Result)


; #FUNCTION# ======================================================================================
; Name ..........: _ArrayAinATo2d()
; Description ...: Convert a Arrays in Array into a 2D array
; Syntax ........: _ArrayAinATo2d(ByRef $A)
; Parameters ....: $A             - the arrays in array which should be converted
; Return values .: Success: a 2D Array build from the input array
;                  Failure: False
;                     @error = 1: $A is'nt an 1D array
;                            = 2: $A is empty
;                            = 3: first element isn't a array
; Author ........: AspirinJunkie
; =================================================================================================
Func _ArrayAinATo2d(ByRef $A)
    If UBound($A, 0) <> 1 Then Return SetError(1, UBound($A, 0), False)
    Local $N = UBound($A)
    If $N < 1 Then Return SetError(2, $N, False)
    Local $u = UBound($A[0])
    If $u < 1 Then Return SetError(3, $u, False)

    Local $a_Ret[$N][$u]

    For $i = 0 To $N - 1
        Local $t = $A[$i]
        If UBound($t) > $u Then ReDim $a_Ret[$N][UBound($t)]
        For $j = 0 To UBound($t) - 1
            $a_Ret[$i][$j] = $t[$j]
        Next
    Next
    Return $a_Ret
EndFunc   ;==>_ArrayAinATo2d

 

If the datatypes are necessary or you want to convert some columns or anything then maybe this is an alternative for you:

#include <Array.au3>

$s_File = FileRead(@ScriptDir & "\datafile.dat")
$a_Data = String_HandleTable($s_File, "int 10; leftString 35; myOwnFunc 4; Int 2;")
_ArrayDisplay($a_Data, "parsed")


Func myOwnFunc(ByRef $A)
    Return String($A)
EndFunc   ;==>myOwnFunc

; convert a string with fixed column widths into an 2D-array + user defined type conversions
Func String_HandleTable(ByRef Const $s_String, Const $s_Format)
    ; by AspirinJunkie
    ; create format array
    Local $a_Split = StringRegExp($s_Format, '\s*(\w+)\s*(\d+)', 4)
    If @error Then Return SetError(1, @error, "")
    Local $a_Cols[UBound($a_Split)][3]

    Local $x = 1, $j = 0
    For $i In $a_Split
        $x += $i[2]
        If $i[1] = "void" Then ContinueLoop
        $a_Cols[$j][0] = $i[1]
        $a_Cols[$j][1] = $x - $i[2]
        $a_Cols[$j][2] = $i[2]

        $j += 1
    Next
    If $j <> UBound($a_Cols) Then ReDim $a_Cols[$j][3]

    ; process string

    Local $a_Split = StringRegExp($s_String, "([^\r\n|\n|\r]+)", 3)
    Local $a_Ret[UBound($a_Split)][UBound($a_Cols)]
    Local $s_rawVal, $s_Val, $x = 0
    For $s_Line In $a_Split
        For $i = 0 To UBound($a_Cols) - 1
            $s_rawVal = StringMid($s_Line, $a_Cols[$i][1], $a_Cols[$i][2])
            Switch $a_Cols[$i][0]
                Case "leftString"
                    $s_Val = StringStripWS($s_rawVal, 2)
                Case "rightString"
                    $s_Val = StringStripWS($s_rawVal, 1)
                Case Else
                    $s_Val = Execute($a_Cols[$i][0] & '("' & $s_rawVal & '")')
                    If @error Then ContinueLoop
            EndSwitch
            $a_Ret[$x][$i] = $s_Val
        Next
        $x += 1
    Next
    Return $a_Ret
EndFunc   ;==>String_HandleTable

 

Edited by AspirinJunkie
Link to comment
Share on other sites

Hi,

18 hours ago, gruntydatsun said:

The data file I'm importing is pretty big, 140MB or so and 810k lines.

This is how I'm doing it now.  It's a bit slow though. 

The question is, what do you want to do?

Nobody reads 140MB data to display them within an _ArrayDisplay()!

Some profiling shows that reading a 169MB testfile takes about 4 seconds, regex takes 6 seconds, transforming 1D_>2D-array takes 17 seconds....all in all 27 seconds on my machine (without "showing" the data via _ArrayDisplay() which takes 8-10 extra seconds)

How should the data be further processed? I am almost sure that there is a faster solution for your "real" problem...

Edited by AndyG
Link to comment
Share on other sites

Further to AndyG's comments

If you want to view data after the 65,525th row (index 65,524) , then _ArrayDisplay() is not the way to go. (See _ArrayDisplay - 2nd paragraph of "Remarks")

#include <array.au3>

;read in data file
Local $raw ;= FileRead(@Scriptdir & "\data_file.txt")

; Create test data
For $i = 0 To 169 * 70000 ;  850000
    $raw &= Chr(Mod($i, 26) + 97)
Next
ConsoleWrite("Test data created  " & 169 * 850000 / 1000000 & " Mb" & @CRLF) ; Returns 143.65 Mb

;create 1d array of all values in file (would love to just have a regex to made a 2d array in one hit... sigh.. i suck)
$array = StringRegExp($raw, "(.{10})(.{35})(.{4})(.{2})(.{8})(.{1})(.{15})(.{10})(.{16})(.{50})(.{10})(.{8})", 3)
_ArrayDisplay($array)

$rows = Ceiling(UBound($array) / 12) ;12 fields per line in data file & regex makes 1d array so this is the line count of file
$cols = 12 ;12 fields per line
Local $array2d[$rows][$cols] ;2d array to hold results

; 1D array, $array, to 2D, array, $array2d.
For $i = 0 To UBound($array) - 1
    $array2d[Int($i / $cols)][Mod($i, $cols)] = $array[$i]
Next

_ArrayDisplay($array2d, "Results", "70000") ; Note: Only 65525 rows shown, not 70,000.

An alternative method is:

Local $raw = FileRead(@ScriptDir & "\data_file.txt")

Local $sTabSeparatedFields = StringRegExpReplace($raw, "(.{10})(.{35})(.{4})(.{2})(.{8})(.{1})(.{15})(.{10})(.{16})(.{50})(.{10})(.{8})", _
        "\1" & @TAB & "\2" & @TAB & "\3" & @TAB & "\4" & @TAB & "\5" & @TAB & "\6" & @TAB & "\7" & @TAB & "\8" & @TAB & _
        "\9" & @TAB & "\10" & @TAB & "\11" & @CRLF) ; <-- Trailing " & @CRLF" may not be needed if data fields already are in lines 169 characters long.
ConsoleWrite($sTabSeparatedFields & @CRLF)

 

Link to comment
Share on other sites

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now
 Share

  • Recently Browsing   0 members

    • No registered users viewing this page.
×
×
  • Create New...