replace multiple strings in 100mb file

DCCD

Hi, I wrote a script that replaces multiple strings in an XML file. It works fine, but it is very slow!

I've tried StringReplace, _ReplaceStringInFile, and StringRegExpReplace; all of them are equally slow.

The file needs about 8000 replacements.

Any help would be greatly appreciated.

#include <File.au3>
$path = @ScriptDir & '\xmlfo.xml'
$OXML = FileOpen($path, 256)
$XML = FileRead($OXML)

$term = 'post'
$nofr = 1
Local $aArray = StringRegExp($XML, '(?s)<entry[^>]*>.*?</entry>', 3)
FileClose($OXML)
$XL = $XML
If Not @error Then
    For $i = 0 To UBound($aArray) - 1
        ;get data start
        ;ConsoleWrite ( $aArray[0] &' '&$i& @CRLF)
        $date = StringRegExp($aArray[$i], '(?i)<published>(.*?)</published>', 3)
        If @error Then
            $date = StringRegExp("date err", "(.{33,}?(?:\s)|.+)", 3)
        ElseIf Not @error Then
            ;ConsoleWrite($date[0] & ' ' & $i & @CRLF)
        EndIf
        $kind = StringRegExp($aArray[$i], '(?i)<category>(.*?)</category>', 3)
        If @error Then
            $kind = StringRegExp("kind err", "(.{33,}?(?:\s)|.+)", 3)
            ;ConsoleWrite ( $kind[0] &' '&$i& @CRLF)
        ElseIf Not @error Then
            ;ConsoleWrite ( $kind[0] &' '&$i& @CRLF)
        EndIf
        If $kind[0] = $term And Data(getdate($date[0], 'year'), getdate($date[0], 'month')) = True Then

            _ReplaceStringInFile($path, $aArray[$i], '')

            If Not @error Then
                ;MsgBox(16,'',$XL)
                ConsoleWrite($nofr & ' ' & $i & @CRLF)
                $nofr = $nofr + 1
            EndIf
            ;FileDelete(@ScriptDir & '\XML_output.xml')
            ;FileWrite (@ScriptDir & '\XML_output.xml', StringToBinary ( StringReplace($temp, $aArray[$i], "") , 4) )
        Else
            ConsoleWrite ('err0x0'& @CRLF)
        EndIf
    Next
EndIf

SmOke_N

Well, you have a huge issue: you are loading and unloading 100 MB into memory over and over.

Every call to _ReplaceStringInFile opens the file twice.

So... 2 suggestions I can think of.

1.  Ditch _ReplaceStringInFile() and just read the file into memory once, enumerate each line, keep what you want, and remove what you don't. This requires a second string to write back to the file; I say a second string because _ArrayDelete ReDims the array every time.

2.  Read the file in chunks and repeat step 1 for each chunk.

Edit:

If this is some type of database script, SQLite would make a lot more sense.
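
Suggestion 1 could be sketched roughly as follows. This is only an illustration, not code from the thread: _FilterEntries and _ShouldRemove are hypothetical names, the <entry> and <category> tags and the 'post' term are borrowed from the script in the first post, and the date check done there by Data() and getdate() is left out because those functions were not posted. The sketch walks the file contents once with StringInStr and appends everything that should be kept to a second string.

```autoit
; Sketch only: copy everything except unwanted <entry> blocks into a second string.
; _FilterEntries and _ShouldRemove are hypothetical helper names.
Func _FilterEntries($sXML)
    Local Const $sCloseTag = '</entry>'
    Local $sOut = '', $sEntry = ''
    Local $iPos = 1, $iStart = 0, $iEnd = 0

    While 1
        ; Find the next entry at or after the current position (case-sensitive).
        ; Note: '<entry' would also match a longer tag name such as '<entryList'.
        $iStart = StringInStr($sXML, '<entry', 1, 1, $iPos)
        If $iStart = 0 Then ExitLoop
        $iEnd = StringInStr($sXML, $sCloseTag, 1, 1, $iStart)
        If $iEnd = 0 Then ExitLoop
        $iEnd += StringLen($sCloseTag) ; first character after the closing tag

        ; Keep the text between the previous entry and this one.
        If $iStart > $iPos Then $sOut &= StringMid($sXML, $iPos, $iStart - $iPos)

        ; Keep the entry itself unless the test says to drop it.
        $sEntry = StringMid($sXML, $iStart, $iEnd - $iStart)
        If Not _ShouldRemove($sEntry) Then $sOut &= $sEntry

        $iPos = $iEnd
    WEnd

    ; Keep whatever follows the last entry.
    Return $sOut & StringMid($sXML, $iPos)
EndFunc

; Hypothetical test: drop entries whose <category> is 'post'.
Func _ShouldRemove($sEntry)
    Local $aKind = StringRegExp($sEntry, '(?i)<category>(.*?)</category>', 1)
    If @error Then Return False
    Return $aKind[0] = 'post'
EndFunc
```

With the whole file already read into $XML, the call would be $XML = _FilterEntries($XML), followed by a single write back to disk. Building a second string this way avoids both the repeated file access and the repeated array resizing mentioned above.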

kylomas

DCCD,

See jdelaney's sig for working with XML files directly. 

kylomas


jguinch

This: StringRegExp("date err", "(.{33,}?(?:\s)|.+)", 3)

and this: StringRegExp("kind err", "(.{33,}?(?:\s)|.+)", 3)

make no sense...

Can you post a sample of your XML file and explain exactly what you want to replace, and with what?
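
For what it's worth, with the fixed input "date err" that pattern appears to do nothing more than return a one-element array holding the same text, so the fallback can be written directly. A minimal sketch, assuming $sEntry holds one <entry> block ($sEntry, $aDate, and $aFallback are illustrative names, not from the thread):

```autoit
; Illustrative sample entry; the real script would use $aArray[$i] here.
Local $sEntry = '<entry><category>post</category></entry>'

Local $aDate = StringRegExp($sEntry, '(?i)<published>(.*?)</published>', 3)
If @error Then
    ; No <published> tag found: fall back to a plain one-element array
    ; instead of running a second regular expression on a fixed string.
    Local $aFallback[1] = ['date err']
    $aDate = $aFallback
EndIf

ConsoleWrite($aDate[0] & @CRLF)
```

The same shape works for the "kind err" fallback, and it keeps $date[0] and $kind[0] valid for the comparison later in the loop.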

SmOke_N

You're going to have us guess, without your code and an example file of what you've tried, aren't you :( ... ?

DCCD

Each text string that needs to be replaced may contain more than 500 characters.

SmOke_N

I'm sorry, I don't see the relevance to your statement/reply.

 

Edit:

This should speed up your script considerably.

#include <File.au3>
$path = @ScriptDir & '\xmlfo.xml'
$OXML = FileOpen($path, 256)
$XML = FileRead($OXML)
FileClose($OXML)

$term = 'post'
$nofr = 1
Local $aArray = StringRegExp($XML, '(?s)<entry[^>]*>.*?</entry>', 3)

If Not @error Then
    For $i = 0 To UBound($aArray) - 1
        ;get data start
        ;ConsoleWrite ( $aArray[0] &' '&$i& @CRLF)
        $date = StringRegExp($aArray[$i], '(?i)<published>(.*?)</published>', 3)
        If @error Then
            $date = StringRegExp("date err", "(.{33,}?(?:\s)|.+)", 3)
        ElseIf Not @error Then
            ;ConsoleWrite($date[0] & ' ' & $i & @CRLF)
        EndIf
        $kind = StringRegExp($aArray[$i], '(?i)<category>(.*?)</category>', 3)
        If @error Then
            $kind = StringRegExp("kind err", "(.{33,}?(?:\s)|.+)", 3)
            ;ConsoleWrite ( $kind[0] &' '&$i& @CRLF)
        ElseIf Not @error Then
            ;ConsoleWrite ( $kind[0] &' '&$i& @CRLF)
        EndIf
        If $kind[0] = $term And Data(getdate($date[0], 'year'), getdate($date[0], 'month')) = True Then
            
            $XML = StringReplace($XML, $aArray[$i], '')
            ;_ReplaceStringInFile($path, $aArray[$i], '')

            If Not @error Then
                ;MsgBox(16,'',$XL)
                ConsoleWrite($nofr & ' ' & $i & @CRLF)
                $nofr = $nofr + 1
            EndIf
            ;FileDelete(@ScriptDir & '\XML_output.xml')
            ;FileWrite (@ScriptDir & '\XML_output.xml', StringToBinary ( StringReplace($temp, $aArray[$i], "") , 4) )
        Else
            ConsoleWrite ('err0x0'& @CRLF)
        EndIf
    Next
EndIf

Global $ghOpen = FileOpen($path, $FO_UTF8_NOBOM + $FO_OVERWRITE)
FileWrite($ghOpen, $XML)
FileClose($ghOpen)

Here, as suggested before, we open the file, read it, and write to it only once.

Your way, it was opening the file, reading it into memory, and writing it back once per loop iteration.

One thing is different: the FileOpen at the bottom of the script. You never told _ReplaceStringInFile how to write the data back to the file, so it was writing it in its default mode. I added $FO_UTF8_NOBOM strictly because that's how you opened the file in your code example (mode 256).

So you may want to back up your XML file before using this code (just FYI).
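
Taking the backup and writing the result in one step could look something like this. It is a minimal sketch, not part of the original reply: _WriteWithBackup is a hypothetical helper name and the .bak suffix is an arbitrary choice.

```autoit
#include <FileConstants.au3>

; Hypothetical helper: copy the original file to <name>.bak, then overwrite
; the original with $sData as UTF-8 without BOM.
; Returns True on success; on failure returns False and sets @error
; (1 = backup failed, 2 = could not open the file for writing).
Func _WriteWithBackup($sPath, $sData)
    ; Copy the original next to itself before touching it.
    If Not FileCopy($sPath, $sPath & '.bak', $FC_OVERWRITE) Then Return SetError(1, 0, False)

    ; Mode 256 used for reading in the first post equals $FO_UTF8_NOBOM,
    ; so the file is written back with the same encoding.
    Local $hOut = FileOpen($sPath, $FO_UTF8_NOBOM + $FO_OVERWRITE)
    If $hOut = -1 Then Return SetError(2, 0, False)

    FileWrite($hOut, $sData)
    FileClose($hOut)
    Return True
EndFunc
```

In the script above, the last three lines would then become a single call such as _WriteWithBackup($path, $XML), with the original still available as xmlfo.xml.bak if something goes wrong.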

DCCD

@SmOke_N, thank you for all your help ^_^ and I apologize for the late response.
