DCCD

replace multiple strings in 100mb file

DCCD

Hi, I wrote a script that replaces multiple strings in an XML file. It works fine, but it is very slow!

I've tried StringReplace, _ReplaceStringInFile, and StringRegExpReplace; all of them are equally slow.

The file needs about 8000 replacements.

Any help would be greatly appreciated

#include <File.au3>
$path = @ScriptDir & '\xmlfo.xml'
$OXML = FileOpen($path, 256)
$XML = FileRead($OXML)

$term = 'post'
$nofr = 1
Local $aArray = StringRegExp($XML, '(?s)<entry[^>]*>.*?</entry>', 3)
FileClose($OXML)
$XL = $XML
If Not @error Then
    For $i = 0 To UBound($aArray) - 1
        ;get data start
        ;ConsoleWrite ( $aArray[0] &' '&$i& @CRLF)
        $date = StringRegExp($aArray[$i], '(?i)<published>(.*?)</published>', 3)
        If @error Then
            $date = StringRegExp("date err", "(.{33,}?(?:\s)|.+)", 3)
        ElseIf Not @error Then
            ;ConsoleWrite($date[0] & ' ' & $i & @CRLF)
        EndIf
        $kind = StringRegExp($aArray[$i], '(?i)<category>(.*?)</category>', 3)
        If @error Then
            $kind = StringRegExp("kind err", "(.{33,}?(?:\s)|.+)", 3)
            ;ConsoleWrite ( $kind[0] &' '&$i& @CRLF)
        ElseIf Not @error Then
            ;ConsoleWrite ( $kind[0] &' '&$i& @CRLF)
        EndIf
        If $kind[0] = $term And Data(getdate($date[0], 'year'), getdate($date[0], 'month')) = True Then

            _ReplaceStringInFile($path, $aArray[$i], '')

            If Not @error Then
                ;MsgBox(16,'',$XL)
                ConsoleWrite($nofr & ' ' & $i & @CRLF)
                $nofr = $nofr + 1
            EndIf
            ;FileDelete(@ScriptDir & '\XML_output.xml')
            ;FileWrite (@ScriptDir & '\XML_output.xml', StringToBinary ( StringReplace($temp, $aArray[$i], "") , 4) )
        Else
            ConsoleWrite ('err0x0'& @CRLF)
        EndIf
    Next
EndIf
Edited by DCCD

SmOke_N

Well, you have a huge issue: you are loading and unloading 100 MB into memory over and over.

Every call to _ReplaceStringInFile opens the file twice.

So... 2 suggestions I can think of.

1.  Ditch _ReplaceStringInFile() and read the file into memory just once. Enumerate each line, keep what you want, and remove what you don't. This requires a second string to write back to the file; I say a second string because _ArrayDelete ReDims the array on every call.

2.  Read the file in chunks and repeat step 1.

Edit:

If this is some type of database script, SQLite would make a lot more sense.
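
Suggestion 1 could look roughly like the sketch below. It is only an illustration under several assumptions: `_FilterEntries` is a hypothetical helper name, entries are assumed not to be nested, only the `<category>` check from the original script is applied (the `Data()` and `getdate()` helpers are not shown in the thread, so the date condition is left out), and the whole file is assumed to fit in memory. The idea is to split the text on the closing tag and build a second string from the pieces that should be kept, so the file contents are walked once instead of once per replacement.

```autoit
#include <StringConstants.au3>

; Hypothetical helper (not from the original script): returns $sXml with every
; <entry>...</entry> block removed whose <category> text equals $sTerm.
Func _FilterEntries($sXml, $sTerm)
    ; Split on the closing tag; every piece except the last one ends an entry.
    Local $aParts = StringSplit($sXml, '</entry>', $STR_ENTIRESPLIT + $STR_NOCOUNT)
    Local $sOut = '', $iStart, $aKind, $bDrop

    For $i = 0 To UBound($aParts) - 2
        $bDrop = False
        ; Position of the opening tag inside this piece (0 if there is none).
        $iStart = StringInStr($aParts[$i], '<entry')
        If $iStart Then
            ; Flag 1 returns the capture groups of the first match, searching from $iStart.
            $aKind = StringRegExp($aParts[$i], '(?i)<category>(.*?)</category>', 1, $iStart)
            If Not @error Then $bDrop = ($aKind[0] = $sTerm)
        EndIf

        If $bDrop Then
            ; Keep only the text that came before the unwanted entry.
            $sOut &= StringLeft($aParts[$i], $iStart - 1)
        Else
            ; Keep the piece and restore the closing tag removed by the split.
            $sOut &= $aParts[$i] & '</entry>'
        EndIf
    Next

    ; The last piece is whatever follows the final closing tag.
    $sOut &= $aParts[UBound($aParts) - 1]
    Return $sOut
EndFunc

; Usage with a tiny made-up sample: the "post" entry is dropped, the "page" entry is kept.
Local $sSample = '<feed><entry><category>post</category></entry>' & _
        '<entry><category>page</category></entry></feed>'
ConsoleWrite(_FilterEntries($sSample, 'post') & @CRLF)
```

Whether this is actually faster on a 100 MB file depends on how AutoIt handles the repeated string concatenation, so it is worth timing against a copy of the real file.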

Edited by SmOke_N

Common sense plays a role in the basics of understanding AutoIt... If you're lacking in that, do us all a favor, and step away from the computer.

kylomas

DCCD,

See jdelaney's sig for working with XML files directly. 

kylomas



"I like pigs.  Dogs look up to us.  Cats look down on us.  Pigs treat us as equals."

- Sir Winston Churchill

jguinch

This: StringRegExp("date err", "(.{33,}?(?:\s)|.+)", 3)

and this: StringRegExp("kind err", "(.{33,}?(?:\s)|.+)", 3)

make no sense...

Can you post a sample of your XML file and explain to us exactly what you want to replace, and with what?
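
To illustrate the point about those two calls: the subject is a fixed 8-character string, so the `.{33,}?` branch can never match and the `.+` branch simply returns the whole string. Assuming the only goal is a one-element fallback array for the later `$date[0]` and `$kind[0]` lookups, a plain array declaration gives the same result. The variable names below are placeholders, not taken from the script.

```autoit
; Fallback values as one-element arrays, without going through the regex engine.
Local $aDateFallback[1] = ["date err"]
Local $aKindFallback[1] = ["kind err"]

; The original call, kept here only for comparison.
Local $aViaRegExp = StringRegExp("date err", "(.{33,}?(?:\s)|.+)", 3)

ConsoleWrite($aDateFallback[0] & ' | ' & $aViaRegExp[0] & @CRLF)
```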

SmOke_N

You're going to have us guess without your code and an example file of what you've tried aren't you :( ... ?

DCCD


Each text string that needs to be replaced may contain more than 500 characters/numbers.

SmOke_N

I'm sorry, I don't see the relevance to your statement/reply.

 

Edit:

This should speed up your script considerably.

#include <File.au3>
$path = @ScriptDir & '\xmlfo.xml'
$OXML = FileOpen($path, 256)
$XML = FileRead($OXML)
FileClose($OXML)

$term = 'post'
$nofr = 1
Local $aArray = StringRegExp($XML, '(?s)<entry[^>]*>.*?</entry>', 3)

If Not @error Then
    For $i = 0 To UBound($aArray) - 1
        ;get data start
        ;ConsoleWrite ( $aArray[0] &' '&$i& @CRLF)
        $date = StringRegExp($aArray[$i], '(?i)<published>(.*?)</published>', 3)
        If @error Then
            $date = StringRegExp("date err", "(.{33,}?(?:\s)|.+)", 3)
        ElseIf Not @error Then
            ;ConsoleWrite($date[0] & ' ' & $i & @CRLF)
        EndIf
        $kind = StringRegExp($aArray[$i], '(?i)<category>(.*?)</category>', 3)
        If @error Then
            $kind = StringRegExp("kind err", "(.{33,}?(?:\s)|.+)", 3)
            ;ConsoleWrite ( $kind[0] &' '&$i& @CRLF)
        ElseIf Not @error Then
            ;ConsoleWrite ( $kind[0] &' '&$i& @CRLF)
        EndIf
        If $kind[0] = $term And Data(getdate($date[0], 'year'), getdate($date[0], 'month')) = True Then
            
            $XML = StringReplace($XML, $aArray[$i], '')
            ;_ReplaceStringInFile($path, $aArray[$i], '')

            If Not @error Then
                ;MsgBox(16,'',$XL)
                ConsoleWrite($nofr & ' ' & $i & @CRLF)
                $nofr = $nofr + 1
            EndIf
            ;FileDelete(@ScriptDir & '\XML_output.xml')
            ;FileWrite (@ScriptDir & '\XML_output.xml', StringToBinary ( StringReplace($temp, $aArray[$i], "") , 4) )
        Else
            ConsoleWrite ('err0x0'& @CRLF)
        EndIf
    Next
EndIf

Global $ghOpen = FileOpen($path, $FO_UTF8_NOBOM + $FO_OVERWRITE)
FileWrite($ghOpen, $XML)
FileClose($ghOpen)

Here, as suggested before, we open, read, and write the file only once.

Your way, it was opening the file, reading it into memory, and writing it back once for every iteration of the loop.

One thing is different: the FileOpen at the bottom of the script. You never told _ReplaceStringInFile how to write the data back to the file, so it was writing it in its regular mode. I added $FO_UTF8_NOBOM strictly because that's how you opened the file in your code example.

So you may want to back up your XML file before using this code (just FYI).
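
A minimal sketch of that backup step, assuming a copy next to the original is good enough: `_WriteWithBackup` is a hypothetical helper name and the `.bak` suffix is an arbitrary choice. It copies the file first, then overwrites it as UTF-8 without BOM, and reports whether every step succeeded.

```autoit
#include <FileConstants.au3>

; Hypothetical helper (not part of the script above): back up $sPath, then
; overwrite it with $sData as UTF-8 without BOM.
; Returns True on success, False if the backup, open, or write step failed.
Func _WriteWithBackup($sPath, $sData)
    ; Keep a copy next to the original, replacing any older backup.
    If Not FileCopy($sPath, $sPath & '.bak', $FC_OVERWRITE) Then Return False

    Local $hFile = FileOpen($sPath, $FO_OVERWRITE + $FO_UTF8_NOBOM)
    If $hFile = -1 Then Return False

    ; FileWrite returns 1 on success and 0 on failure.
    Local $bOk = (FileWrite($hFile, $sData) = 1)
    FileClose($hFile)
    Return $bOk
EndFunc
```

In the script above, the last three lines could then become something like `If Not _WriteWithBackup($path, $XML) Then ConsoleWrite('write failed' & @CRLF)`.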

Edited by SmOke_N
DCCD


@SmOke_N, Thank you for all your help  ^_^ and I apologize for the late response :sweating:

