XML to SQLite



Hello there,

I'm successfully importing XML files into a SQLite database, parsing them with the UDFs from _XMLDomWrapper.au3.

However, my script takes too long to parse the XML files (about three times longer than opening the same XML file directly in Excel).

This is mainly for two reasons:

1. I'm using too many If statements, because I need to check the integrity of the data I'm importing (imagine that somewhere along the file one of the fields is missing: it would ruin the import process);

2. Probably there are better and safer ways of accomplishing this goal.

The script sequence is essentially as described below:

1. Open the XML file

2. Get the node count for parent nodes (populate parent nodes)

3. For each parent node, get the node count for its child nodes (populate child nodes)

4. For each child node, get its field names

5. Compare every field name with the intended field names, to make sure the import of that child node will succeed

6. For each child node, get the values from every field

7. Transform the array into a formatted string (obeying another 3 or 4 rules so that the final string is compatible with the SQL INSERT command)

(Repeat from step 4 until there are no more child nodes for that parent node, then from step 3 until there are no more parent nodes)

Initially the script took double the time, because I was retrieving field values individually, rather than grouped by child node.

So, do you have any suggestions? Alternative methods, perhaps?

What about ADO?

I tried some stuff, but with no success:

http://support.microsoft.com/?scid=kb%3Ben-us%3B263247&x=11&y=9

http://msdn.microsoft.com/en-us/library/aa468547.aspx

I couldn't write an equivalent for those lines in AutoIt without getting COM errors (XML is invalid??)...

EDIT (2): These errors happen because ADO can only handle a specific XML format. Most XML files you find in the wild don't comply with that format.
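For reference, the format ADO expects is its own persisted-recordset XML (what Recordset.Save with adPersistXML produces). A rough sketch from memory of what that looks like — treat the namespace UUIDs and attribute details as an assumption, and the field names as illustrative:

```xml
<xml xmlns:s="uuid:BDC6E3F0-6DA3-11d1-A2A3-00AA00C14882"
     xmlns:dt="uuid:C2F41010-65B3-11d1-A29F-00AA00C14882"
     xmlns:rs="urn:schemas-microsoft-com:rowset"
     xmlns:z="#RowsetSchema">
  <s:Schema id="RowsetSchema">
    <s:ElementType name="row" content="eltOnly">
      <s:AttributeType name="TITLE" rs:number="1"/>
      <s:AttributeType name="ARTIST" rs:number="2"/>
      <s:extends type="rs:rowbase"/>
    </s:ElementType>
  </s:Schema>
  <rs:data>
    <!-- each record is a z:row element with the fields as attributes -->
    <z:row TITLE="Empire Burlesque" ARTIST="Bob Dylan"/>
  </rs:data>
</xml>
```

A typical hand-written XML file (elements per field, like the cd_catalog example below) doesn't match this schema, which would explain the "XML is invalid" COM errors.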

Your thoughts would be most appreciated.

footswitch

EDIT (1): Here's an example code of what "my" XML parsing looks like (also check out the attached file xml_parsing.zip):

;#include <Array.au3>
#include <_XMLDomWrapper.au3>

#region Load XML
$xml_file="cd_catalog.xml"
$xml_string=FileRead($xml_file)
_XMLLoadXML($xml_string)
If @error Then
MsgBox(16,"Error",_XMLError(""))
Exit
EndIf
#endregion Load XML

$path = "/MAIN_PARENT/CATALOG"

$kids = _XMLGetChildNodes($path&"/*")
;_ArrayDisplay($kids)

$rows_total=_XMLGetNodeCount($path&"/CD")
ConsoleWrite("Total rows: " & $rows_total & @CRLF)
If $rows_total<=0 Then
MsgBox(16,"Error","The selected XML file doesn't contain any useful data.")
__ExitError()
EndIf
;Exit


__ReadXMLtoSQLite()



Func __ExitError()
;_SQLite_Close()
;_SQLite_Shutdown()
Exit
EndFunc


Func __ReadXMLtoSQLite()
$rows_progress=0 ; tracks how much of the file has been processed
;_SQLite_Exec(-1,"BEGIN IMMEDIATE;") ; start the SQL transaction - data is written in memory

$progress=0
$progress_lastupdate=0

$timer=TimerInit() ; start the timer

For $h = 1 To _XMLGetNodeCount($path) ; for each CATALOG
 ConsoleWrite("-> CATALOG " & $h & @CRLF)
 For $i = 1 To _XMLGetNodeCount($path&"[" & $h & "]/CD") ; for each CD in each CATALOG
 $rows_progress += 1

 $progress=Floor(($rows_progress/$rows_total)*100)
 If @error Then $progress=0

 #region show progress
 If $progress>$progress_lastupdate Then
 ConsoleWrite("Progress: " & $progress & " %" & @CRLF)
 $progress_lastupdate=$progress
 EndIf
 #endregion show progress

 #region Count Checking
 $aValues = ""
 $sValue = ""

 $aValues = _XMLGetValue($path & "[" & $h & "]/CD[" & $i & "]/" & $kids[1])
 If IsArray($aValues) Then $sValue=$aValues[1]
 If $sValue<>1 Then
 ; this doesn't apply to this XML file, but it's something I have to check with my XML reports.
 EndIf
 #endregion Count Checking

 $aValues = ""
 $sValues = ""

 ;#cs
 #region field checking

 ; the following verification is essential, because there's the possibility
 ; of having a node which doesn't contain a certain field, or the field isn't in the same order as expected
 ; This requires around 50% more processing time (e.g. instead of 20 seconds, it'll take 30)
 $aValues = _XMLGetChildNodes($path & "[" & $h & "]/*[" & $i & "]")
 ;_ArrayDisplay($aValues)

 If IsArray($aValues) Then
 If $aValues[0]<>6 Then
    MsgBox(16,"Error","Record no. " & $rows_progress & " contains " & $aValues[0] & " fields, when it should contain 6." & @CRLF & _
    "The XML file is not correct. The program will exit.")
    __ExitError()
 ElseIf $aValues[1]<>"TITLE" Or $aValues[2]<>"ARTIST" Or $aValues[3]<>"COUNTRY" Or $aValues[4]<>"COMPANY" Or _
    $aValues[5]<>"PRICE" Or $aValues[6]<>"YEAR" Then
    MsgBox(16,"Error","The headers don't match. We're in trouble. I'm gonna run now.")
    __ExitError()
 EndIf
 EndIf
 #endregion field checking
 ;#ce

 #region get values
 $aValues = _XMLGetValue($path & "[" & $h & "]/*[" & $i & "]/*")
 ;_ArrayDisplay($aValues)

 If Not IsArray($aValues) Then
 MsgBox(16,"Error","Record no. " & $rows_progress & " was not read correctly. The program will exit.")
 __ExitError()
 EndIf

 If $aValues[3]=="" Then ; if this is blank (it should always be 1 or 2), then the whole row would be blank. avoid writing blank lines
 ConsoleWrite("-> Line " & $rows_progress & " is blank." & @CRLF)
 $write=False ; invalid line / no content
 ContinueLoop
 Else
 $sValues &= $aValues[3] & ","
 EndIf

 $sValues &= '"'&StringReplace($aValues[4],'"',"'")&'"'&","

 If $aValues[5]=="" Then
 $sValues &="NULL,"
 Else
 $sValues &= $aValues[5] & ","
 EndIf

 $sValues &= $aValues[6] & ","
 #endregion get values

 #region insert in SQL
 ;_SQLite_Exec(-1, "INSERT INTO tblPicagens VALUES (" & $sValues & ");") ; remove the trailing commas and replace ';' with ','
 ;If $iRval <> $SQLITE_OK Then ConsoleWrite("SQLite Error: "&_SQLite_ErrMsg ()&@CRLF)
 #endregion insert in SQL
 Next
Next

ConsoleWrite("-> It took " & TimerDiff($timer) & " milliseconds." & @CRLF)

Return
EndFunc
Edited by footswitch

One thing that might help to speed things up is to get SQLITE to use a buffer while you do all your inserts.

Quite a bit of time may be lost updating the table on each insert, because the database file is saved every time.

_SQLite_Exec(-1,"BEGIN;")

Process all of your XML files

1. Open the XML file

2. Get the node count for parent nodes (populate parent nodes)

3. For each parent node, get the node count for its child nodes (populate child nodes)

4. For each child node, get its field names

5. Compare every field name with the intended field names, to make sure the import of that child node will succeed

6. For each child node, get the values from every field

7. Transform the array into a formatted string (obeying another 3 or 4 rules so that the final string is compatible with the SQL INSERT command)

(Repeat from step 4 until there are no more child nodes for that parent node, then from step 3 until there are no more parent nodes)

_SQLite_Exec(-1,"COMMIT;")


One other thing that may speed things up: rather than opening the XML file with the XML DOM directly, you could read the file into memory and parse the string with the DOM.

$XMLStr = FileRead("C:\File.xml")

_XMLLoadXML($XMLStr)


@ChrisL

While I globally agree with you, bracketing a group of RMW SQL statements inside a transaction has nothing to do with "buffers".

The helicopter-view effect of a transaction is to get out of autocommit mode for the duration of the transaction, which keeps the pager free to minimize disk writes. It doesn't, however, stop the writing of a journal file, so even large transactions never get to zero disk I/O. It's correct that bulk insert/update statements see a dramatic speed improvement when grouped in chunks of a few thousand, the best figure depending on each situation.

BTW the OP is using several concurrent processes at times, hence using a long enough SQLite timeout and BEGIN IMMEDIATE/COMMIT is _much_ more robust in his case.
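To make that concrete, here's a minimal AutoIt sketch of the chunked BEGIN IMMEDIATE/COMMIT idea with a long busy timeout. The database filename, chunk size and table name are illustrative, not from this thread, and the row-building step is elided:

```autoit
#include <SQLite.au3>
#include <SQLite.dll.au3>

_SQLite_Startup()
Local $hDB = _SQLite_Open("import.db")  ; hypothetical database file
_SQLite_SetTimeout($hDB, 30000)         ; wait up to 30 s if another process holds the lock

_SQLite_Exec($hDB, "BEGIN IMMEDIATE;")  ; take the write lock up front
For $i = 1 To $rows_total
    ; ... build $sValues for row $i from the XML ...
    _SQLite_Exec($hDB, "INSERT INTO tblPicagens VALUES (" & $sValues & ");")
    If Mod($i, 5000) = 0 Then           ; commit in chunks of a few thousand rows
        _SQLite_Exec($hDB, "COMMIT;")
        _SQLite_Exec($hDB, "BEGIN IMMEDIATE;")
    EndIf
Next
_SQLite_Exec($hDB, "COMMIT;")           ; flush the final partial chunk

_SQLite_Close($hDB)
_SQLite_Shutdown()
```

Unlike a plain BEGIN, BEGIN IMMEDIATE fails (or waits, with the timeout set) right away if another writer holds the lock, instead of failing later at COMMIT time.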



Thank you for your observations, gentlemen.

The SQL transaction is already wrapped in BEGIN / COMMIT. The time SQLite consumes in this scenario is negligible compared to the XML parsing.

Loading the XML file into memory is an interesting suggestion, though.

Surprisingly enough, I hadn't even thought of it ;)

It's a 10% improvement in speed.

Thank you.

Now I'm still wondering where the other 60% are... :blink:
