re-ordering an xml file


Will66

I'm using pdftohtml (sourceforge.net/projects/pdftohtml) to create an XML file.

The output is not perfect, and I need to sort the nodes by their top and left attributes to get the proper page flow, using Microsoft.XMLDOM via COM.

<text top="75" left="57" width="3" height="10" font="0"> </text>

<text top="77" left="120" width="80" height="14" font="1">SECTION 1</text>

<text top="77" left="270" width="39" height="14" font="1">text1</text>

<text top="77" left="323" width="35" height="14" font="1">text2</text>

<text top="59" left="57" width="5" height="14" font="1"> </text>

<text top="59" left="136" width="49" height="14" font="1">DOWN</text>

<text top="59" left="268" width="765" height="14" font="1">text 3 </text>

<text top="59" left="1082" width="9" height="14" font="1">1</text>

Note that the last nodes in document order in this example have top=59... pdftohtml sets these attributes to approximate each node's position on the page.

So I need to re-sort these nodes by their top and then left attributes to get the proper document layout, so I can parse the node tree. In the above example <text top="75" left="57" ... is node[0] when I need <text top="59" left="268" to be node[0].

XQuery (http://www.w3schools.com/xquery/xquery_flwor.asp) is not an option in this instance because I can't see how to use it via COM.

I can use XPath, but it doesn't support sorting.

#include <Array.au3>
Dim $xmlDoc

$xmlDoc = ObjCreate("Microsoft.XMLDOM")
$xmlDoc.async = False ; load synchronously
$xmlDoc.load("p9.xml")

$xtract = $xmlDoc.getElementsByTagName('text')

; Print each node's position and text in document order
For $i = 0 To $xtract.length - 1
    ConsoleWrite("top:" & $xtract($i).getAttribute('top') & @CRLF)
    ConsoleWrite("left:" & $xtract($i).getAttribute('left') & @CRLF)
    ConsoleWrite("text:" & $xtract($i).text & @CRLF)
Next
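
Since XPath alone cannot sort, one option that stays within the same COM object may be XSLT, whose xsl:sort can order nodes by numeric attributes. The following is only a minimal sketch. It assumes the installed MSXML supports XSLT 1.0 (MSXML 3.0 or later), that p9.xml sits next to the script, and that it is acceptable for the sorted <text> elements to be written after any other children of their parent. The variable names ($sXsl, $oSorted and so on) are made up for this example.

```autoit
; Sketch only: sort <text> elements by top, then left, with XSLT via MSXML.
; Identity transform, except that <text> children are emitted in sorted order.
Local $sXsl = _
    '<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">' & _
    '<xsl:template match="@*|node()">' & _
    '<xsl:copy>' & _
    '<xsl:apply-templates select="@*"/>' & _
    '<xsl:apply-templates select="node()[not(self::text)]"/>' & _
    '<xsl:apply-templates select="text">' & _
    '<xsl:sort select="@top" data-type="number"/>' & _
    '<xsl:sort select="@left" data-type="number"/>' & _
    '</xsl:apply-templates>' & _
    '</xsl:copy>' & _
    '</xsl:template>' & _
    '</xsl:stylesheet>'

; Load the source document (assumed file name and location).
Local $oXml = ObjCreate("Microsoft.XMLDOM")
$oXml.async = False
If Not $oXml.load(@ScriptDir & "\p9.xml") Then Exit 1

; Load the stylesheet from the string above.
Local $oXsl = ObjCreate("Microsoft.XMLDOM")
$oXsl.async = False
If Not $oXsl.loadXML($sXsl) Then Exit 2

; transformNode returns the transformed document as a string;
; parse it again so the sorted nodes can be walked through the DOM.
Local $oSorted = ObjCreate("Microsoft.XMLDOM")
$oSorted.async = False
If Not $oSorted.loadXML($oXml.transformNode($oXsl)) Then Exit 3

Local $oNodes = $oSorted.getElementsByTagName("text")
Local $oNode
For $i = 0 To $oNodes.length - 1
    $oNode = $oNodes.item($i)
    ConsoleWrite($oNode.getAttribute("top") & "," & $oNode.getAttribute("left") & ": " & $oNode.text & @CRLF)
Next
```

The data-type="number" attribute matters here: without it xsl:sort compares the attribute values as text, so "120" would sort before "59".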

This isn't XML or HTML as far as I know:

<text top="75" left="57" width="3" height="10" font="0"> </text>

One of the reasons XML exists is to separate data from page formatting. XML is used to store data. Maybe what you really want is HTML. Using XSL with XML is one way to format the display of the data in the XML.

Yes, of course... as explained, that's what pdftohtml dumps out, though.

I'm not looking to display the XML, just to parse the nodes in page reading order so I can further manipulate the data for a db etc.


This might get you going on the right path. It stores the values in an array for later sorting:

#include <Array.au3>
Dim $xmlDoc
Dim $TopAttrArr[3000]
Dim $LeftAttrArr[3000]
Dim $TextArr[3000]
$xmlDoc = ObjCreate("Microsoft.XMLDOM")
$xmlDoc.Async = "false"
$xmlDoc.Load("p9.xml")
$xtract = $xmlDoc.getElementsByTagName('text')
For $i = 0 To $xtract.length - 1
    $TopAttrArr[$i] = $xtract($i).getAttribute('top')
    $LeftAttrArr[$i] = $xtract($i).getAttribute('left')
    $TextArr[$i] = $xtract($i).Text
Next

AutoIt has some good sorting functions these days. So if you know the DOM and the COM syntax, we can help a lot with the rest.

I guess you found this function in the AutoIt Help file:

;Sort a multiple dimensional Array.

#include <Array.au3>

_ArraySort

Edited by Squirrely1

Das Häschen benutzt Radar


This seems like a better starting place: a three-dimensional array, since you want to sort something by two other things, top and left:

#include <Array.au3>
Dim $xmlDoc
; Note: 3000 x 3000 x 3000 elements is far more than AutoIt allows in one array,
; and $a and $b are placeholders that are not defined yet.
Dim $p9Arr[3000][3000][3000]
$xmlDoc = ObjCreate("Microsoft.XMLDOM")
$xmlDoc.Async = "false"
$xmlDoc.Load("p9.xml")
$xtract = $xmlDoc.getElementsByTagName('text')
For $i = 0 To $xtract.length - 1
    $p9Arr[$a][$b][$i] = $xtract($i).getAttribute('top')
    $p9Arr[$a][$i][$b] = $xtract($i).getAttribute('left')
    $p9Arr[$i][$a][$b] = $xtract($i).Text
Next

But now you have to work out which incrementers should be in the array element indexes and how to increment them: $p9Arr[$i][$a][$b]

With a three-dimensional array, everything will be linked already, and you can sort the one array a few times, by various criteria, using _ArraySort.

I think that figuring out the criteria for sorting is going to be easy. It's the incrementing that's hard. If you do that, we at the forums will handle the rest.

Edited by Squirrely1


Well you had this line in your script:

#include <Array.au3>

so it seemed you knew all about _ArraySort, which is good for sorting even three-dimensional arrays - look in the AutoIt Help file > User Defined Functions > Array Management.

We can sort your array if it is built in an organized way. I don't use multi-dimensional arrays very much myself, and I find single-dimensional arrays easier to work with. Maybe a two-dimensional array is all you need, though.

Edited by Squirrely1


Okay, this is simpler than I thought - here is your code except I need to build the sorting routine next:

#include <Array.au3>
Dim $xmlDoc
Dim $p9Arr[3000][3]
$xmlDoc = ObjCreate("Microsoft.XMLDOM")
$xmlDoc.Async = "false"
$xmlDoc.Load("p9.xml")
$xtract = $xmlDoc.getElementsByTagName('text')
For $i = 0 To $xtract.length - 1
    $p9Arr[$i][0] = $xtract($i).getAttribute('top')
    $p9Arr[$i][1] = $xtract($i).getAttribute('left')
    $p9Arr[$i][2] = $xtract($i).Text
Next

Now we just need you to describe very precisely by what criteria you wish to sort this array, where $p9Arr[$i][0] contains the top attribute. I could suppose: only by that, from smallest number to largest?

Edited by Squirrely1


I researched _ArraySort first up.

Unfortunately I can't find a clear example of sorting a multi-dimensional array.

An example would be helpful.

If this were VBScript I would have little trouble, but the way AutoIt implements multi-dimensional arrays is quite different, from what I can tell.

Yes, sort by top, then left, ascending (min to max).
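
A two-key sort like this can also be written by hand, which avoids relying on _ArraySort's parameter list (as far as I know it has changed between AutoIt releases). Below is a minimal sketch using a stable insertion sort on a 2D array whose column 0 holds top and column 1 holds left. _SortByTopLeft and _RowAfter are hypothetical helper names, not part of Array.au3, and the demo rows are taken from the sample nodes in the first post.

```autoit
; Sketch only: stable insertion sort on column 0 (top), then column 1 (left),
; both ascending. Helper names are made up for this example.
Func _SortByTopLeft(ByRef $aRows)
    Local $iCols = UBound($aRows, 2) ; number of columns
    Local $aHold[$iCols]             ; copy of the row being inserted
    Local $j
    For $i = 1 To UBound($aRows) - 1
        For $c = 0 To $iCols - 1
            $aHold[$c] = $aRows[$i][$c]
        Next
        $j = $i - 1
        ; Shift every earlier row that belongs after the held row one slot down.
        While $j >= 0
            If Not _RowAfter($aRows[$j][0], $aRows[$j][1], $aHold[0], $aHold[1]) Then ExitLoop
            For $c = 0 To $iCols - 1
                $aRows[$j + 1][$c] = $aRows[$j][$c]
            Next
            $j -= 1
        WEnd
        For $c = 0 To $iCols - 1
            $aRows[$j + 1][$c] = $aHold[$c]
        Next
    Next
EndFunc

; True when position (top1, left1) belongs after (top2, left2).
; Number() guards against attribute values that are still strings.
Func _RowAfter($vTop1, $vLeft1, $vTop2, $vLeft2)
    If Number($vTop1) <> Number($vTop2) Then Return Number($vTop1) > Number($vTop2)
    Return Number($vLeft1) > Number($vLeft2)
EndFunc

; Demo with four of the sample nodes: top, left, text.
Local $aDemo[4][3] = [[77, 120, "SECTION 1"], [59, 268, "text 3"], [77, 270, "text1"], [59, 136, "DOWN"]]
_SortByTopLeft($aDemo)
For $i = 0 To UBound($aDemo) - 1
    ConsoleWrite($aDemo[$i][0] & " " & $aDemo[$i][1] & " " & $aDemo[$i][2] & @CRLF)
Next
```

With the demo rows above, the output order should be DOWN, text 3, SECTION 1, text1. Insertion sort is quadratic, which ought to be fine for a few thousand nodes per file but would be slow for very large inputs.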

Edited by Will66

OK, this seems to be working, but I can't follow the syntax for sorting:

#include <Array.au3>
_ArraySort ( ByRef $a_Array [, $i_Descending [, $i_Base=0 [, $i_Ubound=0 [, $i_Dim=1 [, $i_SortIndex=0 ]]]]] )

#include <Array.au3>
Dim $xmlDoc
$xmlDoc = ObjCreate("Microsoft.XMLDOM")
$xmlDoc.Async = "false"
$xmlDoc.Load("p9.xml")
$xtract = $xmlDoc.getElementsByTagName('text')
Dim $p9Arr[$xtract.length][3]

For $i = 0 To $xtract.length - 1
    $p9Arr[$i][0] = Number($xtract($i).getAttribute('top'))
    $p9Arr[$i][1] = Number($xtract($i).getAttribute('left'))
    $p9Arr[$i][2] = $xtract($i).Text
Next
;_ArraySort($p9Arr[0][0][0])
_ArrayDisplay($p9Arr)

First we need to clean the empty elements out of the array somehow. Run this and see if it throws an error on the ReDim statement. If so, we need to remove the semicolon at the front of the line before the ReDim statement:

#include <Array.au3>
Dim $xmlDoc
Dim $p9Arr[3000][3]
$xmlDoc = ObjCreate("Microsoft.XMLDOM")
$xmlDoc.Async = "false"
$xmlDoc.Load("p9.xml")
$xtract = $xmlDoc.getElementsByTagName('text')
For $i = 0 To $xtract.length - 1
    $p9Arr[$i][0] = $xtract($i).getAttribute('top')
    $p9Arr[$i][1] = $xtract($i).getAttribute('left')
    $p9Arr[$i][2] = $xtract($i).Text
Next
;$i -= 1
ReDim $p9Arr[$i][3]

Sorry - I had to re-do the ReDim statement ...

Edited by Squirrely1


That's fine, no error, but I dimmed it using Dim $p9Arr[$xtract.length][3] anyway, per my previous post.

How do I sort it?


Okay dude, this might be your whole code, but it might throw errors and need just a little work:

#include <Array.au3>
Dim $xmlDoc
Dim $p9Arr[3000][3]
$xmlDoc = ObjCreate("Microsoft.XMLDOM")
$xmlDoc.Async = "false"
$xmlDoc.Load("p9.xml")
$xtract = $xmlDoc.getElementsByTagName('text')
For $i = 0 To $xtract.length - 1
    $p9Arr[$i][0] = $xtract($i).getAttribute('top')
    $p9Arr[$i][1] = $xtract($i).getAttribute('left')
    $p9Arr[$i][2] = $xtract($i).Text
Next
;$i -= 1
ReDim $p9Arr[$i][3]
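; Arguments, assuming the old-style signature quoted earlier in the thread:
; array, descending (0 = ascending), first index, last index,
; number of columns, column to sort by (0 = top, 1 = left).
; Caveat: two passes only give a top-then-left order if equal keys keep their
; existing order, which _ArraySort is not guaranteed to do.
_ArraySort($p9Arr, 0, 0, UBound($p9Arr) - 1, 3, 1) ; by left
_ArraySort($p9Arr, 0, 0, UBound($p9Arr) - 1, 3, 0) ; then by top

ConsoleWrite(@CR)
ConsoleWrite("Take a quick look at the last values and the first to validate your data: ")

ConsoleWrite(@CR)
For $i = 0 To UBound($p9Arr) - 1 ; last valid index is UBound - 1
    ConsoleWrite($p9Arr[$i][0] & @CR)
Next

ConsoleWrite(@CR)
For $i = 0 To UBound($p9Arr) - 1
    ConsoleWrite($p9Arr[$i][1] & @CR)
Next

ConsoleWrite(@CR)
For $i = 0 To UBound($p9Arr) - 1
    ConsoleWrite($p9Arr[$i][2] & @CR)
Next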
:D


You were right about _ArraySort - it will apparently only work on one or two dimensional arrays, going by the "Remarks" section, despite its description at the top in the Help file - "Sorts a multiple dimensional array".

Edited by Squirrely1


I'm not sure I understand that fourth parameter in _ArraySort very well, but somebody on the forums does.

Using arrays was my first inclination for a solution. Maybe regular expressions can help... another thing I suck at :D


Now I think that fourth parameter might just be the highest element number in the second dimension.

Now this edition takes into account that you usually have to subtract 1 from Ubound ... and I am not sure of the wisdom of sorting according to 'left', so I rem'd that out:

#include <Array.au3>

Dim $xmlDoc
Dim $p9Arr[3000][3]

$xmlDoc = ObjCreate("Microsoft.XMLDOM")
$xmlDoc.Async = "false"
$xmlDoc.Load("p9.xml")
Sleep(1300)

$xtract = $xmlDoc.getElementsByTagName('text')
Sleep(1800)

For $i = 0 To $xtract.length - 1
    $p9Arr[$i][0] = $xtract($i).getAttribute('top')
    $p9Arr[$i][1] = $xtract($i).getAttribute('left')
    $p9Arr[$i][2] = $xtract($i).Text
Next
Sleep(1300)

;$i -= 1

ReDim $p9Arr[$i][3]
Sleep(1300)

; Ascending sort on column 0 (top). Arguments assume the old-style signature:
; array, descending, first index, last index, number of columns, sort column.
_ArraySort($p9Arr, 0, 0, UBound($p9Arr) - 1, 3, 0)
Sleep(1300)

;_ArraySort($p9Arr, 0, 0, UBound($p9Arr) - 1, 3, 1) ; sort on column 1 (left)
;Sleep(1300)

ConsoleWrite(@CR)
ConsoleWrite("Take a quick look at the last values and the first to validate your data: ")

ConsoleWrite(@CR)
For $i = 0 To UBound($p9Arr) - 1 ; last valid index is UBound - 1
    ConsoleWrite($p9Arr[$i][0] & @CR)
Next

;ConsoleWrite(@CR)
;For $i = 0 To UBound($p9Arr) - 1
;   ConsoleWrite($p9Arr[$i][1] & @CR)
;Next

ConsoleWrite(@CR)
For $i = 0 To UBound($p9Arr) - 1
    ConsoleWrite($p9Arr[$i][2] & @CR)
Next
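
On the point about subtracting 1 from UBound: UBound returns the number of elements in a dimension, not the highest index, so a zero-based loop has to stop at UBound - 1. A small sketch with a made-up three-row array ($aRows is just an example name):

```autoit
; Sketch only: UBound gives sizes, so the last valid index is UBound - 1.
Local $aRows[3][2] = [[59, "a"], [75, "b"], [77, "c"]]

ConsoleWrite(UBound($aRows) & @CRLF)    ; size of the first dimension (rows): 3
ConsoleWrite(UBound($aRows, 2) & @CRLF) ; size of the second dimension (columns): 2

; Looping To UBound($aRows) would touch index 3, which does not exist.
For $i = 0 To UBound($aRows) - 1
    ConsoleWrite($aRows[$i][0] & " " & $aRows[$i][1] & @CRLF)
Next
```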

