Will66, posted January 30, 2008:

I'm using pdftohtml (sourceforge.net/projects/pdftohtml) to create an XML file. The output is not perfect, and I need to sort the nodes by their top and left attributes to get the proper page flow, using Microsoft.XMLDOM via COM.

    <text top="75" left="57" width="3" height="10" font="0"> </text>
    <text top="77" left="120" width="80" height="14" font="1">SECTION 1</text>
    <text top="77" left="270" width="39" height="14" font="1">text1</text>
    <text top="77" left="323" width="35" height="14" font="1">text2</text>
    <text top="59" left="57" width="5" height="14" font="1"> </text>
    <text top="59" left="136" width="49" height="14" font="1">DOWN</text>
    <text top="59" left="268" width="765" height="14" font="1">text 3 </text>
    <text top="59" left="1082" width="9" height="14" font="1">1</text>

Note that in the document flow the last node in this example has top=59. These values are placed by pdftohtml to approximate position. So I need to re-sort these nodes by top, then by left, to get the proper document layout so I can parse the node tree. In the above example <text top="75" left="57" ... is node[0] when I need <text top="59" left="268" ... to be node[0].

XQuery (http://www.w3schools.com/xquery/xquery_flwor.asp) is not an option in this instance because I can't see how to use it via COM. I can use XPath, but it doesn't support sorting.

    #include <Array.au3>

    Dim $xmlDoc
    $xmlDoc = ObjCreate("Microsoft.XMLDOM")
    $xmlDoc.async = "false"
    $xmlDoc.load("p9.xml")
    $xtract = $xmlDoc.getElementsByTagName('text')

    ; Print each node's position and text in document order
    For $i = 0 To $xtract.length - 1
        ConsoleWrite("top:" & $xtract($i).getAttribute('top') & @CRLF)
        ConsoleWrite("left:" & $xtract($i).getAttribute('left') & @CRLF)
        ConsoleWrite("text:" & $xtract($i).text & @CRLF)
    Next

Will66 (Author), posted January 30, 2008:

Quote:
    This isn't XML or HTML as far as I know:
        <text top="75" left="57" width="3" height="10" font="0"> </text>
    One of the points of the existence of XML is to separate data from page formatting. XML is used to store data. Maybe what you really want is HTML. Using XSL with XML is one way to format the display of the data in the XML.

Yes, of course. As explained, that's what pdftohtml dumps out though. I'm not looking to display the XML, just to parse the nodes in page reading order so I can further manipulate the data for a database etc.

Squirrely1, posted January 30, 2008 (edited January 30, 2008 by Squirrely1):

This might get you going on the right path. It stores the values in arrays for later sorting:

    #include <Array.au3>

    Dim $xmlDoc
    Dim $TopAttrArr[3000]
    Dim $LeftAttrArr[3000]
    Dim $TextArr[3000]

    $xmlDoc = ObjCreate("Microsoft.XMLDOM")
    $xmlDoc.Async = "false"
    $xmlDoc.Load("p9.xml")
    $xtract = $xmlDoc.getElementsByTagName('text')

    For $i = 0 To $xtract.length - 1
        $TopAttrArr[$i] = $xtract($i).getAttribute('top')
        $LeftAttrArr[$i] = $xtract($i).getAttribute('left')
        $TextArr[$i] = $xtract($i).Text
    Next

AutoIt has some good sorting functions these days. So if you know the DOM and the COM syntax, we can help a lot with the rest. I guess you found this function in the AutoIt Help file:

    ;Sort a multiple dimensional Array.
    #include <Array.au3>
    _ArraySort

Das Häschen benutzt Radar

Squirrely1, posted January 30, 2008 (edited January 30, 2008 by Squirrely1):

This seems like a better starting place: use a three-dimensional array, since you want to sort something by two other things, top and left:

    #include <Array.au3>

    Dim $xmlDoc
    Dim $p9Arr[3000][3000][3000]

    $xmlDoc = ObjCreate("Microsoft.XMLDOM")
    $xmlDoc.Async = "false"
    $xmlDoc.Load("p9.xml")
    $xtract = $xmlDoc.getElementsByTagName('text')

    ; $a and $b are placeholder indexes that are not defined yet
    For $i = 0 To $xtract.length - 1
        $p9Arr[$a][$b][$i] = $xtract($i).getAttribute('top')
        $p9Arr[$a][$i][$b] = $xtract($i).getAttribute('left')
        $p9Arr[$i][$a][$b] = $xtract($i).Text
    Next

But now you have to work out which incrementers should be in the array element indexes and how to increment them:

    $p9Arr[$i][$a][$b]

With a three-dimensional array, everything will be linked already, and you can sort the one array a few times, by various criteria, using _ArraySort. I think that figuring out the criteria for sorting is going to be easy. It's this incrementing thing that's hard. If you do that, we at the forums will handle the rest.

Will66 (Author), posted January 30, 2008:

Squirrely1 wrote: "This seems like a better starting place, to use a three dimensional array, since you want to sort something by two other things - top and left:"

I'll re-check, but I don't think you can sort an array with more than two dimensions.

Squirrely1, posted January 30, 2008 (edited January 30, 2008 by Squirrely1):

Well, you had this line in your script:

    #include <Array.au3>

so it seemed you knew all about _ArraySort, which is good for sorting even three-dimensional arrays. Look in the AutoIt Help file > User Defined Functions > Array Management. We can sort your array if it is built in an organized way.

I don't use multi-dimensional arrays very much myself, and I would find it easier to work with single-dimensional arrays. Maybe a two-dimensional array is all you need, though.

Squirrely1, posted January 30, 2008 (edited January 30, 2008 by Squirrely1):

Okay, this is simpler than I thought. Here is your code, except that I still need to build the sorting routine:

    #include <Array.au3>

    Dim $xmlDoc
    Dim $p9Arr[3000][3]

    $xmlDoc = ObjCreate("Microsoft.XMLDOM")
    $xmlDoc.Async = "false"
    $xmlDoc.Load("p9.xml")
    $xtract = $xmlDoc.getElementsByTagName('text')

    ; One row per node: column 0 = top, column 1 = left, column 2 = text
    For $i = 0 To $xtract.length - 1
        $p9Arr[$i][0] = $xtract($i).getAttribute('top')
        $p9Arr[$i][1] = $xtract($i).getAttribute('left')
        $p9Arr[$i][2] = $xtract($i).Text
    Next

Now we just need you to describe very precisely by what criteria you wish to sort this array, where $p9Arr[$i][0] contains the top attribute. I could suppose: only by that, and from smallest number to largest?

Will66 (Author), posted January 30, 2008 (edited January 30, 2008 by Will66):

Squirrely1 wrote: "Well you had this line in your script: #include <Array.au3> so it seemed you knew all about _ArraySort which is good for sorting even three-dimensional arrays - look in the AutoIt Help file > User Defined Functions > Array Management. We can sort your array if it is built in an organized way. I don't myself use multi-dimensional arrays very much, and I would find it easier to work with single-dimensional arrays. Maybe a two-dimensional array is all you need, though."

I researched _ArraySort first. Unfortunately I can't find a clear example of sorting a multi-dimensional array. An example would be helpful. If this was VBScript I would have little trouble, but the way AutoIt implements multi-dimensional arrays is quite different from what I can tell.

Yes: sort by top, then by left, ascending (min to max).
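
Editor's sketch: the criteria stated above (top first, then left, both ascending) can be met without relying on any particular _ArraySort behaviour, by comparing rows on both columns in a small hand-written insertion sort. The helper names _RowComesBefore and _SortByTopLeft are hypothetical, the rows are the sample values from the first post, and the code assumes columns 0 and 1 already hold numbers (for example via Number()) rather than attribute strings.

```autoit
; Rows are [top, left, text]; values copied from the sample XML in the first post.
Global $aRows[8][3] = [ _
        [75, 57, " "], _
        [77, 120, "SECTION 1"], _
        [77, 270, "text1"], _
        [77, 323, "text2"], _
        [59, 57, " "], _
        [59, 136, "DOWN"], _
        [59, 268, "text 3 "], _
        [59, 1082, "1"]]

_SortByTopLeft($aRows)

For $i = 0 To UBound($aRows) - 1
    ConsoleWrite($aRows[$i][0] & ", " & $aRows[$i][1] & ": " & $aRows[$i][2] & @CRLF)
Next

; Hypothetical helper: True when row $iA must be placed before row $iB.
Func _RowComesBefore(ByRef $aData, $iA, $iB)
    ; Different top values decide the order; equal tops fall back to left.
    If $aData[$iA][0] <> $aData[$iB][0] Then Return $aData[$iA][0] < $aData[$iB][0]
    Return $aData[$iA][1] < $aData[$iB][1]
EndFunc

; Hypothetical helper: insertion sort over the rows of a 2D array.
Func _SortByTopLeft(ByRef $aData)
    Local $vTmp, $j
    For $i = 1 To UBound($aData) - 1
        $j = $i
        ; Move row $j upwards while it belongs before the row above it.
        While $j > 0
            If Not _RowComesBefore($aData, $j, $j - 1) Then ExitLoop
            For $c = 0 To UBound($aData, 2) - 1
                $vTmp = $aData[$j][$c]
                $aData[$j][$c] = $aData[$j - 1][$c]
                $aData[$j - 1][$c] = $vTmp
            Next
            $j -= 1
        WEnd
    Next
EndFunc
```

Insertion sort is quadratic, which should be acceptable for a page's worth of nodes but may be slow for very large files. Note that the whitespace-only node at top=59, left=57 sorts ahead of the "text 3" node, so empty nodes may need to be filtered out separately.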

Will66 (Author), posted January 30, 2008:

OK, this seems to be working, but I can't follow the syntax for sorting:

    #include <Array.au3>
    _ArraySort ( ByRef $a_Array [, $i_Descending [, $i_Base=0 [, $i_Ubound=0 [, $i_Dim=1 [, $i_SortIndex=0 ]]]]] )

    #include <Array.au3>

    Dim $xmlDoc
    $xmlDoc = ObjCreate("Microsoft.XMLDOM")
    $xmlDoc.Async = "false"
    $xmlDoc.Load("p9.xml")
    $xtract = $xmlDoc.getElementsByTagName('text')

    ; Size the array to the node count; store top and left as numbers
    Dim $p9Arr[$xtract.length][3]
    For $i = 0 To $xtract.length - 1
        $p9Arr[$i][0] = Number($xtract($i).getAttribute('top'))
        $p9Arr[$i][1] = Number($xtract($i).getAttribute('left'))
        $p9Arr[$i][2] = $xtract($i).Text
    Next

    ;_ArraySort($p9Arr[0][0][0])
    _ArrayDisplay($p9Arr)

Squirrely1, posted January 30, 2008 (edited January 30, 2008 by Squirrely1):

First we need to clean the empty elements out of the array. Run this and see if it throws an error on the ReDim statement. If so, we need to remove the semicolon at the front of the line before the ReDim statement:

    #include <Array.au3>

    Dim $xmlDoc
    Dim $p9Arr[3000][3]

    $xmlDoc = ObjCreate("Microsoft.XMLDOM")
    $xmlDoc.Async = "false"
    $xmlDoc.Load("p9.xml")
    $xtract = $xmlDoc.getElementsByTagName('text')

    For $i = 0 To $xtract.length - 1
        $p9Arr[$i][0] = $xtract($i).getAttribute('top')
        $p9Arr[$i][1] = $xtract($i).getAttribute('left')
        $p9Arr[$i][2] = $xtract($i).Text
    Next

    ; After the loop $i equals the number of nodes, so this trims the unused rows
    ;$i -= 1
    ReDim $p9Arr[$i][3]

Sorry, I had to re-do the ReDim statement.

Will66 (Author), posted January 30, 2008:

Squirrely1 wrote: "We need somehow first, to clean up the array of empty elements. Run this and see if it throws an error on that ReDim statement. If so, we need to remove the semi-colon that is at the front of the line before the ReDim statement:"

That's fine, no error, but I dimmed it using Dim $p9Arr[$xtract.length][3] anyway, per my previous post. How do I sort it?

Squirrely1, posted January 30, 2008:

Okay dude, this might be your whole code, but it might throw errors and need just a little work:

    #include <Array.au3>

    Dim $xmlDoc
    Dim $p9Arr[3000][3]

    $xmlDoc = ObjCreate("Microsoft.XMLDOM")
    $xmlDoc.Async = "false"
    $xmlDoc.Load("p9.xml")
    $xtract = $xmlDoc.getElementsByTagName('text')

    For $i = 0 To $xtract.length - 1
        $p9Arr[$i][0] = $xtract($i).getAttribute('top')
        $p9Arr[$i][1] = $xtract($i).getAttribute('left')
        $p9Arr[$i][2] = $xtract($i).Text
    Next

    ;$i -= 1
    ReDim $p9Arr[$i][3]

    _ArraySort($p9Arr, 1, 0, UBound($p9Arr), 3, 2)
    _ArraySort($p9Arr, 1, 0, UBound($p9Arr), 3, 3)

    ConsoleWrite(@CR)
    ConsoleWrite("Take a quick look at the last values and the first to validate your data: ")
    ConsoleWrite(@CR)

    ; The last valid row index is UBound($p9Arr) - 1
    For $i = 0 To UBound($p9Arr) - 1
        ConsoleWrite($p9Arr[$i][0] & @CR)
    Next
    ConsoleWrite(@CR)
    For $i = 0 To UBound($p9Arr) - 1
        ConsoleWrite($p9Arr[$i][1] & @CR)
    Next
    ConsoleWrite(@CR)
    For $i = 0 To UBound($p9Arr) - 1
        ConsoleWrite($p9Arr[$i][2] & @CR)
    Next

Squirrely1, posted January 30, 2008 (edited January 30, 2008 by Squirrely1):

You were right about _ArraySort. Going by the "Remarks" section, it will apparently only work on one- or two-dimensional arrays, despite its description at the top of the Help file page: "Sorts a multiple dimensional array".

Squirrely1, posted January 30, 2008:

I'm not sure I understand that fourth parameter in _ArraySort very well, but somebody on the forums does.

Will66 (Author), posted January 30, 2008:

Squirrely1 wrote: "I'm not sure I understand that fourth parameter in _ArraySort very well, but somebody on the forums does."

Using arrays was my first inclination for a solution. Maybe regular expressions can help... another thing I suck at.

Squirrely1, posted January 30, 2008:

Now I think that fourth parameter might just be the highest element number in the second dimension. This edition takes into account that you usually have to subtract 1 from UBound, and I am not sure of the wisdom of sorting according to 'left', so I commented that out:

    #include <Array.au3>

    Dim $xmlDoc
    Dim $p9Arr[3000][3]

    $xmlDoc = ObjCreate("Microsoft.XMLDOM")
    $xmlDoc.Async = "false"
    $xmlDoc.Load("p9.xml")
    Sleep(1300)
    $xtract = $xmlDoc.getElementsByTagName('text')
    Sleep(1800)

    For $i = 0 To $xtract.length - 1
        $p9Arr[$i][0] = $xtract($i).getAttribute('top')
        $p9Arr[$i][1] = $xtract($i).getAttribute('left')
        $p9Arr[$i][2] = $xtract($i).Text
    Next
    Sleep(1300)

    ;$i -= 1
    ReDim $p9Arr[$i][3]
    Sleep(1300)

    _ArraySort($p9Arr, 1, 0, UBound($p9Arr)-1, 3, 2)
    Sleep(1300)
    ;_ArraySort($p9Arr, 1, 0, UBound($p9Arr)-1, 3, 3)
    ;Sleep(1300)

    ConsoleWrite(@CR)
    ConsoleWrite("Take a quick look at the last values and the first to validate your data: ")
    ConsoleWrite(@CR)

    ; The last valid row index is UBound($p9Arr) - 1
    For $i = 0 To UBound($p9Arr) - 1
        ConsoleWrite($p9Arr[$i][0] & @CR)
    Next
    ;ConsoleWrite(@CR)
    ;For $i = 0 To UBound($p9Arr) - 1
    ;    ConsoleWrite($p9Arr[$i][1] & @CR)
    ;Next
    ConsoleWrite(@CR)
    For $i = 0 To UBound($p9Arr) - 1
        ConsoleWrite($p9Arr[$i][2] & @CR)
    Next
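
Editor's sketch: two back-to-back sorts on different columns only give a "top, then left" order if the second sort preserves the order left by the first, which this thread does not establish for _ArraySort. One way around that is to fold both values into a single key and sort once. The sketch below builds a zero-padded text key per row and calls _ArraySort in its simplest one-dimensional form. It assumes that top, left and the row count all stay below 100000 and that the values are whole numbers; the sample rows are taken from the first post.

```autoit
#include <Array.au3>

; Rows are [top, left, text]; a few sample values from the first post.
Global $aRows[4][3] = [ _
        [77, 270, "text1"], _
        [59, 268, "text 3 "], _
        [77, 120, "SECTION 1"], _
        [59, 136, "DOWN"]]

Global $iCount = UBound($aRows)
Global $aKeys[$iCount]

; Key layout: 5 digits of top, 5 digits of left, 5 digits of original row index.
For $i = 0 To $iCount - 1
    $aKeys[$i] = StringFormat("%05d%05d%05d", $aRows[$i][0], $aRows[$i][1], $i)
Next

; Plain ascending sort of a one-dimensional array.
_ArraySort($aKeys)

; Rebuild the rows in key order; the last five digits name the source row.
Global $aSorted[$iCount][3]
Global $iSrc
For $i = 0 To $iCount - 1
    $iSrc = Number(StringRight($aKeys[$i], 5))
    For $c = 0 To 2
        $aSorted[$i][$c] = $aRows[$iSrc][$c]
    Next
Next

For $i = 0 To $iCount - 1
    ConsoleWrite($aSorted[$i][0] & ", " & $aSorted[$i][1] & ": " & $aSorted[$i][2] & @CRLF)
Next
```

Because every key has the same width and contains only digits, text ordering and numeric ordering agree, so the result should not depend on how the sort compares its elements. When the values come from getAttribute they are strings, so wrapping them in Number() before building the key is the safer choice.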