Jump to content

RegExp - effective HTML table parsing (rows,columns)


Recommended Posts

I'm doing parsing of HTML file with <table>. I need to go through rows and columns of table, ideally to get two dimensional array.
I use this way with simple two levels of calling StrinRegExp() for rows and columns:

;~ $html = FileRead('table.html')
$html = '<tr><td>r1c1</td> <td>r1c2</td></tr>  <tr><td>r2c1</td> <td>r2c2</td></tr>  <tr><td>r3c1</td> <td>r3c2</td></tr>'

$rows = StringRegExp($html, '(?s)(?i)<tr>(.*?)</tr>', 3)
For $i = 0 to UBound($rows) - 1
    $row = $rows[$i]
    ConsoleWrite("Row " & $i & ': ' & $row & @CRLF)

    $cols = StringRegExp($row, '(?s)(?i)<td>(.*?)</td>', 3)
    For $j = 0 to UBound($cols) - 1
        $col = $cols[$j]
        ConsoleWrite("  Col " & $j & ': ' & $col & @CRLF)
    Next
Next

Output:

Row 0:  <td>r1c1</td> <td>r1c2</td>
  Col 0: r1c1
  Col 1: r1c2
Row 1:  <td>r2c1</td> <td>r2c2</td>
  Col 0: r2c1
  Col 1: r2c2
Row 2:  <td>r3c1</td> <td>r3c2</td>
  Col 0: r3c1
  Col 1: r3c2

 

 

In my example there is called StringRegExp() for each row of table which is ineffective for many rows.

It works fine, but my question is if there is better and more effective approach, maybe some clever the only one RegExp pattern?

Or maybe using StringRegExp with option=4? I 'm not experienced with this option (array in array) and example in helpfile is not very clear to me so I don't know if this option=4 can be used also for HTML table parsing.

Edited by Zedna
Link to post
Share on other sites

I believe you can get rid of the question mark in the group (.*?) [but only if you use  (?U) at the start of your pattern :whistle:]. Your approach is the same as I would use. I've never used option=4. It would be nice to see more examples.

Edited by czardas
Link to post
Share on other sites

Well, not really sure, but here is a start point :

#include <Array.au3>


$html = '<tr><td>r1c1</td> <td>r1c2</td></tr>  <tr><td>r2c1</td> <td>r2c2</td></tr>  <tr><td>r3c1</td> <td>r3c2</td></tr>'


$aRes = StringRegExp($html, "(?is)<tr.*?>.*?(?:<td.*?>(.*?)<\/td>\s*)(?=(?:<td.*?>(.*?)<\/td>)?).*?<\/tr>", 4)

For $i = 0 To UBound($aRes) - 1
    If IsArray($aRes[$i]) Then
        $tab = $aRes[$i]
        _ArrayDisplay($tab)
    EndIf
Next


_ArrayDisplay($aRes)

I don't know why there is an empty last result...

Edit : I use <td.*?> to match this king of tag : <td id='toto' ....> (so out of your example)

Edited by jguinch
Link to post
Share on other sites
Edit : I use <td.*?> to match this king of tag : <td id='toto' ....> (so out of your example)

 

Nice example. Also silly me. The question mark is needed if you don't use (?U) at the start of the pattern. I tend to do that a lot which is why I thought the question mark wasn't needed. Sorry for the misinformation. :doh:

Edited by czardas
Link to post
Share on other sites

AFAIK there is no way to get a 2D array from a single regex :)

I personally use something like this

#include <Array.au3>

;~ $html = FileRead('table.html')
$html = '<tr><td>r1c1</td> <td>r1c2</td> </tr>  <tr><td>r2c1   </td> <td>r2c2</td><td>r2c3</td></tr>  <tr><td></td><td>r3c2</td> <td>   r3c3</td></tr>  <tr><td></td><td>r4c2</td> <td> </td></tr>'

$rows = StringRegExp($html, '(?is)<tr>(.*?)</tr>', 3)
Local $a[UBound($rows)][100], $icol = 0

For $i = 0 to UBound($rows) - 1
    $cols = StringRegExp($rows[$i], '(?is)<td>(.*?)</td>', 3)
    $icol = ($icol > UBound($cols)) ? $icol : UBound($cols)
    For $j = 0 to UBound($cols) - 1
        $a[$i][$j] = StringStripWS($cols[$j], 3)
    Next
Next
Redim $a[UBound($rows)][$icol]
_ArrayDisplay($a)
Link to post
Share on other sites

I couldn't get option=4 to work. :( Here's a slightly different approach.

#include <Array.au3>

Local $html = '<tr><td>r1c1</td> <td>r1c2</td></tr>  <tr><td>r2c1</td> <td>r2c2</td></tr>  <tr><td>r3c1</td> <td>r3c2</td></tr>'
Local $aRes = StringRegExp($html, "(?isU)<tr>|(?:<td>)(.*)(?:</td>)", 3)
_ArrayDisplay($aRes)
Edited by czardas
Link to post
Share on other sites

Every options return an 1 dimension array, so you cannot get a 2D array, unless parsing yourself :(

Option 4 is not very interesting. Think like that, the "pattern" can be broken into "part". With option 3, you find "part". With option 4, you find entire "pattern", which returned at the element 0 of the nested array. And other "parts" is returned at other index.

I think the snippet you wrote is the best we can do. If performance is too important, you can use a C HTML/XML parsing library and DllCall().

99 little bugs in the code

99 little bugs!

Take one down, patch it around

117 little bugs in the code!

Link to post
Share on other sites

@czardas : very nice !

I'm suprised about the <tr> capturing, but AutoIt captures everything if no parenthese is used, so everything is <tr> :geek:

BTW, I think you are using useless non-capturing groups. It's also OK with this: (?isU)<tr>|<td>(.*)</td>

(?|(<tr>)|<td>(.*)</td>) for compatibility with regex101.com for example

Edited by jguinch
Link to post
Share on other sites

...

I think the snippet you wrote is the best we can do. If performance is too important, you can use a C HTML/XML parsing library and DllCall().

 

Here is another parsing snipet optimized for speed, only with one StringRegExp()

It's based on the premise of known number of columns.

Number of columns ($cols_on_row) can be checked by StringRegExp() before main For/Next loop if needed.

Rows needn't to be parsed by StringRegExp(), instead rows can be calculated by Mod() function.

$html = '<tr><td>r1c1</td> <td>r1c2</td></tr>  <tr><td>r2c1</td> <td>r2c2</td></tr>  <tr><td>r3c1</td> <td>r3c2</td></tr>'

$cols_on_row = 2 ; known number of columns
$row = 0
$cols = StringRegExp($html, '(?s)(?i)<td>(.*?)</td>', 3)
For $i = 0 to UBound($cols) - 1
    $col = $cols[$i]
    If Mod($i,$cols_on_row) = 0 Then $row += 1
    ConsoleWrite("Row " & $row & "  Col " & Mod($i,$cols_on_row) & ': ' & $col & @CRLF)
Next

Output:

 

Row 1  Col 0: r1c1

Row 1  Col 1: r1c2

Row 2  Col 0: r2c1

Row 2  Col 1: r2c2

Row 3  Col 0: r3c1

Row 3  Col 1: r3c2

Edited by Zedna
Link to post
Share on other sites

And here is slightly modified version to distinguish code part for row and for column

$html = '<tr><td>r1c1</td> <td>r1c2</td></tr>  <tr><td>r2c1</td> <td>r2c2</td></tr>  <tr><td>r3c1</td> <td>r3c2</td></tr>'

$cols_on_row = 2 ; known number of columns
$row = 0
$cols = StringRegExp($html, '(?s)(?i)<td>(.*?)</td>', 3)
For $i = 0 to UBound($cols) - 1
    $col = $cols[$i]
    If Mod($i,$cols_on_row) = 0 Then
        $row += 1
        ConsoleWrite("Row " & $row & @CRLF)
    EndIf
    ConsoleWrite("  Col " & Mod($i,$cols_on_row) & ': ' & $col & @CRLF)
Next

Output:

Row 1
  Col 0: r1c1
  Col 1: r1c2
Row 2
  Col 0: r2c1
  Col 1: r2c2
Row 3
  Col 0: r3c1
  Col 1: r3c2

Edited by Zedna
Link to post
Share on other sites

It's very interesting, but you will also not get a 2D array :)
And this is a little modified version, dynamic column:

$html = '<tr><td>r1c1</td> <td>r1c2</td></tr>  <tr><td>r2c1</td> <td>r2c2</td></tr>  <tr><td>r3c1</td> <td>r3c2</td></tr><tr><td>r4c1</td> <td>r4c2</td></tr>'

Local $aArray = StringRegExp($html, '(<tr>)*<td>(.*?)</td>', 4)
Local $aMatch = 0
Local $row = -1
Local $nCounter = 0
For $i = 0 To UBound($aArray) - 1
    $aMatch = $aArray[$i]
    If (StringLeft($aMatch[0], 3) = '<tr') Then
        $row += 1
        $nCounter = 0
        ConsoleWrite("Row " & $row & @CRLF)
    EndIf
    $col = $aMatch[2]
    ConsoleWrite("  Col " & $nCounter & ': ' & $col & @CRLF)
    $nCounter += 1
Next

Or better, check close tag instead of open tag, and forget about the id/class/attributes.... Also, it will result smaller array.

$html = '<tr><td>r1c1</td> <td>r1c2</td></tr>  <tr><td>r2c1</td> <td>r2c2</td></tr>  <tr><td>r3c1</td> <td>r3c2</td></tr><tr><td>r4c1</td> <td>r4c2</td></tr>'

Local $aArray = StringRegExp($html, '<td>(.*?)</td>(</tr>)*', 4)
Local $aMatch = 0
Local $row = 0
Local $nCounter = 0
ConsoleWrite("Row " & $row & @CRLF)

For $i = 0 To UBound($aArray) - 1
    $aMatch = $aArray[$i]
    $col = $aMatch[1]
    ConsoleWrite("  Col " & $nCounter & ': ' & $col & @CRLF)
    $nCounter += 1

    If (StringRight($aMatch[0], 5) = '</tr>') Then
        $row += 1
        $nCounter = 0
        ; Last ConsoleWrite should be ignored
        ConsoleWrite("Row " & $row & @CRLF)
    EndIf
Next

Output uglier string, but more performance in works
 
Or more better, use For... In to eliminate the array copy when assign variable $aMatch:

$html = '<tr><td>r1c1</td> <td>r1c2</td></tr>  <tr><td>r2c1</td> <td>r2c2</td></tr>  <tr><td>r3c1</td> <td>r3c2</td></tr><tr><td>r4c1</td> <td>r4c2</td></tr>'

Local $aArray = StringRegExp($html, '<td>(.*?)</td>(</tr>)*', 4)
Local $aMatch = 0
Local $row = 0
Local $nCounter = 0
ConsoleWrite("Row " & $row & @CRLF)

If (IsArray($aArray)) Then
    For $aMatch In $aArray
        $col = $aMatch[1]
        ConsoleWrite("  Col " & $nCounter & ': ' & $col & @CRLF)
        $nCounter += 1

        If (StringRight($aMatch[0], 5) = '</tr>') Then
            $row += 1
            $nCounter = 0
            ; Last ConsoleWrite should be ignored
            ConsoleWrite("Row " & $row & @CRLF)
        EndIf
    Next
EndIf

And don't need flag=4, we can use flag=3, too, shorter (and maybe more performance) version:

$html = '<tr><td>r1c1</td> <td>r1c2</td></tr>  <tr><td>r2c1</td> <td>r2c2</td></tr>  <tr><td>r3c1</td> <td>r3c2</td></tr><tr><td>r4c1</td> <td>r4c2</td></tr>'

Local $aArray = StringRegExp($html, '<td>(.*?)</td>(</tr>)*', 3)
Local $aMatch = 0
Local $row = 0
Local $col = 0
ConsoleWrite("Row " & $row & @CRLF)

If (IsArray($aArray)) Then
    For $ele In $aArray
        If ($ele <> '</tr>') Then
            ConsoleWrite("  Col " & $col & ': ' & $ele & @CRLF)
            $col += 1
        Else
            $row += 1
            ConsoleWrite("Row " & $row & @CRLF)
            $col = 0
        EndIf
    Next
EndIf
Edited by binhnx

99 little bugs in the code

99 little bugs!

Take one down, patch it around

117 little bugs in the code!

Link to post
Share on other sites

Here is another parsing snipet optimized for speed, only with one StringRegExp()

It's based on the premise of known number of columns.

Number of columns ($cols_on_row) can be checked by StringRegExp() before main For/Next loop if needed.

Rows needn't to be parsed by StringRegExp(), instead rows can be calculated by Mod() function.

$html = '<tr><td>r1c1</td> <td>r1c2</td></tr>  <tr><td>r2c1</td> <td>r2c2</td></tr>  <tr><td>r3c1</td> <td>r3c2</td></tr>'

$cols_on_row = 2 ; known number of columns
$row = 0
$cols = StringRegExp($html, '(?s)(?i)<td>(.*?)</td>', 3)
For $i = 0 to UBound($cols) - 1
    $col = $cols[$i]
    If Mod($i,$cols_on_row) = 0 Then $row += 1
    ConsoleWrite("Row " & $row & "  Col " & Mod($i,$cols_on_row) & ': ' & $col & @CRLF)
Next

Output:

 

Here is modified version also with dynamic checking for number of columns from table header tags <th> </th>:

;~ $html = FileRead('table.html')
$html = '<tr><th>column1</th> <th>column2</th></tr>  <tr><td>r1c1</td> <td>r1c2</td></tr>  <tr><td>r2c1</td> <td>r2c2</td></tr>  <tr><td>r3c1</td> <td>r3c2</td></tr>'

$colnames = StringRegExp($html, '(?s)(?i)<th>(.*?)</th>', 3)
For $j = 0 to UBound($colnames) - 1
    ConsoleWrite("Col name " & $j & ': ' & $colnames[$j] & @CRLF)
Next
ConsoleWrite("Number of columns: " & UBound($colnames) & @CRLF & @CRLF)

$cols_on_row = UBound($colnames)
$row = 0
$cols = StringRegExp($html, '(?s)(?i)<td>(.*?)</td>', 3)

For $i = 0 to UBound($cols) - 1
    $col = $cols[$i]
    If Mod($i,$cols_on_row) = 0 Then
        $row += 1
        ConsoleWrite("Row " & $row & @CRLF)
    EndIf
    ConsoleWrite("  Col " & Mod($i,$cols_on_row) & ': ' & $col & @CRLF)
Next

Output:

Col name 0: column1

Col name 1: column2

Number of columns: 2

Row 1

  Col 0: r1c1

  Col 1: r1c2

Row 2

  Col 0: r2c1

  Col 1: r2c2

Row 3

  Col 0: r3c1

  Col 1: r3c2

 

 

This is final version which I will use in my project, because there is table with table header tags included.

Anyway thanks to all for given interesting RegExp ideas, feel free to add another ones ...  :-)

Edited by Zedna
Link to post
Share on other sites

A last one for me, with a mix of other codes :

#Include <Array.au3>

$sHtml = '<tr><th>column1</th> <th>column2</th></tr>  <tr><td>r1c1</td> <td>r1c2</td></tr>  <tr><td>r2c1</td> <td>r2c2</td></tr>  <tr><td>r3c1</td> <td>r3c2</td></tr>'

$aRes = StringRegExp($sHtml, "(?isU)(?|<(/)tr>\s*|<t[dh].*>(.*)</t[dh]>)", 3)

Local $aResult[ UBound($aRes) ] [ UBound($aRes) ]
Local $iRow = 0, $iCol = 0, $iMaxRow = 0

For $i = 0 To UBound($aRes) - 1
    If $aRes[$i] = "/" Then
        $iRow += 1
        $iCol = 0
    Else
        $aResult[$iRow][$iCol] = $aRes[$i]
        $iCol += 1
        If $iCol > $iMaxRow Then $iMaxRow = $iCol
    EndIf
Next

Redim $aResult[$iRow][$iMaxRow]

_ArrayDisplay($aResult)
Link to post
Share on other sites

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now
  • Recently Browsing   0 members

    No registered users viewing this page.

  • Similar Content

    • By SkysLastChance
      I am having trouble finding a good way to click these "button" below. 

      I only need to be able to click them when they have both yes/no. Otherwise I don't have to worry about them. For instance if they looked like this I would NOT have worry about clicking them and can just ignore them all togheter.(Below Picture)

      The problem is as mentioned in the title, all of the ID's  are dynamic. (Classes too)

      Here is what it looks like if yes is already selected.

      This is what I was using to select the the button. However, I need to know if the button has already been clicked/selected or not.
      _WD_LoadWait($sSession) $sElement = _WD_FindElement($sSession, $_WD_LOCATOR_ByXPath, "//span[text() = 'Offered access to electronic health information?']") Sleep(1000) _WD_ElementAction($sSession, $sElement, 'click') Sleep(500) _WD_Action($sSession, "actions", $sActionTab) Sleep(500) _WD_Action($sSession, "actions", $sActionEnter) Is there a way I can get the count of spans in the span class-"s_636" by tabbing over to the button? I am hoping someone might have some ideas on what I can try.
      Unfortunally, The site is for work so giving the site wont do any good. 
    • By RAMzor
      Hi everyone,
      I have this string:
      "main_lot      0x111” & @CRLF & “main_version          0xABC” & @CRLF & “main_number 0xDEAD123” & @CRLF & “main_version          0x333"
      And I'm trying to extract one specific hexadecimal number, actually main_version from this string by using StringRegExp:
      How to get 'ABC' from it?
      I'm not sure if the original string uses @CRLF, @CR or @LF as a line breaks (received from linux over ssh plink.exe) I have tried this code but it doesn't work 
      #include <Array.au3> $sLog = "main_lot 0x111” & @CRLF & “main_version 0xABC” & @CRLF & “main_number 0xDEAD123” & @CRLF & “main_version 0x333" $aVer = StringRegExp($sLog, "main_version\h*(.+)(?:0[xX][[:xdigit:]])", 3) _ArrayDisplay($aVer)  
       
    • By goku200
      I'm having an issue with my html paginated table. The script work as expected. It reads the html table and clicks on the Download button. However when it clicks on the next page its not iterating the items. instead it goes to the next URL from the spreadsheet and then iterates through the html table clicking the Download button and so on. Not sure why its doing that. I want it to click the next page and then continue iterating then after it has reached the end of the pagination go to the next url in the spreadsheet and repeat the process. Below is my script. Any help is appreciated 🙂
       
       
    • By jmp
      Hello.
      I have IETable,
      Get by this code
      $oTable = _IETableGetCollection ($oIE, 1) $aTableData = _IETableWriteToArray ($oTable) Local Const $iArrayNumberOfCols = UBound($aTableData, $UBOUND_COLUMNS) Local Const $iArrayNumberOfRows = UBound($aTableData, $UBOUND_ROWS) Local $aArraySubstringsRow[$iArrayNumberOfCols] ;~ Local $aExtract = _ArrayExtract ($aTableData, 1, 1, 1, -1) ;~ MsgBox(0, "", $iArrayNumberOfCols) ;~ _ArrayDisplay($aExtract) Local Const $iArrayRowIndex = 1 Local $sSubstring For $i = 0 To $iArrayNumberOfCols - 1     $sSubstring = StringLeft($aTableData[$iArrayRowIndex][$i], 2)     $aArraySubstringsRow[$i] = $sSubstring Next _ArrayDisplay($aArraySubstringsRow, "This is a row") and i want to use cell (52, 82, 18, 9,...10) one by one for selecting dropdown box in internet explorer.

      So, How to show/get/extract cell one by one (in msgbox)? 
    • By Hermes
      Hi, I have a site that has the following elements below: 
      <div>More element here</div> <div>More element here</div> <div>More element here</div> When I do this in Auto It:
      Local $oSelectDiv = _WD_FindElement($sSession, $_WD_LOCATOR_ByCSSSelector, "div") _WD_HighlightElement($sSession, $oSelectDiv, 1) I also tried to add [3], but it doesnt seems to work:
      Local $oSelectDiv = _WD_FindElement($sSession, $_WD_LOCATOR_ByCSSSelector, "div[3]") _WD_HighlightElement($sSession, $oSelectDiv, 1) It always highlight the first one, but I am trying to highlight the 3rd in the list. Is there anyway to select the 3rd div without having to add any class/id in the divs, and without using XPATH? The structure of the elements in that site were built that way.
×
×
  • Create New...