RegExp - effective HTML table parsing (rows,columns)

Zedna · October 15, 2014

I'm doing parsing of HTML file with <table>. I need to go through rows and columns of table, ideally to get two dimensional array.
I use this way with simple two levels of calling StrinRegExp() for rows and columns:

;~ $html = FileRead('table.html')
$html = '<tr><td>r1c1</td> <td>r1c2</td></tr>  <tr><td>r2c1</td> <td>r2c2</td></tr>  <tr><td>r3c1</td> <td>r3c2</td></tr>'

$rows = StringRegExp($html, '(?s)(?i)<tr>(.*?)</tr>', 3)
For $i = 0 to UBound($rows) - 1
    $row = $rows[$i]
    ConsoleWrite("Row " & $i & ': ' & $row & @CRLF)

    $cols = StringRegExp($row, '(?s)(?i)<td>(.*?)</td>', 3)
    For $j = 0 to UBound($cols) - 1
        $col = $cols[$j]
        ConsoleWrite("  Col " & $j & ': ' & $col & @CRLF)
    Next
Next

Output:

Row 0: <td>r1c1</td> <td>r1c2</td>
Col 0: r1c1
Col 1: r1c2
Row 1: <td>r2c1</td> <td>r2c2</td>
Col 0: r2c1
Col 1: r2c2
Row 2: <td>r3c1</td> <td>r3c2</td>
Col 0: r3c1
Col 1: r3c2

In my example there is called StringRegExp() for each row of table which is ineffective for many rows.

It works fine, but my question is if there is better and more effective approach, maybe some clever the only one RegExp pattern?

Or maybe using StringRegExp with option=4? I 'm not experienced with this option (array in array) and example in helpfile is not very clear to me so I don't know if this option=4 can be used also for HTML table parsing.

Edited October 15, 2014 by Zedna

czardas · October 15, 2014

~~I believe you can get rid of the question mark in the group (.*?)~~ [but only if you use (?U) at the start of your pattern :whistle: ]. Your approach is the same as I would use. I've never used option=4. It would be nice to see more examples.

Edited October 15, 2014 by czardas

jguinch · October 15, 2014

Well, not really sure, but here is a start point :

#include <Array.au3>


$html = '<tr><td>r1c1</td> <td>r1c2</td></tr>  <tr><td>r2c1</td> <td>r2c2</td></tr>  <tr><td>r3c1</td> <td>r3c2</td></tr>'


$aRes = StringRegExp($html, "(?is)<tr.*?>.*?(?:<td.*?>(.*?)<\/td>\s*)(?=(?:<td.*?>(.*?)<\/td>)?).*?<\/tr>", 4)

For $i = 0 To UBound($aRes) - 1
    If IsArray($aRes[$i]) Then
        $tab = $aRes[$i]
        _ArrayDisplay($tab)
    EndIf
Next


_ArrayDisplay($aRes)

I don't know why there is an empty last result...

Edit : I use <td.*?> to match this king of tag : <td id='toto' ....> (so out of your example)

Edited October 15, 2014 by jguinch

czardas · October 15, 2014

Edit : I use <td.*?> to match this king of tag : <td id='toto' ....> (so out of your example)

Nice example. Also silly me. The question mark is needed if you don't use (?U) at the start of the pattern. I tend to do that a lot which is why I thought the question mark wasn't needed. Sorry for the misinformation. :doh:

Edited October 15, 2014 by czardas

jguinch · October 15, 2014

Sorry, my example was not good. It works only with 2 cols, not more...

mikell · October 15, 2014

AFAIK there is no way to get a 2D array from a single regex

I personally use something like this

#include <Array.au3>

;~ $html = FileRead('table.html')
$html = '<tr><td>r1c1</td> <td>r1c2</td> </tr>  <tr><td>r2c1   </td> <td>r2c2</td><td>r2c3</td></tr>  <tr><td></td><td>r3c2</td> <td>   r3c3</td></tr>  <tr><td></td><td>r4c2</td> <td> </td></tr>'

$rows = StringRegExp($html, '(?is)<tr>(.*?)</tr>', 3)
Local $a[UBound($rows)][100], $icol = 0

For $i = 0 to UBound($rows) - 1
    $cols = StringRegExp($rows[$i], '(?is)<td>(.*?)</td>', 3)
    $icol = ($icol > UBound($cols)) ? $icol : UBound($cols)
    For $j = 0 to UBound($cols) - 1
        $a[$i][$j] = StringStripWS($cols[$j], 3)
    Next
Next
Redim $a[UBound($rows)][$icol]
_ArrayDisplay($a)

czardas · October 15, 2014

I couldn't get option=4 to work. Here's a slightly different approach.

#include <Array.au3>

Local $html = '<tr><td>r1c1</td> <td>r1c2</td></tr>  <tr><td>r2c1</td> <td>r2c2</td></tr>  <tr><td>r3c1</td> <td>r3c2</td></tr>'
Local $aRes = StringRegExp($html, "(?isU)<tr>|(?:<td>)(.*)(?:</td>)", 3)
_ArrayDisplay($aRes)

Edited October 15, 2014 by czardas

binhnx · October 15, 2014

Every options return an 1 dimension array, so you cannot get a 2D array, unless parsing yourself

Option 4 is not very interesting. Think like that, the "pattern" can be broken into "part". With option 3, you find "part". With option 4, you find entire "pattern", which returned at the element 0 of the nested array. And other "parts" is returned at other index.

I think the snippet you wrote is the best we can do. If performance is too important, you can use a C HTML/XML parsing library and DllCall().

jguinch · October 15, 2014

@czardas : very nice !

I'm suprised about the <tr> capturing, but AutoIt captures everything if no parenthese is used, so everything is <tr> :geek:

BTW, I think you are using useless non-capturing groups. It's also OK with this: (?isU)<tr>|<td>(.*)</td>

(?|(<tr>)|<td>(.*)</td>) for compatibility with regex101.com for example

Edited October 15, 2014 by jguinch

mikell · October 15, 2014

Not surprising at all as it captures both sides of the alternation

More noticeable trying this (?isU)<tr|<td>(.*)</td>

Edited October 15, 2014 by mikell

czardas · October 15, 2014

Nice examples jguinch and mikell. I think you're always learning new things with regexp - I know I am.

Zedna · October 15, 2014

...
I think the snippet you wrote is the best we can do. If performance is too important, you can use a C HTML/XML parsing library and DllCall().

Here is another parsing snipet optimized for speed, only with one StringRegExp()

It's based on the premise of known number of columns.

Number of columns ($cols_on_row) can be checked by StringRegExp() before main For/Next loop if needed.

Rows needn't to be parsed by StringRegExp(), instead rows can be calculated by Mod() function.

$html = '<tr><td>r1c1</td> <td>r1c2</td></tr>  <tr><td>r2c1</td> <td>r2c2</td></tr>  <tr><td>r3c1</td> <td>r3c2</td></tr>'

$cols_on_row = 2 ; known number of columns
$row = 0
$cols = StringRegExp($html, '(?s)(?i)<td>(.*?)</td>', 3)
For $i = 0 to UBound($cols) - 1
    $col = $cols[$i]
    If Mod($i,$cols_on_row) = 0 Then $row += 1
    ConsoleWrite("Row " & $row & "  Col " & Mod($i,$cols_on_row) & ': ' & $col & @CRLF)
Next

Output:

Row 1 Col 0: r1c1
Row 1 Col 1: r1c2
Row 2 Col 0: r2c1
Row 2 Col 1: r2c2
Row 3 Col 0: r3c1
Row 3 Col 1: r3c2

Edited October 15, 2014 by Zedna

Zedna · October 15, 2014

And here is slightly modified version to distinguish code part for row and for column

$html = '<tr><td>r1c1</td> <td>r1c2</td></tr>  <tr><td>r2c1</td> <td>r2c2</td></tr>  <tr><td>r3c1</td> <td>r3c2</td></tr>'

$cols_on_row = 2 ; known number of columns
$row = 0
$cols = StringRegExp($html, '(?s)(?i)<td>(.*?)</td>', 3)
For $i = 0 to UBound($cols) - 1
    $col = $cols[$i]
    If Mod($i,$cols_on_row) = 0 Then
        $row += 1
        ConsoleWrite("Row " & $row & @CRLF)
    EndIf
    ConsoleWrite("  Col " & Mod($i,$cols_on_row) & ': ' & $col & @CRLF)
Next

Output:

Row 1
Col 0: r1c1
Col 1: r1c2
Row 2
Col 0: r2c1
Col 1: r2c2
Row 3
Col 0: r3c1
Col 1: r3c2

Edited October 15, 2014 by Zedna

binhnx · October 15, 2014

It's very interesting, but you will also not get a 2D array
And this is a little modified version, dynamic column:

$html = '<tr><td>r1c1</td> <td>r1c2</td></tr>  <tr><td>r2c1</td> <td>r2c2</td></tr>  <tr><td>r3c1</td> <td>r3c2</td></tr><tr><td>r4c1</td> <td>r4c2</td></tr>'

Local $aArray = StringRegExp($html, '(<tr>)*<td>(.*?)</td>', 4)
Local $aMatch = 0
Local $row = -1
Local $nCounter = 0
For $i = 0 To UBound($aArray) - 1
    $aMatch = $aArray[$i]
    If (StringLeft($aMatch[0], 3) = '<tr') Then
        $row += 1
        $nCounter = 0
        ConsoleWrite("Row " & $row & @CRLF)
    EndIf
    $col = $aMatch[2]
    ConsoleWrite("  Col " & $nCounter & ': ' & $col & @CRLF)
    $nCounter += 1
Next

Or better, check close tag instead of open tag, and forget about the id/class/attributes.... Also, it will result smaller array.

$html = '<tr><td>r1c1</td> <td>r1c2</td></tr>  <tr><td>r2c1</td> <td>r2c2</td></tr>  <tr><td>r3c1</td> <td>r3c2</td></tr><tr><td>r4c1</td> <td>r4c2</td></tr>'

Local $aArray = StringRegExp($html, '<td>(.*?)</td>(</tr>)*', 4)
Local $aMatch = 0
Local $row = 0
Local $nCounter = 0
ConsoleWrite("Row " & $row & @CRLF)

For $i = 0 To UBound($aArray) - 1
    $aMatch = $aArray[$i]
    $col = $aMatch[1]
    ConsoleWrite("  Col " & $nCounter & ': ' & $col & @CRLF)
    $nCounter += 1

    If (StringRight($aMatch[0], 5) = '</tr>') Then
        $row += 1
        $nCounter = 0
        ; Last ConsoleWrite should be ignored
        ConsoleWrite("Row " & $row & @CRLF)
    EndIf
Next

Output uglier string, but more performance in works

Or more better, use For... In to eliminate the array copy when assign variable $aMatch:

$html = '<tr><td>r1c1</td> <td>r1c2</td></tr>  <tr><td>r2c1</td> <td>r2c2</td></tr>  <tr><td>r3c1</td> <td>r3c2</td></tr><tr><td>r4c1</td> <td>r4c2</td></tr>'

Local $aArray = StringRegExp($html, '<td>(.*?)</td>(</tr>)*', 4)
Local $aMatch = 0
Local $row = 0
Local $nCounter = 0
ConsoleWrite("Row " & $row & @CRLF)

If (IsArray($aArray)) Then
    For $aMatch In $aArray
        $col = $aMatch[1]
        ConsoleWrite("  Col " & $nCounter & ': ' & $col & @CRLF)
        $nCounter += 1

        If (StringRight($aMatch[0], 5) = '</tr>') Then
            $row += 1
            $nCounter = 0
            ; Last ConsoleWrite should be ignored
            ConsoleWrite("Row " & $row & @CRLF)
        EndIf
    Next
EndIf

And don't need flag=4, we can use flag=3, too, shorter (and maybe more performance) version:

$html = '<tr><td>r1c1</td> <td>r1c2</td></tr>  <tr><td>r2c1</td> <td>r2c2</td></tr>  <tr><td>r3c1</td> <td>r3c2</td></tr><tr><td>r4c1</td> <td>r4c2</td></tr>'

Local $aArray = StringRegExp($html, '<td>(.*?)</td>(</tr>)*', 3)
Local $aMatch = 0
Local $row = 0
Local $col = 0
ConsoleWrite("Row " & $row & @CRLF)

If (IsArray($aArray)) Then
    For $ele In $aArray
        If ($ele <> '</tr>') Then
            ConsoleWrite("  Col " & $col & ': ' & $ele & @CRLF)
            $col += 1
        Else
            $row += 1
            ConsoleWrite("Row " & $row & @CRLF)
            $col = 0
        EndIf
    Next
EndIf

Edited October 16, 2014 by binhnx

Zedna · October 16, 2014

Here is another parsing snipet optimized for speed, only with one StringRegExp()

It's based on the premise of known number of columns.

Number of columns ($cols_on_row) can be checked by StringRegExp() before main For/Next loop if needed.

Rows needn't to be parsed by StringRegExp(), instead rows can be calculated by Mod() function.
$html = '<tr><td>r1c1</td> <td>r1c2</td></tr>  <tr><td>r2c1</td> <td>r2c2</td></tr>  <tr><td>r3c1</td> <td>r3c2</td></tr>'

$cols_on_row = 2 ; known number of columns
$row = 0
$cols = StringRegExp($html, '(?s)(?i)<td>(.*?)</td>', 3)
For $i = 0 to UBound($cols) - 1
    $col = $cols[$i]
    If Mod($i,$cols_on_row) = 0 Then $row += 1
    ConsoleWrite("Row " & $row & "  Col " & Mod($i,$cols_on_row) & ': ' & $col & @CRLF)
Next
Output:

Here is modified version also with dynamic checking for number of columns from table header tags <th> </th>:

;~ $html = FileRead('table.html')
$html = '<tr><th>column1</th> <th>column2</th></tr>  <tr><td>r1c1</td> <td>r1c2</td></tr>  <tr><td>r2c1</td> <td>r2c2</td></tr>  <tr><td>r3c1</td> <td>r3c2</td></tr>'

$colnames = StringRegExp($html, '(?s)(?i)<th>(.*?)</th>', 3)
For $j = 0 to UBound($colnames) - 1
    ConsoleWrite("Col name " & $j & ': ' & $colnames[$j] & @CRLF)
Next
ConsoleWrite("Number of columns: " & UBound($colnames) & @CRLF & @CRLF)

$cols_on_row = UBound($colnames)
$row = 0
$cols = StringRegExp($html, '(?s)(?i)<td>(.*?)</td>', 3)

For $i = 0 to UBound($cols) - 1
    $col = $cols[$i]
    If Mod($i,$cols_on_row) = 0 Then
        $row += 1
        ConsoleWrite("Row " & $row & @CRLF)
    EndIf
    ConsoleWrite("  Col " & Mod($i,$cols_on_row) & ': ' & $col & @CRLF)
Next

Output:

Col name 0: column1
Col name 1: column2
Number of columns: 2
Row 1
Col 0: r1c1
Col 1: r1c2
Row 2
Col 0: r2c1
Col 1: r2c2
Row 3
Col 0: r3c1
Col 1: r3c2

This is final version which I will use in my project, because there is table with table header tags included.

Anyway thanks to all for given interesting RegExp ideas, feel free to add another ones ... :-)

Edited October 16, 2014 by Zedna

jguinch · October 16, 2014

A last one for me, with a mix of other codes :

#Include <Array.au3>

$sHtml = '<tr><th>column1</th> <th>column2</th></tr>  <tr><td>r1c1</td> <td>r1c2</td></tr>  <tr><td>r2c1</td> <td>r2c2</td></tr>  <tr><td>r3c1</td> <td>r3c2</td></tr>'

$aRes = StringRegExp($sHtml, "(?isU)(?|<(/)tr>\s*|<t[dh].*>(.*)</t[dh]>)", 3)

Local $aResult[ UBound($aRes) ] [ UBound($aRes) ]
Local $iRow = 0, $iCol = 0, $iMaxRow = 0

For $i = 0 To UBound($aRes) - 1
    If $aRes[$i] = "/" Then
        $iRow += 1
        $iCol = 0
    Else
        $aResult[$iRow][$iCol] = $aRes[$i]
        $iCol += 1
        If $iCol > $iMaxRow Then $iMaxRow = $iCol
    EndIf
Next

Redim $aResult[$iRow][$iMaxRow]

_ArrayDisplay($aResult)

Sign In

RegExp - effective HTML table parsing (rows,columns)

Recommended Posts

Zedna

czardas

jguinch

czardas

jguinch

mikell

czardas

binhnx

jguinch

mikell

czardas

Zedna

Zedna

binhnx

Zedna

jguinch

Create an account or sign in to comment

Create an account

Sign in

Similar Content

Loading Large Table for lookup

How to have a GIF Animation without borders over a PNG with transparency

[SOLVED] - How to tick the checkbox in the table (Internet Explorer)

Chrome Webdriver - Dynamic ID's

Extract hex number from string

Browse

AutoIt Resources

Release

Beta