RegExp - effective HTML table parsing (rows,columns)


Zedna

I'm parsing an HTML file that contains a <table>. I need to walk through the table's rows and columns, ideally ending up with a two-dimensional array.
I currently do it with two simple levels of StringRegExp() calls, one for rows and one for columns:

;~ $html = FileRead('table.html')
$html = '<tr><td>r1c1</td> <td>r1c2</td></tr>  <tr><td>r2c1</td> <td>r2c2</td></tr>  <tr><td>r3c1</td> <td>r3c2</td></tr>'

$rows = StringRegExp($html, '(?s)(?i)<tr>(.*?)</tr>', 3)
For $i = 0 to UBound($rows) - 1
    $row = $rows[$i]
    ConsoleWrite("Row " & $i & ': ' & $row & @CRLF)

    $cols = StringRegExp($row, '(?s)(?i)<td>(.*?)</td>', 3)
    For $j = 0 to UBound($cols) - 1
        $col = $cols[$j]
        ConsoleWrite("  Col " & $j & ': ' & $col & @CRLF)
    Next
Next

Output:

Row 0:  <td>r1c1</td> <td>r1c2</td>
  Col 0: r1c1
  Col 1: r1c2
Row 1:  <td>r2c1</td> <td>r2c2</td>
  Col 0: r2c1
  Col 1: r2c2
Row 2:  <td>r3c1</td> <td>r3c2</td>
  Col 0: r3c1
  Col 1: r3c2

In my example StringRegExp() is called once for every row of the table, which is inefficient when there are many rows.

It works fine, but my question is whether there is a better, more efficient approach, maybe a single clever RegExp pattern?

Or maybe StringRegExp with option=4? I'm not experienced with this option (array of arrays), and the example in the help file is not very clear to me, so I don't know whether option=4 can also be used for HTML table parsing.

Edited by Zedna

I believe you can get rid of the question mark in the group (.*?) [but only if you use  (?U) at the start of your pattern :whistle:]. Your approach is the same as I would use. I've never used option=4. It would be nice to see more examples.

Edited by czardas

Well, I'm not really sure, but here is a starting point:

#include <Array.au3>


$html = '<tr><td>r1c1</td> <td>r1c2</td></tr>  <tr><td>r2c1</td> <td>r2c2</td></tr>  <tr><td>r3c1</td> <td>r3c2</td></tr>'


$aRes = StringRegExp($html, "(?is)<tr.*?>.*?(?:<td.*?>(.*?)<\/td>\s*)(?=(?:<td.*?>(.*?)<\/td>)?).*?<\/tr>", 4)

For $i = 0 To UBound($aRes) - 1
    If IsArray($aRes[$i]) Then
        $tab = $aRes[$i]
        _ArrayDisplay($tab)
    EndIf
Next


_ArrayDisplay($aRes)

I don't know why there is an empty last result...

Edit: I use <td.*?> so that tags like <td id='toto' ....> also match (this goes beyond your example).
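
To illustrate the point about attributes, here is a minimal sketch. The sample row and the variable names are made up for illustration. It uses <td\b[^>]*> as an alternative to <td.*?>: [^>]* cannot run past the first >, and \b keeps the pattern from matching longer tag names that merely start with "td".

```autoit
; Sample row with attributes on the first cell (made-up data).
Local $sRow = '<tr><td id="toto" class="x">r1c1</td> <td>r1c2</td></tr>'

; \b ends the tag name; [^>]* skips any attributes up to the closing >.
Local $aCells = StringRegExp($sRow, '(?is)<td\b[^>]*>(.*?)</td>', 3)
If @error Then
    ConsoleWrite("No cells found" & @CRLF)
    Exit 1
EndIf

For $i = 0 To UBound($aCells) - 1
    ConsoleWrite("Col " & $i & ": " & $aCells[$i] & @CRLF)
Next
```

This should print r1c1 and r1c2, with the attributes ignored.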

Edited by jguinch
Nice example. Also silly me. The question mark is needed if you don't use (?U) at the start of the pattern. I tend to do that a lot which is why I thought the question mark wasn't needed. Sorry for the misinformation. :doh:
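
The difference can be checked with a small sketch like the following (the two-cell sample string and the variable names are assumptions for illustration). It runs the same pattern three ways: with the question mark, with (?U) instead of the question mark, and with neither.

```autoit
Local $sHtml = '<td>a</td><td>b</td>'

; Lazy quantifier: each .*? stops at the nearest </td>.
Local $aLazy = StringRegExp($sHtml, '(?s)<td>(.*?)</td>', 3)

; (?U) inverts greediness, so a plain .* is lazy here as well.
Local $aUngreedy = StringRegExp($sHtml, '(?sU)<td>(.*)</td>', 3)

; Neither ? nor (?U): .* is greedy and runs to the last </td>.
Local $aGreedy = StringRegExp($sHtml, '(?s)<td>(.*)</td>', 3)

; Number of captures found by each variant.
ConsoleWrite(UBound($aLazy) & " / " & UBound($aUngreedy) & " / " & UBound($aGreedy) & @CRLF)
; The single greedy capture swallows the tags between the two cells.
ConsoleWrite($aGreedy[0] & @CRLF)
```

The first two variants should each return two cells, while the greedy one returns a single capture spanning both cells.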

Edited by czardas

AFAIK there is no way to get a 2D array from a single regex :)

I personally use something like this

#include <Array.au3>

;~ $html = FileRead('table.html')
$html = '<tr><td>r1c1</td> <td>r1c2</td> </tr>  <tr><td>r2c1   </td> <td>r2c2</td><td>r2c3</td></tr>  <tr><td></td><td>r3c2</td> <td>   r3c3</td></tr>  <tr><td></td><td>r4c2</td> <td> </td></tr>'

$rows = StringRegExp($html, '(?is)<tr>(.*?)</tr>', 3)
Local $a[UBound($rows)][100], $icol = 0

For $i = 0 to UBound($rows) - 1
    $cols = StringRegExp($rows[$i], '(?is)<td>(.*?)</td>', 3)
    $icol = ($icol > UBound($cols)) ? $icol : UBound($cols)
    For $j = 0 to UBound($cols) - 1
        $a[$i][$j] = StringStripWS($cols[$j], 3)
    Next
Next
Redim $a[UBound($rows)][$icol]
_ArrayDisplay($a)

I couldn't get option=4 to work. :( Here's a slightly different approach.

#include <Array.au3>

Local $html = '<tr><td>r1c1</td> <td>r1c2</td></tr>  <tr><td>r2c1</td> <td>r2c2</td></tr>  <tr><td>r3c1</td> <td>r3c2</td></tr>'
Local $aRes = StringRegExp($html, "(?isU)<tr>|(?:<td>)(.*)(?:</td>)", 3)
_ArrayDisplay($aRes)
Edited by czardas

Every option returns a one-dimensional array, so you cannot get a 2D array unless you build it yourself :(

Option 4 is not very interesting. Think of it like this: the pattern can be broken into parts (the capturing groups). With option 3 you get only the parts. With option 4 you get one nested array per match: element 0 holds the text matched by the entire pattern, and the parts follow at the other indexes.
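
To make that layout concrete, here is a minimal sketch with a one-row sample string (the data and variable names are made up for illustration):

```autoit
Local $sHtml = '<tr><td>r1c1</td> <td>r1c2</td></tr>'

; Flag 4: the result is a 1D array with one nested array per match.
Local $aMatches = StringRegExp($sHtml, '(?is)<td>(.*?)</td>', 4)
If @error Then Exit 1

Local $aOne
For $i = 0 To UBound($aMatches) - 1
    $aOne = $aMatches[$i] ; copy the nested array out before indexing it
    ; [0] = text matched by the whole pattern, [1] = first capturing group
    ConsoleWrite($aOne[0] & "  ->  " & $aOne[1] & @CRLF)
Next
```

Each line of output should pair the full <td>...</td> match with the cell text captured inside it.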

I think the snippet you wrote is the best we can do. If performance really matters, you can use a C HTML/XML parsing library through DllCall().

99 little bugs in the code

99 little bugs!

Take one down, patch it around

117 little bugs in the code!


@czardas : very nice !

I'm surprised about the <tr> capturing, but AutoIt returns the whole match when no parentheses take part in it, so that element is simply <tr> :geek:

BTW, I think your non-capturing groups are unnecessary. This works too: (?isU)<tr>|<td>(.*)</td>

Or (?|(<tr>)|<td>(.*)</td>) for compatibility with regex101.com, for example.
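
As a sketch of how the flat result from that last pattern could be walked (the sample string and variable names are assumptions for illustration): in the branch reset group (?|...) both alternatives write to group 1, so every match contributes exactly one element, either the literal <tr> marker or a cell text.

```autoit
Local $sHtml = '<tr><td>r1c1</td> <td>r1c2</td></tr>  <tr><td>r2c1</td> <td>r2c2</td></tr>'

; (?|...) is a branch reset group: both alternatives fill group 1.
Local $aRes = StringRegExp($sHtml, '(?isU)(?|(<tr>)|<td>(.*)</td>)', 3)
If @error Then Exit 1

Local $iRow = -1, $iCol = 0
For $i = 0 To UBound($aRes) - 1
    If $aRes[$i] = "<tr>" Then
        ; Row marker: start a new row and reset the column counter.
        $iRow += 1
        $iCol = 0
    Else
        ConsoleWrite("Row " & $iRow & "  Col " & $iCol & ": " & $aRes[$i] & @CRLF)
        $iCol += 1
    EndIf
Next
```

Because the row marker travels in the same array as the cells, no column count has to be known in advance.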

Edited by jguinch
Here is another parsing snippet optimized for speed, with only one StringRegExp() call.

It is based on the premise that the number of columns is known.

If needed, the number of columns ($cols_on_row) can be determined with StringRegExp() before the main For/Next loop.

Rows don't need to be parsed with StringRegExp(); the row and column indexes can instead be calculated with the Mod() function.

$html = '<tr><td>r1c1</td> <td>r1c2</td></tr>  <tr><td>r2c1</td> <td>r2c2</td></tr>  <tr><td>r3c1</td> <td>r3c2</td></tr>'

$cols_on_row = 2 ; known number of columns
$row = 0
$cols = StringRegExp($html, '(?s)(?i)<td>(.*?)</td>', 3)
For $i = 0 to UBound($cols) - 1
    $col = $cols[$i]
    If Mod($i,$cols_on_row) = 0 Then $row += 1
    ConsoleWrite("Row " & $row & "  Col " & Mod($i,$cols_on_row) & ': ' & $col & @CRLF)
Next

Output:

Row 1  Col 0: r1c1
Row 1  Col 1: r1c2
Row 2  Col 0: r2c1
Row 2  Col 1: r2c2
Row 3  Col 0: r3c1
Row 3  Col 1: r3c2
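
One possible way to determine $cols_on_row up front, as mentioned above, is to count the cells of the first row only. This is a rough sketch that assumes every row has the same number of <td> cells as the first one; the sample string is shortened for illustration.

```autoit
Local $html = '<tr><td>r1c1</td> <td>r1c2</td></tr>  <tr><td>r2c1</td> <td>r2c2</td></tr>'

; Flag 1 returns the groups of the first match only: the first row's inner HTML.
Local $aFirstRow = StringRegExp($html, '(?is)<tr>(.*?)</tr>', 1)
If @error Then Exit 1

; Count the cells in that first row.
Local $aFirstCells = StringRegExp($aFirstRow[0], '(?is)<td>(.*?)</td>', 3)
Local $cols_on_row = UBound($aFirstCells)
ConsoleWrite("Columns per row: " & $cols_on_row & @CRLF)
```

This costs two extra StringRegExp() calls in total, independent of the number of rows.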

Edited by Zedna

And here is a slightly modified version that separates the code for rows from the code for columns:

$html = '<tr><td>r1c1</td> <td>r1c2</td></tr>  <tr><td>r2c1</td> <td>r2c2</td></tr>  <tr><td>r3c1</td> <td>r3c2</td></tr>'

$cols_on_row = 2 ; known number of columns
$row = 0
$cols = StringRegExp($html, '(?s)(?i)<td>(.*?)</td>', 3)
For $i = 0 to UBound($cols) - 1
    $col = $cols[$i]
    If Mod($i,$cols_on_row) = 0 Then
        $row += 1
        ConsoleWrite("Row " & $row & @CRLF)
    EndIf
    ConsoleWrite("  Col " & Mod($i,$cols_on_row) & ': ' & $col & @CRLF)
Next

Output:

Row 1
  Col 0: r1c1
  Col 1: r1c2
Row 2
  Col 0: r2c1
  Col 1: r2c2
Row 3
  Col 0: r3c1
  Col 1: r3c2
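
The same index arithmetic can also fill a two-dimensional array directly, which was the original goal. A minimal sketch, assuming a known column count as in the snippets above (the variable names are made up for illustration):

```autoit
Local $sHtml = '<tr><td>r1c1</td> <td>r1c2</td></tr>  <tr><td>r2c1</td> <td>r2c2</td></tr>  <tr><td>r3c1</td> <td>r3c2</td></tr>'
Local $iColsPerRow = 2 ; known number of columns (assumption)

Local $aCells = StringRegExp($sHtml, '(?is)<td>(.*?)</td>', 3)
If @error Then Exit 1

; Enough rows to hold every cell, even if the last row is short.
Local $iRows = Ceiling(UBound($aCells) / $iColsPerRow)
Local $aTable[$iRows][$iColsPerRow]

For $i = 0 To UBound($aCells) - 1
    ; Integer division gives the row, the remainder gives the column.
    $aTable[Int($i / $iColsPerRow)][Mod($i, $iColsPerRow)] = $aCells[$i]
Next

; Print the last cell as a quick check.
ConsoleWrite($aTable[$iRows - 1][$iColsPerRow - 1] & @CRLF)
```

The result can be inspected with _ArrayDisplay() from Array.au3 if a visual check is preferred.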

Edited by Zedna

It's very interesting, but you still won't get a 2D array :)
And this is a slightly modified version that handles a dynamic number of columns:

$html = '<tr><td>r1c1</td> <td>r1c2</td></tr>  <tr><td>r2c1</td> <td>r2c2</td></tr>  <tr><td>r3c1</td> <td>r3c2</td></tr><tr><td>r4c1</td> <td>r4c2</td></tr>'

Local $aArray = StringRegExp($html, '(<tr>)*<td>(.*?)</td>', 4)
Local $aMatch = 0
Local $row = -1
Local $nCounter = 0
For $i = 0 To UBound($aArray) - 1
    $aMatch = $aArray[$i]
    If (StringLeft($aMatch[0], 3) = '<tr') Then
        $row += 1
        $nCounter = 0
        ConsoleWrite("Row " & $row & @CRLF)
    EndIf
    $col = $aMatch[2]
    ConsoleWrite("  Col " & $nCounter & ': ' & $col & @CRLF)
    $nCounter += 1
Next

Or better, check for the closing tag instead of the opening tag, so you can forget about id/class/other attributes. It also results in a smaller array.

$html = '<tr><td>r1c1</td> <td>r1c2</td></tr>  <tr><td>r2c1</td> <td>r2c2</td></tr>  <tr><td>r3c1</td> <td>r3c2</td></tr><tr><td>r4c1</td> <td>r4c2</td></tr>'

Local $aArray = StringRegExp($html, '<td>(.*?)</td>(</tr>)*', 4)
Local $aMatch = 0
Local $row = 0
Local $nCounter = 0
ConsoleWrite("Row " & $row & @CRLF)

For $i = 0 To UBound($aArray) - 1
    $aMatch = $aArray[$i]
    $col = $aMatch[1]
    ConsoleWrite("  Col " & $nCounter & ': ' & $col & @CRLF)
    $nCounter += 1

    If (StringRight($aMatch[0], 5) = '</tr>') Then
        $row += 1
        $nCounter = 0
        ; Last ConsoleWrite should be ignored
        ConsoleWrite("Row " & $row & @CRLF)
    EndIf
Next

The output is a bit uglier, but this should perform better.

Better still, use For...In, which should avoid the array copy made when assigning to $aMatch:

$html = '<tr><td>r1c1</td> <td>r1c2</td></tr>  <tr><td>r2c1</td> <td>r2c2</td></tr>  <tr><td>r3c1</td> <td>r3c2</td></tr><tr><td>r4c1</td> <td>r4c2</td></tr>'

Local $aArray = StringRegExp($html, '<td>(.*?)</td>(</tr>)*', 4)
Local $aMatch = 0
Local $row = 0
Local $nCounter = 0
ConsoleWrite("Row " & $row & @CRLF)

If (IsArray($aArray)) Then
    For $aMatch In $aArray
        $col = $aMatch[1]
        ConsoleWrite("  Col " & $nCounter & ': ' & $col & @CRLF)
        $nCounter += 1

        If (StringRight($aMatch[0], 5) = '</tr>') Then
            $row += 1
            $nCounter = 0
            ; Last ConsoleWrite should be ignored
            ConsoleWrite("Row " & $row & @CRLF)
        EndIf
    Next
EndIf

And we don't need flag=4; flag=3 works too, giving a shorter (and maybe faster) version:

$html = '<tr><td>r1c1</td> <td>r1c2</td></tr>  <tr><td>r2c1</td> <td>r2c2</td></tr>  <tr><td>r3c1</td> <td>r3c2</td></tr><tr><td>r4c1</td> <td>r4c2</td></tr>'

; Flag 3 returns a flat array: the cell texts, with '</tr>' after the last cell of each row
Local $aArray = StringRegExp($html, '<td>(.*?)</td>(</tr>)*', 3)
Local $row = 0
Local $col = 0
ConsoleWrite("Row " & $row & @CRLF)

If (IsArray($aArray)) Then
    For $ele In $aArray
        If ($ele <> '</tr>') Then
            ConsoleWrite("  Col " & $col & ': ' & $ele & @CRLF)
            $col += 1
        Else
            $row += 1
            ConsoleWrite("Row " & $row & @CRLF)
            $col = 0
        EndIf
    Next
EndIf
Edited by binhnx

Here is a modified version that also determines the number of columns dynamically from the table header tags <th> </th>:

;~ $html = FileRead('table.html')
$html = '<tr><th>column1</th> <th>column2</th></tr>  <tr><td>r1c1</td> <td>r1c2</td></tr>  <tr><td>r2c1</td> <td>r2c2</td></tr>  <tr><td>r3c1</td> <td>r3c2</td></tr>'

$colnames = StringRegExp($html, '(?s)(?i)<th>(.*?)</th>', 3)
For $j = 0 to UBound($colnames) - 1
    ConsoleWrite("Col name " & $j & ': ' & $colnames[$j] & @CRLF)
Next
ConsoleWrite("Number of columns: " & UBound($colnames) & @CRLF & @CRLF)

$cols_on_row = UBound($colnames)
$row = 0
$cols = StringRegExp($html, '(?s)(?i)<td>(.*?)</td>', 3)

For $i = 0 to UBound($cols) - 1
    $col = $cols[$i]
    If Mod($i,$cols_on_row) = 0 Then
        $row += 1
        ConsoleWrite("Row " & $row & @CRLF)
    EndIf
    ConsoleWrite("  Col " & Mod($i,$cols_on_row) & ': ' & $col & @CRLF)
Next

Output:

Col name 0: column1
Col name 1: column2
Number of columns: 2

Row 1
  Col 0: r1c1
  Col 1: r1c2
Row 2
  Col 0: r2c1
  Col 1: r2c2
Row 3
  Col 0: r3c1
  Col 1: r3c2

This is the final version I will use in my project, because its table includes table header tags.

Anyway, thanks to all for the interesting RegExp ideas; feel free to add more ...  :-)

Edited by Zedna

A last one from me, mixing ideas from the other snippets:

#Include <Array.au3>

$sHtml = '<tr><th>column1</th> <th>column2</th></tr>  <tr><td>r1c1</td> <td>r1c2</td></tr>  <tr><td>r2c1</td> <td>r2c2</td></tr>  <tr><td>r3c1</td> <td>r3c2</td></tr>'

$aRes = StringRegExp($sHtml, "(?isU)(?|<(/)tr>\s*|<t[dh].*>(.*)</t[dh]>)", 3)

Local $aResult[ UBound($aRes) ] [ UBound($aRes) ]
Local $iRow = 0, $iCol = 0, $iMaxRow = 0

For $i = 0 To UBound($aRes) - 1
    If $aRes[$i] = "/" Then
        $iRow += 1
        $iCol = 0
    Else
        $aResult[$iRow][$iCol] = $aRes[$i]
        $iCol += 1
        If $iCol > $iMaxRow Then $iMaxRow = $iCol
    EndIf
Next

Redim $aResult[$iRow][$iMaxRow]

_ArrayDisplay($aResult)