
RegExp - effective HTML table parsing (rows,columns)


Zedna

I'm parsing an HTML file that contains a <table>. I need to go through the rows and columns of the table, ideally to get a two-dimensional array.
I do it with two simple levels of StringRegExp() calls, one for rows and one for columns:

;~ $html = FileRead('table.html')
$html = '<tr><td>r1c1</td> <td>r1c2</td></tr>  <tr><td>r2c1</td> <td>r2c2</td></tr>  <tr><td>r3c1</td> <td>r3c2</td></tr>'

$rows = StringRegExp($html, '(?s)(?i)<tr>(.*?)</tr>', 3)
For $i = 0 to UBound($rows) - 1
    $row = $rows[$i]
    ConsoleWrite("Row " & $i & ': ' & $row & @CRLF)

    $cols = StringRegExp($row, '(?s)(?i)<td>(.*?)</td>', 3)
    For $j = 0 to UBound($cols) - 1
        $col = $cols[$j]
        ConsoleWrite("  Col " & $j & ': ' & $col & @CRLF)
    Next
Next

Output:

Row 0:  <td>r1c1</td> <td>r1c2</td>
  Col 0: r1c1
  Col 1: r1c2
Row 1:  <td>r2c1</td> <td>r2c2</td>
  Col 0: r2c1
  Col 1: r2c2
Row 2:  <td>r3c1</td> <td>r3c2</td>
  Col 0: r3c1
  Col 1: r3c2

 

 

In my example StringRegExp() is called once for each row of the table, which is inefficient for many rows.

It works fine, but my question is whether there is a better and more efficient approach, maybe a single clever RegExp pattern?

Or maybe using StringRegExp() with option=4? I'm not experienced with this option (array in array), and the example in the help file is not very clear to me, so I don't know whether option=4 can also be used for HTML table parsing.

Edited by Zedna

czardas

I believe you can get rid of the question mark in the group (.*?) [but only if you use  (?U) at the start of your pattern :whistle:]. Your approach is the same as I would use. I've never used option=4. It would be nice to see more examples.

Edited by czardas

jguinch

Well, I'm not really sure, but here is a starting point:

#include <Array.au3>


$html = '<tr><td>r1c1</td> <td>r1c2</td></tr>  <tr><td>r2c1</td> <td>r2c2</td></tr>  <tr><td>r3c1</td> <td>r3c2</td></tr>'


$aRes = StringRegExp($html, "(?is)<tr.*?>.*?(?:<td.*?>(.*?)<\/td>\s*)(?=(?:<td.*?>(.*?)<\/td>)?).*?<\/tr>", 4)

For $i = 0 To UBound($aRes) - 1
    If IsArray($aRes[$i]) Then
        $tab = $aRes[$i]
        _ArrayDisplay($tab)
    EndIf
Next


_ArrayDisplay($aRes)

I don't know why there is an empty last result...

Edit: I use <td.*?> to match this kind of tag: <td id='toto' ....> (so beyond your example)

Edited by jguinch

czardas
Edit: I use <td.*?> to match this kind of tag: <td id='toto' ....> (so beyond your example)

 

Nice example. Also silly me. The question mark is needed if you don't use (?U) at the start of the pattern. I tend to do that a lot which is why I thought the question mark wasn't needed. Sorry for the misinformation. :doh:
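
To make the (?U) point concrete, here is a minimal sketch (the variable names are illustrative, not taken from the posts above). It runs the same row through a lazy pattern, an ungreedy (?U) pattern and a plain greedy pattern, and prints how many cells each one finds:

```autoit
; Illustrative sketch: lazy quantifier vs. (?U) vs. plain greedy matching.
Local $sRow = '<td>r1c1</td> <td>r1c2</td>'

Local $aLazy = StringRegExp($sRow, '(?is)<td>(.*?)</td>', 3)     ; lazy quantifier
Local $aUngreedy = StringRegExp($sRow, '(?isU)<td>(.*)</td>', 3) ; (?U) inverts greediness
Local $aGreedy = StringRegExp($sRow, '(?is)<td>(.*)</td>', 3)    ; greedy: runs to the last </td>

ConsoleWrite('Lazy:     ' & UBound($aLazy) & ' cells' & @CRLF)
ConsoleWrite('Ungreedy: ' & UBound($aUngreedy) & ' cells' & @CRLF)
ConsoleWrite('Greedy:   ' & UBound($aGreedy) & ' cell: ' & $aGreedy[0] & @CRLF)
```

The first two patterns should each return two cells, while the greedy one returns a single element that swallows everything between the first <td> and the last </td>.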

Edited by czardas

mikell

AFAIK there is no way to get a 2D array from a single regex :)

I personally use something like this

#include <Array.au3>

;~ $html = FileRead('table.html')
$html = '<tr><td>r1c1</td> <td>r1c2</td> </tr>  <tr><td>r2c1   </td> <td>r2c2</td><td>r2c3</td></tr>  <tr><td></td><td>r3c2</td> <td>   r3c3</td></tr>  <tr><td></td><td>r4c2</td> <td> </td></tr>'

$rows = StringRegExp($html, '(?is)<tr>(.*?)</tr>', 3)
; Oversized buffer (assumes at most 100 columns per row); trimmed by ReDim below
Local $a[UBound($rows)][100], $icol = 0

For $i = 0 to UBound($rows) - 1
    $cols = StringRegExp($rows[$i], '(?is)<td>(.*?)</td>', 3)
    $icol = ($icol > UBound($cols)) ? $icol : UBound($cols)
    For $j = 0 to UBound($cols) - 1
        $a[$i][$j] = StringStripWS($cols[$j], 3)
    Next
Next
Redim $a[UBound($rows)][$icol]
_ArrayDisplay($a)

czardas

I couldn't get option=4 to work. :( Here's a slightly different approach.

#include <Array.au3>

Local $html = '<tr><td>r1c1</td> <td>r1c2</td></tr>  <tr><td>r2c1</td> <td>r2c2</td></tr>  <tr><td>r3c1</td> <td>r3c2</td></tr>'
Local $aRes = StringRegExp($html, "(?isU)<tr>|(?:<td>)(.*)(?:</td>)", 3)
_ArrayDisplay($aRes)
Edited by czardas

binhnx

Every option returns a one-dimensional array, so you cannot get a 2D array unless you build it yourself :(

Option 4 is not very interesting. Think of it like this: the pattern can be broken into parts (the capturing groups). With option 3, you get only the parts. With option 4, each match is a nested array: the whole matched pattern is returned at element 0, and the parts are returned at the following indexes.

I think the snippet you wrote is the best we can do. If performance really matters, you can use a C HTML/XML parsing library through DllCall().
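
As a rough illustration of what option 4 returns, here is a minimal sketch (the variable names are illustrative). Each element of the outer array is itself an array holding the whole match at index 0 and the first capturing group at index 1:

```autoit
; Illustrative sketch of flag 4: one nested array per match.
Local $sHtml = '<tr><td>r1c1</td> <td>r1c2</td></tr>'
Local $aMatches = StringRegExp($sHtml, '(?is)<td>(.*?)</td>', 4)
Local $aMatch

If IsArray($aMatches) Then
    For $i = 0 To UBound($aMatches) - 1
        $aMatch = $aMatches[$i] ; [0] = whole match, [1] = first capturing group
        ConsoleWrite('Match ' & $i & ': full = ' & $aMatch[0] & ', group 1 = ' & $aMatch[1] & @CRLF)
    Next
EndIf
```

The nested array has to be copied into a variable before indexing, which is why $aMatch is assigned first.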


99 little bugs in the code

99 little bugs!

Take one down, patch it around

117 little bugs in the code!

jguinch

@czardas: very nice!

I'm surprised that <tr> is captured, but AutoIt returns the whole match if no parentheses are used, so here the whole match is <tr> :geek:

BTW, I think the non-capturing groups are unnecessary. This also works: (?isU)<tr>|<td>(.*)</td>

For compatibility with regex101.com, for example, use (?|(<tr>)|<td>(.*)</td>)

Edited by jguinch

mikell

Not surprising at all, as it captures both sides of the alternation.

It's more noticeable if you try this: (?isU)<tr|<td>(.*)</td>

:)

Edited by mikell

czardas

Nice examples jguinch and mikell. I think you're always learning new things with regexp - I know I am. :)

Zedna

...

I think the snippet you wrote is the best we can do. If performance is too important, you can use a C HTML/XML parsing library and DllCall().

 

Here is another parsing snippet optimized for speed, with only one StringRegExp() call.

It's based on the premise that the number of columns is known.

If needed, the number of columns ($cols_on_row) can be determined with StringRegExp() before the main For/Next loop.

Rows don't need to be parsed with StringRegExp(); instead, the row can be calculated with the Mod() function.

$html = '<tr><td>r1c1</td> <td>r1c2</td></tr>  <tr><td>r2c1</td> <td>r2c2</td></tr>  <tr><td>r3c1</td> <td>r3c2</td></tr>'

$cols_on_row = 2 ; known number of columns
$row = 0
$cols = StringRegExp($html, '(?s)(?i)<td>(.*?)</td>', 3)
For $i = 0 to UBound($cols) - 1
    $col = $cols[$i]
    If Mod($i,$cols_on_row) = 0 Then $row += 1
    ConsoleWrite("Row " & $row & "  Col " & Mod($i,$cols_on_row) & ': ' & $col & @CRLF)
Next

Output:

 

Row 1  Col 0: r1c1
Row 1  Col 1: r1c2
Row 2  Col 0: r2c1
Row 2  Col 1: r2c2
Row 3  Col 0: r3c1
Row 3  Col 1: r3c2

Edited by Zedna

Zedna

And here is a slightly modified version that separates the code for rows from the code for columns:

$html = '<tr><td>r1c1</td> <td>r1c2</td></tr>  <tr><td>r2c1</td> <td>r2c2</td></tr>  <tr><td>r3c1</td> <td>r3c2</td></tr>'

$cols_on_row = 2 ; known number of columns
$row = 0
$cols = StringRegExp($html, '(?s)(?i)<td>(.*?)</td>', 3)
For $i = 0 to UBound($cols) - 1
    $col = $cols[$i]
    If Mod($i,$cols_on_row) = 0 Then
        $row += 1
        ConsoleWrite("Row " & $row & @CRLF)
    EndIf
    ConsoleWrite("  Col " & Mod($i,$cols_on_row) & ': ' & $col & @CRLF)
Next

Output:

Row 1
  Col 0: r1c1
  Col 1: r1c2
Row 2
  Col 0: r2c1
  Col 1: r2c2
Row 3
  Col 0: r3c1
  Col 1: r3c2
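
The same Mod() idea can also fill a real 2D array. The following is only a sketch under the same assumption of a fixed, known column count; the names $iColsPerRow and $aTable are illustrative:

```autoit
; Illustrative sketch: fill a 2D array from the flat cell list, assuming a fixed column count.
Local $sHtml = '<tr><td>r1c1</td> <td>r1c2</td></tr>  <tr><td>r2c1</td> <td>r2c2</td></tr>  <tr><td>r3c1</td> <td>r3c2</td></tr>'
Local $iColsPerRow = 2 ; assumed known, as in the snippets above
Local $aCells = StringRegExp($sHtml, '(?is)<td>(.*?)</td>', 3)

If IsArray($aCells) Then
    Local $iRows = Ceiling(UBound($aCells) / $iColsPerRow)
    Local $aTable[$iRows][$iColsPerRow]
    For $i = 0 To UBound($aCells) - 1
        ; Int() gives the row index, Mod() gives the column index
        $aTable[Int($i / $iColsPerRow)][Mod($i, $iColsPerRow)] = $aCells[$i]
    Next
    For $r = 0 To $iRows - 1
        ConsoleWrite('Row ' & $r & ': ' & $aTable[$r][0] & ' | ' & $aTable[$r][1] & @CRLF)
    Next
EndIf
```

This breaks down if any row has a different number of cells, so it only suits tables with a regular shape.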

Edited by Zedna

binhnx

It's very interesting, but you still won't get a 2D array :)
And here is a slightly modified version with a dynamic number of columns:

$html = '<tr><td>r1c1</td> <td>r1c2</td></tr>  <tr><td>r2c1</td> <td>r2c2</td></tr>  <tr><td>r3c1</td> <td>r3c2</td></tr><tr><td>r4c1</td> <td>r4c2</td></tr>'

Local $aArray = StringRegExp($html, '(<tr>)*<td>(.*?)</td>', 4)
Local $aMatch = 0
Local $row = -1
Local $nCounter = 0
For $i = 0 To UBound($aArray) - 1
    $aMatch = $aArray[$i]
    If (StringLeft($aMatch[0], 3) = '<tr') Then
        $row += 1
        $nCounter = 0
        ConsoleWrite("Row " & $row & @CRLF)
    EndIf
    $col = $aMatch[2]
    ConsoleWrite("  Col " & $nCounter & ': ' & $col & @CRLF)
    $nCounter += 1
Next

Or better, check for the closing tag instead of the opening tag, and forget about the id/class/attributes... It also results in a smaller array.

$html = '<tr><td>r1c1</td> <td>r1c2</td></tr>  <tr><td>r2c1</td> <td>r2c2</td></tr>  <tr><td>r3c1</td> <td>r3c2</td></tr><tr><td>r4c1</td> <td>r4c2</td></tr>'

Local $aArray = StringRegExp($html, '<td>(.*?)</td>(</tr>)*', 4)
Local $aMatch = 0
Local $row = 0
Local $nCounter = 0
ConsoleWrite("Row " & $row & @CRLF)

For $i = 0 To UBound($aArray) - 1
    $aMatch = $aArray[$i]
    $col = $aMatch[1]
    ConsoleWrite("  Col " & $nCounter & ': ' & $col & @CRLF)
    $nCounter += 1

    If (StringRight($aMatch[0], 5) = '</tr>') Then
        $row += 1
        $nCounter = 0
        ; Last ConsoleWrite should be ignored
        ConsoleWrite("Row " & $row & @CRLF)
    EndIf
Next

The output is uglier, but it performs better.

Or better still, use For...In to eliminate the array copy made when assigning $aMatch:

$html = '<tr><td>r1c1</td> <td>r1c2</td></tr>  <tr><td>r2c1</td> <td>r2c2</td></tr>  <tr><td>r3c1</td> <td>r3c2</td></tr><tr><td>r4c1</td> <td>r4c2</td></tr>'

Local $aArray = StringRegExp($html, '<td>(.*?)</td>(</tr>)*', 4)
Local $aMatch = 0
Local $row = 0
Local $nCounter = 0
ConsoleWrite("Row " & $row & @CRLF)

If (IsArray($aArray)) Then
    For $aMatch In $aArray
        $col = $aMatch[1]
        ConsoleWrite("  Col " & $nCounter & ': ' & $col & @CRLF)
        $nCounter += 1

        If (StringRight($aMatch[0], 5) = '</tr>') Then
            $row += 1
            $nCounter = 0
            ; Last ConsoleWrite should be ignored
            ConsoleWrite("Row " & $row & @CRLF)
        EndIf
    Next
EndIf

And we don't need flag=4; flag=3 works too, giving a shorter (and maybe faster) version:

$html = '<tr><td>r1c1</td> <td>r1c2</td></tr>  <tr><td>r2c1</td> <td>r2c2</td></tr>  <tr><td>r3c1</td> <td>r3c2</td></tr><tr><td>r4c1</td> <td>r4c2</td></tr>'

Local $aArray = StringRegExp($html, '<td>(.*?)</td>(</tr>)*', 3)
Local $aMatch = 0
Local $row = 0
Local $col = 0
ConsoleWrite("Row " & $row & @CRLF)

If (IsArray($aArray)) Then
    For $ele In $aArray
        If ($ele <> '</tr>') Then
            ConsoleWrite("  Col " & $col & ': ' & $ele & @CRLF)
            $col += 1
        Else
            $row += 1
            ConsoleWrite("Row " & $row & @CRLF)
            $col = 0
        EndIf
    Next
EndIf
Edited by binhnx

Zedna

Here is a modified version that also determines the number of columns dynamically, using the table header tags <th> </th>:

;~ $html = FileRead('table.html')
$html = '<tr><th>column1</th> <th>column2</th></tr>  <tr><td>r1c1</td> <td>r1c2</td></tr>  <tr><td>r2c1</td> <td>r2c2</td></tr>  <tr><td>r3c1</td> <td>r3c2</td></tr>'

$colnames = StringRegExp($html, '(?s)(?i)<th>(.*?)</th>', 3)
For $j = 0 to UBound($colnames) - 1
    ConsoleWrite("Col name " & $j & ': ' & $colnames[$j] & @CRLF)
Next
ConsoleWrite("Number of columns: " & UBound($colnames) & @CRLF & @CRLF)

$cols_on_row = UBound($colnames)
$row = 0
$cols = StringRegExp($html, '(?s)(?i)<td>(.*?)</td>', 3)

For $i = 0 to UBound($cols) - 1
    $col = $cols[$i]
    If Mod($i,$cols_on_row) = 0 Then
        $row += 1
        ConsoleWrite("Row " & $row & @CRLF)
    EndIf
    ConsoleWrite("  Col " & Mod($i,$cols_on_row) & ': ' & $col & @CRLF)
Next

Output:

Col name 0: column1
Col name 1: column2
Number of columns: 2

Row 1
  Col 0: r1c1
  Col 1: r1c2
Row 2
  Col 0: r2c1
  Col 1: r2c2
Row 3
  Col 0: r3c1
  Col 1: r3c2

 

 

This is the final version, which I will use in my project, because the table there includes table header tags.

Anyway, thanks to all for the interesting RegExp ideas. Feel free to add more...  :-)

Edited by Zedna

jguinch

One last version from me, mixing ideas from the other code:

#Include <Array.au3>

$sHtml = '<tr><th>column1</th> <th>column2</th></tr>  <tr><td>r1c1</td> <td>r1c2</td></tr>  <tr><td>r2c1</td> <td>r2c2</td></tr>  <tr><td>r3c1</td> <td>r3c2</td></tr>'

$aRes = StringRegExp($sHtml, "(?isU)(?|<(/)tr>\s*|<t[dh].*>(.*)</t[dh]>)", 3)

Local $aResult[ UBound($aRes) ] [ UBound($aRes) ]
Local $iRow = 0, $iCol = 0, $iMaxRow = 0

For $i = 0 To UBound($aRes) - 1
    If $aRes[$i] = "/" Then
        $iRow += 1
        $iCol = 0
    Else
        $aResult[$iRow][$iCol] = $aRes[$i]
        $iCol += 1
        If $iCol > $iMaxRow Then $iMaxRow = $iCol
    EndIf
Next

Redim $aResult[$iRow][$iMaxRow]

_ArrayDisplay($aResult)

