Problem using regexp for parsing <div> tags


Hi, I made a regular expression long ago to extract div tags from a string:

(?i)(?s)<div whatever >.*?(?(?=<div).*?</div>)</div>

What it does is extract everything from the first <div whatever > up to its closing </div>, and if it finds other <div tags in between, it searches up to the closing </div> of each of them. This worked well in a couple of scripts I had, but it isn't working in this one.

Here is the problem isolated (I can post the full HTML code if necessary):

<div class="entry clearfix" >asd<div c> 
<div cl</div>
<div clas</div>
<div cot</div>
<div clas</div><div cls</div>   
</div> <!-- end .entry-content --><div clix">
                    
</div>
</div> <!-- end .entry -->  
 
 
<div class="entry clearfix" >asd<div c> 
<div cl</div>
<div clas</div>
<div cot</div>
<div clas</div><div cls</div>   
</div> <!-- end .entry-content --><div clix">
                    
</div>
</div> <!-- end .entry -->

I used the PCRE toolkit to test the regular expression with flag 3. It should return an array with two cells with:

<div class="entry clearfix" >asd<div c>

<div cl</div>

<div clas</div>

<div cot</div>

<div clas</div><div cls</div>

</div> <!-- end .entry-content --><div clix">

</div>

</div>

in each of them, but it returns an array with a single cell containing:

<div class="entry clearfix" >asd<div c>

<div cl</div>

It's as if it ignores the condition (?(?=<div).*?</div>) and just matches up to the first closing </div>. I also tried making it greedy, but then it captures both blocks in one cell of the array. I'm stuck on this and would appreciate any help. Thanks!

Edited by Mithrandir

This seems pretty complicated to me, but you could try a different approach. The following code finds the positions of the opening and closing div tags in a string. From there you could perhaps figure something out. I imagine GOESoft will laugh, but I'm not sure how to fix the RegExp.

#include <Array.au3>
 
$sTest = "<div>data_A</div><div><div>data_B</div></div>"
 
Dim $iNumDiv = 1
While StringInStr($sTest, "<div", 0, $iNumDiv)
    $iNumDiv += 1
WEnd
$iNumDiv -=1 ; Since we started with a value of 1 before searching.
 
Dim $iNumEndDiv = 1
While StringInStr($sTest, "</div>", 0, $iNumEndDiv)
    $iNumEndDiv += 1
WEnd
$iNumEndDiv -=1
 
Dim $iBound = $iNumDiv + $iNumEndDiv ; This should be an even number
 
If $iBound = 0 Then
    MsgBox(0, "Error", "The string contains no div tags")
    Exit ; To avoid errors with the next part of the script
EndIf
 
Dim $aArray[$iBound][2], $iCount = 0
 
For $i = 1 To $iNumDiv
    $aArray[$iCount][0] = "<div whatever>"
    $aArray[$iCount][1] = StringInStr($sTest, "<div", 0, $i) ; Same search string as the counting loop, so tags with attributes are found too
    $iCount += 1
Next
 
For $i = 1 To $iNumEndDiv
    $aArray[$iCount][0] = "</div>"
    $aArray[$iCount][1] = StringInStr($sTest, "</div>", 0, $i)
    $iCount += 1
Next
 
_ArraySort($aArray, 0, 0, 0, 1) ; Sort the div tags in the order they appear
_ArrayDisplay($aArray)

Edit

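The RegExp you have will never work anyway, since the opening and closing tags need to be paired. Nothing in the pattern counts how many div tags are open, and the lazy .*? can always skip past a nested <div and stop at the first </div> it reaches. I believe a more sophisticated parsing method is required.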
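
One way the pairing could be done is with a nesting-depth counter instead of a RegExp. The following is only a rough sketch: _ExtractTopLevelDivs is a made-up helper name (it is not a UDF and does not appear elsewhere in this thread), and it assumes well-formed markup where every tag beginning with "<div" is a real div and every closing tag is written exactly as "</div>".

```autoit
#include <Array.au3>

; Hypothetical helper: returns the top-level <div ...>...</div> blocks of $sHtml.
; Element [0] of the returned array holds the number of blocks found.
Func _ExtractTopLevelDivs($sHtml)
    Local $aBlocks[1] = [0]
    Local $iPos = 1, $iDepth = 0, $iStart = 0, $iOpen, $iClose
    While True
        ; Next opening and closing tag at or after the current position
        $iOpen = StringInStr($sHtml, "<div", 0, 1, $iPos)
        $iClose = StringInStr($sHtml, "</div>", 0, 1, $iPos)
        If $iClose = 0 Then ExitLoop ; no closing tag left
        If $iOpen > 0 And $iOpen < $iClose Then
            ; An opening tag comes first: remember where a top-level block starts
            If $iDepth = 0 Then $iStart = $iOpen
            $iDepth += 1
            $iPos = $iOpen + 4 ; skip past "<div"
        Else
            ; A closing tag comes first: the block is complete when depth returns to 0
            If $iDepth > 0 Then
                $iDepth -= 1
                If $iDepth = 0 Then
                    $aBlocks[0] += 1
                    ReDim $aBlocks[$aBlocks[0] + 1]
                    $aBlocks[$aBlocks[0]] = StringMid($sHtml, $iStart, $iClose + 6 - $iStart)
                EndIf
            EndIf
            $iPos = $iClose + 6 ; skip past "</div>"
        EndIf
    WEnd
    Return $aBlocks
EndFunc   ;==>_ExtractTopLevelDivs

Local $aResult = _ExtractTopLevelDivs("<div>data_A</div><div><div>data_B</div></div>")
_ArrayDisplay($aResult)
```

With the test string above this should give two blocks, the second one containing the nested div. It will not cope with broken markup such as an opening tag that is never closed with ">".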

Edited by czardas

  • 9 months later...

Using the HTML DOM, you can call x.getElementsByTagName('DIV'), which gets all elements with the specified tag name.

That returns a collection of all the DIV elements, which you can loop through to get the .text of each one (the inner text of the node).
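
A minimal sketch of that idea using the IE.au3 UDF, assuming Internet Explorer is available on the machine. The sample markup is made up for illustration, and "innertext" is the property name that _IEPropertyGet accepts for the node's inner text.

```autoit
#include <IE.au3>

; Assumed sample markup, not taken from the thread
Local $sHtml = "<html><body><div>data_A</div><div><div>data_B</div></div></body></html>"

; Load the markup into a hidden Internet Explorer instance
Local $oIE = _IECreate("about:blank", 0, 0) ; $iTryAttach = 0, $iVisible = 0
_IEDocWriteHTML($oIE, $sHtml)

; Collection of every DIV element in the document, nested ones included
Local $oDivs = _IETagNameGetCollection($oIE, "div")
For $oDiv In $oDivs
    ConsoleWrite(_IEPropertyGet($oDiv, "innertext") & @CRLF)
Next

_IEQuit($oIE)
```

Because the collection includes nested divs, the text of an inner div is reported once for itself and again as part of its parent. Passing "outerhtml" instead of "innertext" should return each div together with its tags.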

