zorphnog

IE DOM

6 posts in this topic

I've been using the _IE functions to attempt to convert some HTML tables into a database. Specifically, I have been using the _IETableWriteToArray function. I have no problems with the implementation of the _IE functions or the DOM interface. My question is more of a general IE DOM question.

Is the DOM that is exposed by an IE object representative of the rendered document, or of the document before rendering? I am led to think that it is the latter because I am running into some issues with badly formed HTML pages. The pages have <span> elements that are incorrectly wrapped around <td> elements, with the result that certain cells are not available through the DOM.




It absolutely exposes the elements in their rendered state.

Dale


Free Internet Tools: DebugBar, AutoIt IE Builder, HTTP UDF, MODIV2, IE Developer Toolbar, IEDocMon, Fiddler, HTML Validator, WGet, curl

MSDN docs: InternetExplorer Object, Document Object, Overviews and Tutorials, DHTML Objects, DHTML Events, WinHttpRequest, XmlHttpRequest, Cross-Frame Scripting, Office object model

Automate input type=file (Related)

Alternative to _IECreateEmbedded? better: _IECreatePseudoEmbedded  Better Better?

IE.au3 issues with Vista - Workarounds

SciTe Debug mode - it's magic: #AutoIt3Wrapper_run_debug_mode=Y

"Doesn't work" needs to be ripped out of the troubleshooting lexicon. It means that what you tried did not produce the results you expected. It begs the questions 1) what did you try?, 2) what did you expect? and 3) what happened instead?

Reproducer: a small (the smallest?) piece of stand-alone code that demonstrates your trouble


OK, so the rendered document should ignore illegal <span> elements, correct? I can't post the exact HTML, but here is an example I derived from the page.

test.htm

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:p="urn:schemas-microsoft-com:office:powerpoint" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml">

<head>
<meta http-equiv="Content-Language" content="en-us" />
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>Management Page</title>
<style type="text/css">
.style11 {
    font-size: xx-small;
    text-align: left;
}


.style1 {
    font-size: xx-small;
}
.style12 {
    color: #800000;
}
.style13 {
    background-color: #C0C0C0;
}
.style14 {
    font-size: x-small;
}
.style6 {
    font-size: x-small;
    text-decoration: none;
    line-height: 20px;
    font-style: normal;
    color: #000000;
    font-weight: normal;
}
.style15 {
    text-align: center;
}
.style2 {
    font-size: x-small;
}
.style17 {
    font-size: x-small;
    font-weight: bold;
}
.style5 {
    font-size: x-small;
}
.style18 {
    font-size: x-small;
    text-align: left;
}
.style20 {
    font-size: x-small;
    font-weight: normal;
}
.style21 {
    font-size: x-small;
    font-family: Tahoma;
}
.style23 {
    font-family: Tahoma;
}
.style4 {
    font-weight: normal;
}
.style9 {
    font-size: x-small;
}
* { /* IE5-6 font declaration */
    _font-size: inherit;
    _font-family: inherit;
    _font-color: inherit;
    _font-weight: inherit;
}
.style24 {
    font-size: x-small;
    text-align: left;
    font-family: Tahoma;
}
.style3 {
    font-size: x-small;
}
.style25 {
                font-size: x-small;
                text-align: center;
}
</style>
<base target="_blank" />
</head>

<body>

<table style="font-family: Tahoma; font-size: 7.5pt; text-align: center;">
    <tr>
        <td colspan="10">
        <p><b><span style="FONT-SIZE: 10pt; COLOR: maroon; FONT-FAMILY: Tahoma">
        2010 Alerts</span></b></p>
        </td>
    </tr>
    <tr style="background-color: silver; font-weight: bold;">
        <td width="94">
        <p><span>Notice<br />
        Number</span></p>
        </td>
        <td width="62">
        <p><span>Release<br />
        Date</span></p>
        </td>
        <td width="65">
        <p><span>Major.Minor<br />
        Revision</span></p>
        </td>
        <td width="60">
        <p><span>Revision<br />
        Date</span></p>
        </td>
        <td style="width: 88px">
        <p><span>CVE</span></p>
        </td>
        <td style="width: 295px">
        <p><span>Title</span></p>
        </td>
        <td width="93">
        <p><span>Status</span></p>
        </td>
        <td style="width: 94px">
        <p><span>A</span></p>
        </td>
        <td style="width: 65px">
        <p><span>B</span></p>
        </td>
        <td width="78" style="width: 97px">
        <p><span>C</span></p>
        </td>
    </tr>
    <tr>
        <td valign="top" width="94" class="style14">
        <font class="inplacedisplayid1siteid0"><span>
        <a target="_blank" href="">
        201041</a></span></font></td>
        <td valign="top" width="62">11 Mar 10</td>
        <td valign="top" width="65">&nbsp;</td>
        <td valign="top" width="62">&nbsp;</td>
        <td valign="top" class="style18" style="width: 88px">
        <a href="http://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2010-0408">
        <span class="style5">CVE-2010-0408</span></a> <br />
        <a href="http://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2010-0425">
        CVE-2010-0425</a><br />
        <a href="http://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2010-0434">
        CVE-2010-0434</a><font size="2"><br />
        </font></td>
        <td valign="top" class="style14" style="width: 295px">
        Multiple Vulnerabilities in Apache httpd</td>
        <span class="style14">
        <td valign="top" width="93" class="style14">Active</td>
        <td valign="top" class="style14" style="width: 94px">
        <font class="inplacedisplayid1siteid0"><span>
        <a target="_blank" href="">
        201041</a></span></font></td>
        </span>
            <td valign="top" style="width: 65px">
                <span class="style14">
                <span>
                <a href="">
                201007</a></span></span></td>
        <span class="style14">
        <td valign="top" width="94" class="style14" style="width: 97px">
        <font class="inplacedisplayid1siteid0"><span>
        <a target="_blank" href="">
        201041</a></span></font></td>
        </span>
    </tr>
    <tr>
        <td valign="top" width="94" class="style14">
        <font class="inplacedisplayid1siteid0"><span>
        <a target="_blank" href="">
        201036</a></span></font></td>
        <td valign="top" width="62">25 Feb 10</td>
        <td valign="top" width="65">1.1</td>
        <td valign="top" width="62">10 Mar 10</td>
        <td valign="top" class="style18" style="width: 88px">
        <font size="2" class="style14">
        <a href="http://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2010-0106">
        CVE-2010-0106</a><br />
        <a href="http://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2010-0107">
        CVE-2010-0107</a><br />
        <a href="http://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2010-0108">
        CVE-2010-0108</a><br />
        </font></td>
        <td valign="top" class="style14" style="width: 295px">
        Multiple Vulnerabilities in Symantec Products</td>
        <span class="style14">
        <td valign="top" width="93" class="style14">Active</td>
        <td valign="top" class="style14" style="width: 94px">
        <font class="inplacedisplayid1siteid0"><span>
        <a target="_blank" href="">
        201036</a></span></font></td>
        </span>
            <td valign="top" style="width: 65px">
                <span class="style14">
                <span>
                <a href="">
                201002</a></span></span></td>
        <span class="style14">
        <td valign="top" width="94" class="style14" style="width: 97px">
        <font class="inplacedisplayid1siteid0"><span>
        <a target="_blank" href="">
        201036</a></span></font></td>
        </span>
    </tr>
    <tr>
        <td valign="top" width="94" class="style14">
        <font class="inplacedisplayid1siteid0"><span>
        <a target="_blank" href="">
        201031</a></span></font></td>
        <td valign="top" width="62"><span class="style14">18 Feb 10</td>
        <span class="style14">
        <td valign="top" width="65">0.1</td>
        <td valign="top" width="62">22 Feb 10</td>
        </span>
        <td valign="top" class="style18" style="width: 88px">
        <font size="2">
        <span style="mso-fareast-font-family: 'Times New Roman'; mso-ansi-language: EN-US; mso-fareast-language: EN-US; mso-bidi-language: AR-SA" class="style21">
        <a href="http://www.cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2010-0020">
        CVE-2010-0020</a><br />
        <a href="http://www.cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2010-0021">
        CVE-2010-0021</a><br />
        <a href="http://www.cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2010-0022">
        CVE-2010-0022</a><br />
        <a href="http://www.cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2010-0231">
        CVE-2010-0231</a></span><br />
        </font></td>
        <td valign="top" class="style14" style="width: 295px">
        Multiple Vulnerabilities in Microsoft SMB Server 
        <span style="mso-bidi-font-weight: normal">
        <span style="mso-fareast-font-family: 'Times New Roman'; mso-ansi-language: EN-US; mso-fareast-language: EN-US; mso-bidi-language: AR-SA">
        (MS10-012)</span></span></td>
        <td valign="top" width="93">Active</td>
        <span class="style14">
        <td valign="top" class="style14" style="width: 94px">
        <font class="inplacedisplayid1siteid0"><span>
        <a target="_blank" href="">
        201031</a></span></font></td>
            <td valign="top" style="width: 65px">
                <span>
                <a href="">
                201010</a></span></td>
        <td valign="top" width="94" class="style14" style="width: 97px">
        <font class="inplacedisplayid1siteid0"><span>
        <a target="_blank" href="">
        201031</a></span></font></td>
        </span>
    </tr>
    <tr>
        <td valign="top" width="94" class="style14">
        <font class="inplacedisplayid1siteid0"><span>
        <a target="_blank" href="">
        201030</a></span></font></td>
        <td valign="top" width="62"><span class="style14">18 Feb 10</td>
        <span class="style14">
        <td valign="top" width="65">0.1</td>
        <td valign="top" width="62">22 Feb 10</td>
        </span>
        <td valign="top" class="style18" style="width: 88px">
        <span style="mso-fareast-font-family: 'Times New Roman'; mso-ansi-language: EN-US; mso-fareast-language: EN-US; mso-bidi-language: AR-SA" class="style23">
        <a href="http://www.cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2010-0239">
        <span style="COLOR: blue">
        CVE-2010-0239</span></a><br />
        <a href="http://www.cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2010-0240">
        CVE-2010-0240</a><br />
        <a href="http://www.cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2010-0241">
        CVE-2010-0241</a><br />
        <a href="http://www.cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2010-0242">
        CVE-2010-0242</a></span></td>
        <td valign="top" class="style14" style="width: 295px">
        Multiple Vulnerabilities in Microsoft Windows TCP/IP 
        <span style="mso-bidi-font-weight: normal">
        <span style="mso-fareast-font-family: 'Times New Roman'; mso-ansi-language: EN-US; mso-fareast-language: EN-US; mso-bidi-language: AR-SA">
        (MS10-009)</span></span></td>
        <td valign="top" width="93">Active</td>
        <span class="style14">
        <td valign="top" class="style14" style="width: 94px">
        <font class="inplacedisplayid1siteid0"><span>
        <a target="_blank" href="">
        201030</a></span></font></td>
            <td valign="top" style="width: 65px">
                <span>
                <a href="">
                201007</a></span></td>
        <td valign="top" width="94" class="style14" style="width: 97px">
        <font class="inplacedisplayid1siteid0"><span>
        <a target="_blank" href="">
        201030</a></span></font></td>
        </span>
    </tr>
    </table>

</body>

</html>

#include <Array.au3>
#include <IE.au3>

; Load the malformed page and write it into a new IE instance
$sHtml = FileRead(@ScriptDir & "\test.htm")
$oIE = _IECreate()
_IEDocWriteHTML($oIE, $sHtml)
; Dump the first table as the DOM exposes it (True = transpose the output array)
$oTable = _IETableGetCollection($oIE, 0)
$aTable = _IETableWriteToArray($oTable, True)
_ArrayDisplay($aTable)

; Now with <span> tags removed
; \b keeps the pattern from matching other tag names that merely start with "span"
$sHtml = StringRegExpReplace($sHtml, "(?i)</?span\b[^>]*>", "")
_IEDocWriteHTML($oIE, $sHtml)
$oTable = _IETableGetCollection($oIE, 0)
$aTable = _IETableWriteToArray($oTable, True)
_ArrayDisplay($aTable)


If the DOM is confused by malformed HTML, it's a crapshoot what you will get. Use DebugBar or _IEDocReadHTML to see what the DOM sees.
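
A minimal sketch of that check, assuming the test.htm file from the earlier post sits next to the script. The per-row cell count is an illustration added here, not something from the original posts; it relies only on the standard rows, cells, rowIndex and length properties of the table object.

```autoit
#include <IE.au3>

; Assumption: test.htm is the malformed sample page, stored beside this script
Local $sHtml = FileRead(@ScriptDir & "\test.htm")
Local $oIE = _IECreate()
_IEDocWriteHTML($oIE, $sHtml)

; Serialize the parsed document back to HTML: this is the tree IE actually built
Local $sDomHtml = _IEDocReadHTML($oIE)
ConsoleWrite($sDomHtml & @CRLF)

; Count the cells the DOM exposes for each row of the first table
Local $oTable = _IETableGetCollection($oIE, 0)
For $oRow In $oTable.rows
    ConsoleWrite("Row " & $oRow.rowIndex & ": " & $oRow.cells.length & " cells" & @CRLF)
Next
```

Comparing the printed markup with the source file shows where the parser moved or dropped elements, and a row reporting fewer than ten cells points at the rows affected by the misplaced <span> tags.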

Dale



The DOM sees the malformed HTML. It isn't until rendering time that the incorrect <span> tags are ignored. I don't mess around with DOM that often so this was more of a learning question for me. I was just under the impression that the DOM was an exact representation of what is seen in the browser (rendered), but it seems there are still some validation steps that take place before the DOM is drawn in the browser window.

I'm using _IEDocReadHTML, applying my SRE to strip the <span> tags, and _IEDocWriteHTML. This works fine. The DOM just didn't represent what I thought it did. Thanks for the replies though.
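
The round trip described above might look roughly like this. It is a sketch, not the exact script used: it loads the test.htm sample instead of the real page, and the pattern is a slightly tightened variant of the one posted earlier.

```autoit
#include <Array.au3>
#include <IE.au3>

; Assumption: the sample page stands in for the real (unpublished) page
Local $oIE = _IECreate()
_IEDocWriteHTML($oIE, FileRead(@ScriptDir & "\test.htm"))

; Read the document HTML back, strip every opening and closing <span> tag
Local $sHtml = _IEDocReadHTML($oIE)
$sHtml = StringRegExpReplace($sHtml, "(?i)</?span\b[^>]*>", "")

; Write the cleaned markup into the same browser and re-read the table
_IEDocWriteHTML($oIE, $sHtml)
Local $oTable = _IETableGetCollection($oIE, 0)
Local $aTable = _IETableWriteToArray($oTable, True)
_ArrayDisplay($aTable)
```

Stripping every <span> also discards any styling those tags carried, which is acceptable here because only the cell text is needed for the database.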


The raw HTML gets rendered into a DOM document that is hosted and displayed in the browser. There is no DOM until the document is rendered.

Dale

