IE view source differs from .innerHTML


Hey Guys,

I'm having problems when using the incredible IE.au3 library. The _IEBodyReadHTML function returns different text from what I see when I manually view source in IE. I know it is looking at the right window because that's the only window I have up, and the text is similar; it's just that certain parts are coded out.

Was wondering if anyone has encountered this problem before. I'm not too sure if it's an encryption method the website is using though.


Check out the differences between Body and Doc.

IE Dev Toolbar, MSDN: InternetExplorer Object, MSDN: HTML/DHTML Reference Guide
"It is surprising what a man can do when he has to, and how little most men will do when they don't have to." - Walter Linn
Post a reproducer with less than 100 lines of code.

Dale can explain this better, but I will attempt it myself. When using _IEBodyReadHTML() you get the "generated source", which can differ from the real source due to JavaScript and such. When working with the IE.au3 library you will want to go with the "generated source" for referencing the DOM.


In addition to what Mike said... what you see with the View Source command in IE is a snapshot of the page as it was loaded. It is not necessarily what the page contains once it comes to rest, and if scripted changes are made to the page (either by something external like AutoIt or internal like JavaScript), View Source is not updated to show them.

This is one of the really nice features of the _IE routines that they allow you to see and operate on the dynamic HTML.
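As a rough illustration of operating on the dynamic HTML, the sketch below reads the current text of one element from the live DOM. The URL and the element id "foo" are assumptions made up for the example; substitute a page and id of your own.

```autoit
#include <IE.au3>

; Hypothetical URL and element id, used only for illustration.
Local $sUrl = "http://localhost/tmp.htm"

; _IECreate waits for the page load to complete by default.
Local $oIE = _IECreate($sUrl)

; Look the element up in the live DOM, i.e. after any page scripts have run.
Local $oDiv = _IEGetObjById($oIE, "foo")
If @error Then
    ConsoleWrite("Element not found, @error = " & @error & @CRLF)
Else
    ; "innertext" reflects the current (possibly script-modified) content.
    ConsoleWrite(_IEPropertyGet($oDiv, "innertext") & @CRLF)
EndIf

_IEQuit($oIE)
```

If a script on the page has rewritten the element, this should print the rewritten text rather than what View Source shows.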

Dale

Edit: and if you put all three of our answers together, you get one really good one! Pretty cool to see competition to answer IE.au3 questions...

Edited by DaleHohm

Free Internet Tools: DebugBar, AutoIt IE Builder, HTTP UDF, MODIV2, IE Developer Toolbar, IEDocMon, Fiddler, HTML Validator, WGet, curl

MSDN docs: InternetExplorer Object, Document Object, Overviews and Tutorials, DHTML Objects, DHTML Events, WinHttpRequest, XmlHttpRequest, Cross-Frame Scripting, Office object model

Automate input type=file (Related)

Alternative to _IECreateEmbedded? better: _IECreatePseudoEmbedded  Better Better?

IE.au3 issues with Vista - Workarounds

SciTe Debug mode - it's magic: #AutoIt3Wrapper_run_debug_mode=Y

Doesn't work needs to be ripped out of the troubleshooting lexicon. It means that what you tried did not produce the results you expected. It begs the questions 1) what did you try?, 2) what did you expect? and 3) what happened instead?

Reproducer: a small (the smallest?) piece of stand-alone code that demonstrates your trouble


Both you guys are way better at explaining these things.


Wow, Gee. Thanks guys, those were fast replies...

Thus, to get the 'real source', I would need to use ____ ?

I believe the most complete source comes from _IEDocReadHTML. _IEBodyReadHTML doesn't return anything outside the <body> tags. _IEDocReadHTML will return the full source, and in some cases be more complete than using View Source.



I believe _INetGetSource ( $s_URL ) would get the 'real source'.

"So man has sown the wind and reaped the world. Perhaps in the next few hours there will no remembrance of the past and no hope for the future that might have been." & _"All the works of man will be consumed in the great fire after which he was created." & _"And if there is a future for man, insensitive as he is, proud and defiant in his pursuit of power, let him resolve to live it lovingly, for he knows well how to do so." & _"Then he may say once more, 'Truly the light is sweet, and what a pleasant thing it is for the eyes to see the sun.'" - The Day the Earth Caught Fire


Hmm... Just tried it, but it didn't work: _IEDocReadHTML didn't give me what I wanted. I think the website is using some script to change the code after it's loaded, because I still see everything I want in View Source.

I'll try _INetGetSource ( $s_URL ) next. Gotta find the library first.


Hmm.... Interesting... INetGetSource works! But I can't exactly navigate from page to page based on the result I get from there. It's late here, will figure something out tomorrow. Thanks for the help guys.

Ok, glad I could help. Keep us updated on your progress. We are always glad to help. :P

I don't know what is meant by "real" source. _IEDocReadHTML will give you the rendered HTML source including the <HEAD> section and scripts. _IEBodyReadHTML returns the rendered HTML inside the <BODY></BODY> tags. _INetGetSource will return the original, unrendered HTML source of the page.

Let me demonstrate.

Start with this HTML file, tmp.htm, on a server:

<html><body>
This text will change:
    <div id="foo">ORIGINAL TEXT</div>
<script language="javascript">
    document.getElementById ("foo").innerHTML = "THIS IS DYNAMIC TEXT";
</script>
</body></html>

As you will see, the page initially contains the words "ORIGINAL TEXT", but as soon as the page is displayed, the JavaScript changes this to "THIS IS DYNAMIC TEXT".

We'll use this script that shows the output of _INetGetSource and the _IE functions:

#include <Inet.au3>
#include <IE.au3>

; Original source as delivered by the server (no page scripts are run)
ConsoleWrite("***** _INetGetSource *****" & @CR & _INetGetSource("http://localhost/tmp.htm") & @CR & @CR)

; Load the same page in IE, then read it back after the page's scripts have run
$oIE = _IECreate("http://localhost/tmp.htm")
ConsoleWrite("***** _IEBodyReadText *****" & @CR & _IEBodyReadText($oIE) & @CR & @CR)
ConsoleWrite("***** _IEDocReadHTML *****" & @CR & _IEDocReadHTML($oIE) & @CR & @CR)
ConsoleWrite("***** _IEBodyReadHTML *****" & @CR & _IEBodyReadHTML($oIE) & @CR & @CR)

You'll see that _INetGetSource displays the HTML just as it is stored on the server - this is what you see with the browser's View Source command. [EDIT: to be precise, it does not necessarily display the file as stored on the server, but rather the output after any server-side processing is performed - so if it is an ASP file or a CGI script, you get the HTML output generated by the server's processing of those files.]

All of the _IE* commands display the rendered document AFTER it has been updated by the Javascript.

One isn't right and the other wrong; they are just different and used for different purposes.

***** _INetGetSource *****
<html>
<body>

This text will change: <div id="foo">ORIGINAL TEXT</div>

<script language="javascript">
    document.getElementById ("foo").innerHTML = "THIS IS DYNAMIC TEXT";
</script>

</body>
</html>

***** _IEBodyReadText *****
This text will change: 
THIS IS DYNAMIC TEXT

***** _IEDocReadHTML *****
<HTML><HEAD></HEAD>
<BODY>This text will change: 
<DIV id=foo>THIS IS DYNAMIC TEXT</DIV>
<script language=javascript>
    document.getElementById ("foo").innerHTML = "THIS IS DYNAMIC TEXT";
</SCRIPT>
</BODY></HTML>

***** _IEBodyReadHTML *****
This text will change: 
<DIV id=foo>THIS IS DYNAMIC TEXT</DIV>
<script language=javascript>
    document.getElementById ("foo").innerHTML = "THIS IS DYNAMIC TEXT";
</SCRIPT>
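
The same comparison can be made in code rather than by eye. The sketch below assumes the tmp.htm test page from this post is reachable at http://localhost/tmp.htm; the _Report helper is a hypothetical name introduced only for this example. It checks whether the div still holds its original text in each version of the source.

```autoit
#include <Inet.au3>
#include <IE.au3>

Local $sUrl = "http://localhost/tmp.htm" ; the test page from this post

; HTML as delivered by the server, before any script has run.
Local $sOriginal = _INetGetSource($sUrl)

; HTML as IE holds it after the page's scripts have run.
Local $oIE = _IECreate($sUrl)
Local $sRendered = _IEDocReadHTML($oIE)

_Report("_INetGetSource", $sOriginal)
_Report("_IEDocReadHTML", $sRendered)

_IEQuit($oIE)

; Hypothetical helper: reports whether the div content is still the original text.
; The ">" and "<" around the text restrict the match to element content.
Func _Report($sLabel, $sHtml)
    Local $bFound = StringInStr($sHtml, ">ORIGINAL TEXT<") > 0
    ConsoleWrite($sLabel & ": original div text present = " & $bFound & @CRLF)
EndFunc
```

Going by the output listed above, the original text should be found in the _INetGetSource result and not in the _IEDocReadHTML result.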

Dale

Edited by DaleHohm


Very thorough Dale....I learned something new today.


Alright guys. Thanks for the explanations so far. But after a busy week I'm back at it, and I should have explained myself better. What Dale said about not being a 'right' and 'wrong' version makes sense.

What I want is actually the _INetGetSource version (I basically want the same text as View Source), not the _IEDocReadHTML or _IEBodyReadHTML versions. But I believe that using _IE to open the page and then _INetGetSource to retrieve the unedited source would mean two separate calls to the server. I was wondering if there is any way around this?

An alternative I was thinking of is to retrieve the View Source text directly from IE. This can 'easily' be done by writing a script that opens View Source, reads the code from the new window, and then closes that window. I'm personally trying to avoid this method.

So if anyone knows a better way around this, I'd deeply appreciate it. Thanks.


I don't know of any way to do this other than the methods already mentioned.

Dale

Edit: typo

Edited by DaleHohm


HAH!

Thanks for the replies guys, thought I should post my solution!

I was thinking about it and wondered: what if I "caught" the code before the page made any changes? So what I did was this:

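; Third parameter 0: return immediately instead of waiting for the page to finish loading
_IENavigate($o_IE, "http://www.somewhere.com", 0)
; Wait until the document is at least interactive (readyState 3).
; Testing "< 3" rather than "<> 3" avoids looping forever if the page
; reaches readyState 4 (complete) between two polls.
While $o_IE.readyState < 3
    Sleep(100)
WEnd
$text = _IEDocReadHTML($o_IE)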
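
A variant of the polling loop above with a timeout, so the script cannot hang if the page never becomes interactive. This is only a sketch: the 10 second limit and the 50 ms poll interval are arbitrary assumptions, and whether the HTML is captured before the page's scripts change it still depends on timing.

```autoit
#include <IE.au3>

Local $oIE = _IECreate("about:blank")

; Third parameter 0: return immediately instead of waiting for the load to complete.
_IENavigate($oIE, "http://www.somewhere.com", 0)

Local $iTimeoutMs = 10000 ; assumed upper limit for the wait
Local $hTimer = TimerInit()

; readyState values: 3 = interactive, 4 = complete.
; Stop polling once the document is at least interactive, or when the timeout expires.
While $oIE.readyState < 3 And TimerDiff($hTimer) < $iTimeoutMs
    Sleep(50)
WEnd

Local $sHtml = _IEDocReadHTML($oIE)
ConsoleWrite($sHtml & @CRLF)
```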

Edited by patlim4152
