'Html Scraping'


"HTML" is too short a term to search for, so I was kind of forced to ask this question because I couldn't search for it :).

Is there any way to get the HTML code BEFORE client-side processing? Any browser's "View Source" shows only the HTML after the client has processed it. I need it before :/.

Any help is greatly appreciated - I really do not know where to start.

You always get the unedited HTML with any function. What are you trying to do?

I know that, but what I need is the HTML as it is BEFORE the client processes it. I believe JavaScript does this: the page has an original value before the client runs the script, and the processed result is what you see on your screen. I want that value BEFORE my browser processes it.

You need to tap into the actual data transmission. For example, you could create an AutoIt HTTP proxy and point your browser at it (though I'm not sure there's a fully working AutoIt proxy out there somewhere :)). For Firefox you could also use the Greasemonkey plugin.
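A rough sketch of the proxy idea, written in Python only because it is easy to run (it is not an AutoIt proxy). It assumes plain HTTP and GET requests only, with no HTTPS and no error handling. `LoggingProxy`, `DemoOrigin`, `serve`, and `demo` are hypothetical names made up for this illustration. The proxy keeps a copy of each response body exactly as the server sent it, before any browser has touched it.

```python
import http.client
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

captured = {}  # URL -> response body exactly as the origin server sent it


class LoggingProxy(BaseHTTPRequestHandler):
    """Forward plain-HTTP GET requests and keep a copy of every body."""

    def do_GET(self):
        # A client that talks to an HTTP proxy puts the absolute URL in the
        # request line, so self.path is the full target URL.
        direct = urllib.request.build_opener(urllib.request.ProxyHandler({}))
        with direct.open(self.path, timeout=5) as upstream:
            body = upstream.read()
            content_type = upstream.headers.get("Content-Type", "text/html")
        captured[self.path] = body  # the pre-browser HTML lands here
        self.send_response(200)
        self.send_header("Content-Type", content_type)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # silence per-request logging


class DemoOrigin(BaseHTTPRequestHandler):
    """Hypothetical stand-in for the real web site."""

    PAGE = b"<html><script>document.write('hi');</script></html>"

    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(self.PAGE)))
        self.end_headers()
        self.wfile.write(self.PAGE)

    def log_message(self, *args):
        pass


def serve(handler):
    """Start a server on a free localhost port in a background thread."""
    server = HTTPServer(("127.0.0.1", 0), handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def demo():
    origin, proxy = serve(DemoOrigin), serve(LoggingProxy)
    url = "http://127.0.0.1:%d/page" % origin.server_port
    try:
        # Play the browser's role: send the absolute URL to the proxy.
        conn = http.client.HTTPConnection("127.0.0.1", proxy.server_port, timeout=5)
        conn.request("GET", url)
        delivered = conn.getresponse().read()
        conn.close()
        return delivered, captured[url]
    finally:
        for server in (origin, proxy):
            server.shutdown()
            server.server_close()
```

In real use you would drop `DemoOrigin` and `demo`, start `LoggingProxy` on a fixed port, and set that port as the browser's HTTP proxy; `captured` then holds the untouched HTML for each page visited.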

If you're talking about server side scripting then it can't be done. That's kind of how it works. When you make a request for data, the server will transmit the contents of the file over the Internet connection. If it comes across any pre-processing tags it will do that processing before transmitting the results.

Client-side code such as JavaScript is executed in the browser before you get to see the result on your screen. Your web browser reads the information passed to it and hands each task to whatever is configured to handle it (such as the JavaScript engine). By the time you see the page in your web browser, the script has already been run for you.

So, in recap, you cannot get PHP or ASP source without downloading the files directly from the host folder using FTP.
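A minimal Python sketch of the two stages described above, assuming a throwaway local server. The page contents and the names `TEMPLATE`, `render_page`, `fetch_raw`, and `demo` are hypothetical, and Python is used only for illustration. The server fills in its template before sending, so the client never sees the template itself. The `<script>` arrives as plain text, because a bare HTTP fetch has no JavaScript engine to run it.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical server-side template: the {greeting} placeholder plays the role
# of a PHP/ASP tag and is filled in before anything is transmitted.
TEMPLATE = (
    "<html><body><p>{greeting}</p>"
    "<script>document.write('<b>added by the client</b>');</script>"
    "</body></html>"
)


def render_page():
    """Server-side processing: produce the HTML that goes over the wire."""
    return TEMPLATE.format(greeting="Hello from the server")


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = render_page().encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet


def fetch_raw(url):
    """Fetch the response body as sent; no script is ever executed here."""
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(url, timeout=5) as response:
        return response.read().decode("utf-8")


def demo():
    server = HTTPServer(("127.0.0.1", 0), Handler)  # port 0: pick a free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        return fetch_raw("http://127.0.0.1:%d/" % server.server_port)
    finally:
        server.shutdown()
        server.server_close()
```

Calling `demo()` returns the HTML with the greeting already substituted and the `<script>` element still unexecuted. A browser would go one step further and run that script.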


How does 'FireBug' do it then? (Plugin for Firefox). It allows me to click ANY element on my screen in the FireBug editor and it shows me the element HTML - BEFORE processing.

What about the Inet*() functions?

I am aware of these functions, but I don't think they get the source code before client processing; they return the same result as 'View Source' would.

"How does 'FireBug' do it then? (Plugin for Firefox). It allows me to click ANY element on my screen in the FireBug editor and it shows me the element HTML - BEFORE processing."

I don't understand that. Firebug needs to be on the webpage, so the page has already been processed, unless I'm missing something.

That's exactly what it does though, it's weird.

You didn't miss anything - Firebug literally tells me the element's HTML BEFORE client processing, but the page is already loaded on my screen. I would assume that it stores the pre-processed HTML of each webpage loaded while Firebug is activated, so that when you go and use it, it just recalls that code.

If you use "Show Page Source" in Firefox you get the original HTML that was sent to FF without any modifications. Firebug displays the current state of the DOM after the JavaScript has been executed.

If you could give an example page and the desired result from that page it might help.

Firebug might do something like the following (unfounded speculation ahead!):

It reads any references to files in the header of the page (CSS, JS, etc.). Then, when you inspect an element in the page, it checks whether any of the scripts interact with that element and goes from there.

If the page does not reference a script in the header, only the results are seen. I believe this comes down to the client side only.

Edit:

PS: You can search these forums using google like this

Edited by Tvern

This sounds like Google Chrome's element inspector. All it's doing is breaking down how the browser goes about making a webpage come alive on your screen. If you want to see that, then just rip the site or open the webpage source.

You will never see the pre-processing code unless you have the source code. All <?php echo "PHP Elements"; ?> tags will be stripped and processed as required before transmission.

IE's View Source and INet* both show you the source BEFORE client processing (INet* should be obvious, because there is no client rendering engine involved). The _IE* functions (like _IEDocReadHtml) all work on HTML AFTER client processing. Tools like DebugBar for IE and other DOM inspectors allow you to see HTML either before OR after client processing. Also, don't confuse server-side processing with this discussion... you have no control over that from the client end.

Dale

This has nothing at all to do with what I'm doing, but I cannot disclose the exact information that I am working on (client confidentiality).

On this signup page, I found a similar (if not exact) scenario:

http://bigthink.com/signup

If you View Source in FF or IE, you will see something similar to this code (the link will probably vary, but you'll see this line in the source):

<iframe src="http://api.recaptcha.net/noscript?k=6Lc38QQAAAAAAF0mh4k2rLRBKh1L0YNNYN8e0PLe" height="300" width="500" frameborder="0"></iframe><br/>

However, in Firebug, if I use its 'Trace' feature and click on the image, it returns this to me:

<img width="300" height="57" src="http://www.google.com/recaptcha/api/image?c=03AHJ_VutXlBZb-Mwf2z-KYfvNYUSwGv3R2L0OE0oU2kyMtHQwrB3PTEOgIQV49xOylnlg8c-RWMUZos83pwt31D5SN-2E5JVWVB9uJIDTu-7x_tlptIbSnkpaAgwmyx3ARvPIbdrWK4oWAfgc0d_HWSYfBwiJX90c6g" style="display: block;">

THAT is the code I need... not the javascript stuff in the 'View Source' example.

Any ideas? :/
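To make the difference concrete, here is a small Python sketch that parses the two snippets above. Python is used only for illustration, `SrcCollector` and `collect_src` are hypothetical helper names, and the long query strings are shortened with "..." for readability. The raw source only contains the `<iframe>`. The `<img>` presumably exists only in the DOM after the page's script has run, so anything that works on the raw source will never find it.

```python
from html.parser import HTMLParser

# The two snippets quoted in the post, with the query strings shortened.
VIEW_SOURCE = (
    '<iframe src="http://api.recaptcha.net/noscript?k=..." '
    'height="300" width="500" frameborder="0"></iframe><br/>'
)
DOM_SNAPSHOT = (
    '<img width="300" height="57" '
    'src="http://www.google.com/recaptcha/api/image?c=..." '
    'style="display: block;">'
)


class SrcCollector(HTMLParser):
    """Collect (tag, src) pairs for every element that has a src attribute."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        src = dict(attrs).get("src")
        if src is not None:
            self.found.append((tag, src))


def collect_src(html):
    """Return the (tag, src) pairs found in an HTML string."""
    parser = SrcCollector()
    parser.feed(html)
    parser.close()
    return parser.found
```

Running `collect_src` on `VIEW_SOURCE` yields only the `iframe` entry, while `DOM_SNAPSHOT` yields the `img` entry. To get the image URL, you therefore need the HTML of the live DOM rather than the raw response.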

Guess the contents of the javascript if you'd follow the link...

Edit:

havascript. What's that supposed to be.

P.s. If the purpose of this is getting past a CAPTCHA, I wouldn't bother.

Edited by Tvern

What do you mean, guess the contents?

And no, CAPTCHA has nothing to do with what I'm doing, but it runs on the same kind of system that I am trying to read.

You keep banging on about Firebug this and that, so let me see if I can put it in simple terms for you.

Firebug is using data from the page it is on; the Firefox browser has rendered the page, and the code on it has offered up its secrets.

Firebug is not doing the wizardry you seem to think it is.

What you are seeing is the product of Firebug following instructions from code on a page that has been successfully identified as the Firefox browser.

It is NOT the source before the page is loaded; it is one step after the source HAS been loaded.

So what you are asking for makes no sense: you want browser-rendered information before the browser renders it, but it must be before the page is loaded? And that wouldn't have the CAPTCHA image location anyway, even if dark matter had engulfed it.

What's going on here is that you are flogging a dead horse.

Notice, I had ASSUMED that Firebug was reading the code before the client processed it... I really don't have a clue HOW Firebug does it; that's why I made this topic.

Whatever it is, I need to know HOW to do it, I don't care how I get there or what comes first or whatever.
