Get web page contents -NOT source code- without using the clipboard

Cybergon · January 20, 2013

Hello everyone, the title is pretty much what I want to do but I'll elaborate below for details.

I'm currently using a script to ease my workload. It uses StringRegExp to extract certain bits of info from a text file that I make, and then builds a neat record of whatever it found that I can easily check later. The only manual labor I'm forced to do is open a web page in my browser, press Ctrl+A and Ctrl+C, paste the result into Notepad, replace every double tab character with a semicolon, and save the text file. Unfortunately, sometimes things get too hectic and I don't get the chance to sit down and sort that out, and while I'm busy somewhere else those precious bits of info just go to waste.

That's why I've decided to make the script completely autonomous, but after about two hours of searching through the forums, I've yet to find a definite solution for two reasons:

- I can't rely on _InetGetSource because it would be more than a nightmare to figure out a reliable regular expression to get my info out of the page's source code; it would be impossible, considering how complex and dynamic it is, whereas the text displayed in the browser is perfectly manageable.

- The PC is always being used, even when I'm not around and especially at hectic times, so automatically opening the browser and making the mouse and keyboard fetch the displayed text into the clipboard would be a very unreliable approach (I tried it once, with BlockInput and everything; never again).

If you could please help me or point me to the right thread where they can help me with my problem, I'd be forever in your debt.

Thanks in advance!

Edited January 20, 2013 by Cybergon

TormentoRobots · January 20, 2013

You need an HTML scraper:

Write it yourself or use someone else's.

There are lots of ways to scrape text from HTML.

Cybergon wrote: "it would be more than a nightmare to figure out a reliable regular expression to get my info out of the page's source code; it would be impossible, considering how complex and dynamic it is, whereas the displayed text on the browser is perfectly manageable."

WHO wrote the browser? Don't try to lie to us, it's possible!

Edited January 20, 2013 by TormentoRobots

Cybergon · January 20, 2013

Thanks TR, I'll be taking the web scraper approach then, I didn't know about that, wish me luck!

TormentoRobots wrote: "WHO wrote the browser? Don't try to lie to us, its possible!"

If it was a simple page there'd be no problem, but in this one there's a lot of javascript magic going on and the info is all scattered in div tags that hide some stuff and show some other. It's pretty random and uses a lot of variables in the tags to make things worse. I wish I could explain it better, but it is frankly beyond me, sorry.

kylomas · January 20, 2013

Cybergon wrote: "The only manual labor that I'm forced to do is open a web page in my browser, press ctrl+a ctrl+c, paste it in notepad, replace every double tab character with a semicolon and save the text file."

If you do not want to go the screen scraper route, the sequence you describe can be automated.

There is also a rich complement of _IE functions for interacting with web pages.

What is the URL of the page?

kylomas

Cybergon · January 20, 2013

kylomas wrote: "If you do not want to go the screen scraper route, the sequence you describe can be automated. There is also a rich complement of _IE functions for interacting with web pages."

I've looked at all of the _IE functions thoroughly, but none seem to be of any use in my particular situation. Is there one you'd recommend?

kylomas wrote: "What is the URL of the page?"

The particular one I'm interested in is only accessible from certain authorized places (as far as I understand) so there's no point, sorry.

kylomas · January 20, 2013

Without the URL or an example of the HTML I would only be guessing.

Cybergon · January 20, 2013

kylomas wrote: "Without the URL or an example of the HTML I would only be guessing."

I'm telling you to forget about the source code. Trust me, there's no way to get what I want without opening an IE window, and I already gave you the reasons why that is impractical. Unless there's a way to hide it and do everything behind the scenes with _IEAction or something?

kylomas · January 20, 2013

NP, good luck...

DaleHohm · January 21, 2013

_IEBodyReadText() appears to be what you are looking for. Read about it in the help file.

If this needs to be done without disrupting a user on the PC, use the $f_visible flag in _IECreate to prevent the browser from displaying (and ensure you call _IEQuit when you are done).

Dale
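A minimal sketch of the approach Dale describes, assuming the page can be loaded by Internet Explorer on the machine running the script. The URL below is only a placeholder (the real page was never posted), and the exit codes are arbitrary choices for illustration:

```autoit
#include <IE.au3>

; Placeholder URL for illustration only; replace it with the real page.
Local $sURL = "http://www.autoitscript.com"

; _IECreate parameters: URL, try-attach = 0, visible = 0 (hidden window), wait = 1 (block until loaded)
Local $oIE = _IECreate($sURL, 0, 0, 1)
If @error Then
    ConsoleWrite("Could not create the hidden browser, @error = " & @error & @CRLF)
    Exit 1
EndIf

; Read the rendered text of the page body, i.e. the displayed text rather than the HTML source
Local $sText = _IEBodyReadText($oIE)
Local $iReadError = @error ; save @error before the next call resets it

; Always close the hidden instance, otherwise iexplore.exe keeps running unseen
_IEQuit($oIE)

If $iReadError Then
    ConsoleWrite("Could not read the page text, @error = " & $iReadError & @CRLF)
    Exit 2
EndIf

ConsoleWrite($sText & @CRLF)
```

Because the window is never shown and no keystrokes or clipboard operations are involved, this should not interfere with whoever is using the PC. If the page fills in its content with JavaScript after loading, an extra wait before reading the text may be needed.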

Robjong · January 21, 2013

ConsoleWrite(_INetGetText('http://autoitscript.com') & @LF)

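Func _INetGetText($sURL)
    ; Download the page as binary data.
    ; 19 = 1 (force reload) + 2 (ignore SSL errors) + 16 (do not force the connection online)
    Local $bStr = InetRead($sURL, 19)
    If @error Then
        Return SetError(1, 0, 0)
    EndIf

    ; Create an in-memory HTML document to parse the markup
    Local $oHTML = ObjCreate("HTMLFILE")
    If @error Then Return SetError(2, 0, 0)

    $oHTML.Open()
    $oHTML.Write(BinaryToString($bStr))
    $oHTML.Close() ; finish the write stream so the document is fully parsed

    ; $oHTML....

    ; Return only the text of the body, without the tags
    Return SetError(0, 0, $oHTML.Body.InnerText)
EndFunc ;==>_INetGetText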

Maybe this will help you get started...
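Once the text is available as a string, the remaining manual steps from the first post (replace every double tab with a semicolon, then save to a text file) could be sketched like this. The function name _SavePageText, the sample string, and the output path are hypothetical, and whether text obtained this way contains the same double tabs as a clipboard copy is an assumption that would need checking against the real page:

```autoit
; Hypothetical helper: mimics the manual Notepad step and saves the result.
Func _SavePageText($sText, $sFile)
    ; Replace every pair of consecutive tab characters with a semicolon
    Local $sOut = StringReplace($sText, @TAB & @TAB, ";")

    ; Mode 2 = open for writing and erase any previous contents
    Local $hFile = FileOpen($sFile, 2)
    If $hFile = -1 Then Return SetError(1, 0, False)

    ; FileWrite returns 1 on success and 0 on failure
    Local $iOk = FileWrite($hFile, $sOut)
    FileClose($hFile)
    If Not $iOk Then Return SetError(2, 0, False)

    Return True
EndFunc   ;==>_SavePageText

; Usage with a made-up sample string and an output file next to the script
Local $sSample = "Name" & @TAB & @TAB & "Value"
If Not _SavePageText($sSample, @ScriptDir & "\page.txt") Then
    ConsoleWrite("Saving failed, @error = " & @error & @CRLF)
EndIf
```

The saved file can then be fed to the existing StringRegExp script unchanged, or the string can be passed to it directly and the file step skipped.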

AndreaTS · July 8, 2014

I simply use:

Send("^a^c")
$text = ClipGet()
