
Get web page contents -NOT source code- without using the clipboard

Cybergon

Hello everyone, the title is pretty much what I want to do, but I'll elaborate below.

I'm currently using a script to ease my workload. It uses StringRegExp to extract certain bits of info from a text file that I make, and then builds a neat record of whatever it found that I can easily check later. The only manual labor I'm forced to do is open a web page in my browser, press Ctrl+A and Ctrl+C, paste the text into Notepad, replace every double tab character with a semicolon and save the text file. Unfortunately, sometimes things get too hectic and I don't get the chance to sit down and sort that out, and while I'm busy somewhere else those precious bits of info just go to waste.

That's why I've decided to make the script completely autonomous, but after about two hours of searching through the forums, I've yet to find a definite solution for two reasons:

- I can't rely on _InetGetSource because it would be more than a nightmare to figure out a reliable regular expression to get my info out of the page's source code; it would be impossible, considering how complex and dynamic it is, whereas the text displayed in the browser is perfectly manageable.

- The PC is always in use, even when I'm not around and especially at hectic times, so automatically opening the browser and driving the mouse and keyboard to fetch the displayed text into the clipboard would be a very unreliable approach (I tried it once, with BlockInput and everything; never again).

If you could help me, or point me to a thread that covers this problem, I'd be forever in your debt.

Thanks in advance!

Edited by Cybergon

TormentoRobots

You need an HTML scraper:

write it yourself or use someone else's.

There are lots of ways to scrape text from HTML.

Cybergon wrote: "it would be more than a nightmare to figure out a reliable regular expression to get my info out of the page's source code; it would be impossible, considering how complex and dynamic it is, whereas the displayed text on the browser is perfectly manageable."

WHO wrote the browser? Don't try to lie to us, it's possible!

Edited by TormentoRobots

Cybergon

Thanks TR, I'll take the web scraper approach then. I didn't know about that, wish me luck!

TormentoRobots wrote: "WHO wrote the browser? Don't try to lie to us, it's possible!"

If it were a simple page there'd be no problem, but in this one there's a lot of JavaScript magic going on, and the info is scattered across div tags that hide some content and show other content. It's pretty random, and the tags use a lot of variables, which makes things worse. I wish I could explain it better, but it is frankly beyond me, sorry.

kylomas

Cybergon wrote: "The only manual labor that I'm forced to do is open a web page in my browser, press ctrl+a ctrl+c, paste it in notepad, replace every double tab character with a semicolon and save the text file."

If you do not want to go the screen scraper route, the sequence you describe can be automated.

There is also a rich complement of _IE functions for interacting with web pages.

What is the URL of the page?

kylomas


Forum Rules         Procedure for posting code

"I like pigs.  Dogs look up to us.  Cats look down on us.  Pigs treat us as equals."

- Sir Winston Churchill

Cybergon

kylomas wrote: "If you do not want to go the screen scraper route, the sequence you describe can be automated. There is also a rich complement of _IE functions for interacting with web pages."

I've looked at all of the _IE functions thoroughly, but none seem to be of any use in my particular situation. Are there any you'd recommend?

kylomas wrote: "What is the URL of the page?"

The particular one I'm interested in is only accessible from certain authorized places (as far as I understand) so there's no point, sorry.

kylomas

Without the URL or an example of the HTML I would only be guessing.


Cybergon

kylomas wrote: "Without the URL or an example of the HTML I would only be guessing."

I'm telling you to forget about the source code. Trust me, there's no way to get what I want without opening an IE window, and I already gave you the reasons why that is impractical. Unless there's a way to hide it and do everything behind the scenes with _IEAction or something?

kylomas

NP, good luck...


DaleHohm

_IEBodyReadText() appears to be what you are looking for. Read about it in the help file.

If this needs to be done without disrupting a user on the PC, use the $f_visible flag in _IECreate to prevent the browser from displaying (and ensure you call _IEQuit when you are done).

Dale
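
Dale's suggestion can be sketched roughly as follows. This is a minimal, untested sketch rather than a solution for the original page: the URL and output file name are placeholders, and it assumes the text returned by _IEBodyReadText contains the same double-tab separators as the text copied from the browser by hand.

```autoit
#include <IE.au3>

; Placeholder URL (assumption): replace with the page you actually need.
Local $sURL = "http://www.example.com/"
; Placeholder output path (assumption).
Local $sOutFile = @ScriptDir & "\page_text.txt"

; _IECreate(url, try-attach, visible, wait): visible = 0 keeps the window hidden,
; wait = 1 blocks until the page has finished loading.
Local $oIE = _IECreate($sURL, 0, 0, 1)
If @error Then Exit 1

; Read the displayed text of the page body, not its HTML source.
Local $sText = _IEBodyReadText($oIE)
Local $iReadError = @error

; Always close the hidden browser so invisible IE processes do not pile up.
_IEQuit($oIE)
If $iReadError Then Exit 2

; Same step as the manual workflow: turn every double tab into a semicolon.
$sText = StringReplace($sText, @TAB & @TAB, ";")

; Mode 2 = open for writing and erase any previous contents.
Local $hFile = FileOpen($sOutFile, 2)
If $hFile = -1 Then Exit 3
FileWrite($hFile, $sText)
FileClose($hFile)
```

If the page fills in its content with JavaScript after the initial load, a short Sleep before the _IEBodyReadText call may be needed; how long depends on the page.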


Robjong

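; Print the rendered text of a page to the console.
ConsoleWrite(_INetGetText('http://autoitscript.com') & @LF)

; Returns the text of the page at $sURL.
; Sets @error to 1 if the download fails, 2 if the HTMLFILE object cannot be created.
Func _INetGetText($sURL)
    ; 19 = 1 (force reload) + 2 (ignore SSL errors) + 16 (bypass forcing the connection online)
    Local $bStr = InetRead($sURL, 19)
    If @error Then
        Return SetError(1, 0, 0)
    EndIf

    ; Parse the downloaded HTML in an in-memory document, without opening a browser window
    Local $oHTML = ObjCreate("HTMLFILE")
    If @error Then Return SetError(2, 0, 0)

    $oHTML.Open()
    $oHTML.Write(BinaryToString($bStr))

    ; $oHTML.... 

    Return SetError(0, 0, $oHTML.Body.InnerText)
EndFunc ;==>_INetGetText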

Maybe this will help you get started...

AndreaTS

I simply use:

Send("^a^c")
$text = ClipGet()
