
Capturing text from Internet Explorer pages



Hi all,

Can anyone guide me on how to capture all of the text displayed on an IE browser page? I have set "WinDetectHiddenText" to 1, but it still does not capture the whole detail section of the page.

Any help will be highly appreciated.

Thank you

Sharief Pareed

Intel Corporation


Hello Sharief Pareed,

In which programming language are you going to try this? As you posted your question in the ActiveX forum of AutoIt, I assume you are using an Object-aware programming language? In that case you don't have to use AutoItX. Just open the "winhttp.winhttprequest.5" object and you are 'in business'.

A short example (yes, this one is in AutoIt script, not AutoItX):

$httpObj = ObjCreate("winhttp.winhttprequest.5")
$httpObj.open("GET","http://your-url-here.com/etc/etc")
$httpObj.send()
$HTMLSource = $httpObj.Responsetext

The variable $HTMLSource contains the complete ASCII source code of the given HTML page.

More info about this Object can be found on: http://msdn.microsoft.com/library/en-us/wi...httprequest.asp

Hope this helps.

Regards,

-Sven


The winhttp.winhttprequest.5 object doesn't seem to work for me. Which I find odd, seeing as according to the manual, "the winhttp.winhttprequest.5 object only exist on computers that have at least Internet Explorer version 5 installed." And I'm running IE6. But whenever I try that code I get "Variable must be of type "Object"."

Ah well.


Weird indeed. Maybe it's only present on Win2000/XP/2003 computers?

You could check your registry to see whether this object is present.

It is located directly under HKEY_CLASSES_ROOT; search for the key starting with WinHttp.

Maybe on your computer it has a lower or higher version, e.g. "WinHttp.WinHttpRequest.5.1".

Regards,

-Sven
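
A minimal AutoIt sketch of that suggestion, assuming one of the two ProgIDs mentioned in this thread is registered on the machine. It checks the result of ObjCreate() with IsObj() before using it, which is one way to avoid the 'Variable must be of type "Object"' error. The URL is only a placeholder.

```autoit
; Sketch only: try the versioned ProgID first, then the one from the earlier post.
Local $aProgIDs[2] = ["WinHttp.WinHttpRequest.5.1", "WinHttp.WinHttpRequest.5"]
Local $oHttp = 0

For $i = 0 To UBound($aProgIDs) - 1
    $oHttp = ObjCreate($aProgIDs[$i])
    If IsObj($oHttp) Then ExitLoop
Next

If Not IsObj($oHttp) Then
    MsgBox(16, "WinHttp", "No WinHttpRequest object could be created.")
    Exit
EndIf

; Synchronous GET, then report how much source came back.
$oHttp.Open("GET", "http://www.autoitscript.com/", False)
$oHttp.Send()
MsgBox(0, "Source length", StringLen($oHttp.ResponseText))
```

If neither ProgID can be created, the registry check described above is still the way to find out which name, if any, is present.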


  • 4 months later...

Here is the code to download the source of a web page (VBScript):

With CreateObject("MSXML2.XMLHTTP")
    .open "GET", "http://finance.yahoo.com", False
    .send
    t = .responseText
    MsgBox t
End With

Here is a full script to download the source and save it to a file. (You have to create temp.htm in the same directory as the script before trying.)

CODE

'-------By Fredledingue------
'--------set constants--------
Const ForReading = 1, ForWriting = 2, ForAppending = 3
Const TristateUseDefault = -2, TristateTrue = -1, TristateFalse = 0
Const OPEN_FILE_FOR_APPENDING = 8
Set fso = CreateObject("Scripting.FileSystemObject")
'---------------------
theURL = "http://www.autoitscript.com"

With CreateObject("MSXML2.XMLHTTP")
    .open "GET", theURL, False   ' False = synchronous request
    .send
    t = .responseText
    't = Replace(Replace(t, Chr(10) , "_"), "_", VbCrlf) & VbCrlf
End With

' Put a line break after every ">" so the file is easier to edit
t = Replace(t, ">", ">" & VbCrlf)
MsgBox Len(t) & VbCrlf & t

' temp.htm must already exist: OpenTextFile is called without the create flag
Set objOutputFile = fso.OpenTextFile("temp.htm", ForWriting)

' Write one character at a time; with On Error Resume Next, a character
' that cannot be written is skipped instead of aborting the script.
' As in the original loop, the last two characters are not written.
On Error Resume Next
For i = 1 To Len(t) - 2
    objOutputFile.Write Mid(t, i, 1)
Next
objOutputFile.Close
MsgBox "ok"

Here is partial code to convert this HTML source to plain text (set f to the temp.htm file):

CODE

'------by Fredledingue--------
' fso and the constants are the ones defined in the previous script
Set f = fso.GetFile("temp.htm")
Set ts = f.OpenAsTextStream(ForReading, TristateFalse)
Do Until ts.AtEndOfStream
    t = ts.ReadLine
    '-------lowercase------
    t = Replace(Replace(t,"&nbsp;","")," }","")
    t = Replace(t,"</tr>",VbCrlf)
    t = Replace(t,"</br>",VbCrlf)
    t = Replace(t,"<br>",VbCrlf)
    t = Replace(t,"</dd>",VbCrlf)
    t = Replace(t,"<p", VbCrlf & VbCrlf & VbTab & "<")
    t = Replace(t,"<table", VbCrlf & VbCrlf & "<")
    t = Replace(t,"</table>", VbCrlf & VbCrlf)
    t = Replace(t,"<title>","Page Title: ")
    t = Replace(t,"javascript>","")
    t = Replace(t,"<style>","<")
    t = Replace(t,"</style>",">")
    t = Replace(t,"&quot;","""")
    t = Replace(t,"&amp;","&")
    t = Replace(t,"•","*")
    t = Replace(t,"—","--")
    t = Replace(t,"World&quot;","""")
    t = Replace(t,"<!-- saved from url=","Saved from url: ")
    t = Replace(t,"onclick=","<")
    '-------uppercase------
    t = Replace(Replace(t,"&NBSP;","")," }","")
    t = Replace(t,"</TR>",VbCrlf)
    t = Replace(t,"</BR>",VbCrlf)
    t = Replace(t,"<BR>",VbCrlf)
    t = Replace(t,"</DD>",VbCrlf)
    t = Replace(t,"<P", VbCrlf & VbCrlf & VbTab & "<")
    t = Replace(t,"<TABLE", VbCrlf & VbCrlf & "<")
    t = Replace(t,"</TABLE>", VbCrlf & VbCrlf)
    t = Replace(t,"<TITLE>","Page Title: ")
    t = Replace(t,"Javascript>","")
    t = Replace(t,"<STYLE type=text/css>","<")
    t = Replace(t,"<STYLE>","<")
    t = Replace(t,"</STYLE>",">")
    t = Replace(t,"<!-- SAVED FROM URL=","Saved from url: ")
    t = Replace(t,"ONCLICK=","<")

    '----------Save internet links?--------------
    If KeepLinks = True Then
        t = Replace(t,"HREF=", VbCrlf & ">_Link: ")
        t = Replace(t,"ID="," <id=")
        t = Replace(t,"href=", VbCrlf & ">_Link: ")
        t = Replace(t,"id=","<id=")
    End If

    '------------------------
    If InStr(t,"<") > 0 Or InStr(t,">") > 0 Then
        ' Copy only the characters that lie outside <...> tags
        u = ""
        For i = 1 To Len(t)
            c = Mid(t,i,1)
            If c = "<" Then
                IsText = False
            ElseIf c = ">" Then
                IsText = True
            End If
            If IsText = True And c <> ">" Then
                u = u & c
            End If
        Next
        t = u
        u = ""
        Text = Text & VbCrlf & t
    Else
        If t <> "-->" And IsText = True Then
            Text = Text & VbCrlf & t
        End If
    End If
Loop
t = ""
ts.Close

'----------delete useless blank lines------------------
' Collapse runs of spaces (two spaces -> one space)
Do While InStr(Text, "  ")
    Text = Replace(Text, "  ", " ")
Loop
Do While InStr(Text, VbTab & " ")
    Text = Replace(Text, VbTab & " ", VbTab)
Loop
Do While InStr(Text, " " & VbTab)
    Text = Replace(Text, " " & VbTab, VbTab)
Loop
Do While InStr(Text, VbTab & VbCrlf)
    Text = Replace(Text, VbTab & VbCrlf, VbCrlf)
Loop
Do While InStr(Text, " " & VbCrlf)
    Text = Replace(Text, " " & VbCrlf, VbCrlf)
Loop
' Allow at most two consecutive empty lines
Do While InStr(Text, VbCrlf & VbCrlf & VbCrlf & VbCrlf)
    Text = Replace(Text, VbCrlf & VbCrlf & VbCrlf & VbCrlf, VbCrlf & VbCrlf & VbCrlf)
Loop
Text = Replace(Text, "_Link: ", VbCrlf & "_Link: ")

'---------------------write to file----------------------
' The file name must be a quoted string; True overwrites an existing file
Set objOutputFile = fso.CreateTextFile("temp.txt", True)
objOutputFile.Write Text
objOutputFile.Close


Uh... What the hell are you doing? You revive a 4 month dead thread, to reply with code that isn't even AutoIt??

And your code is huge. Why would you download the entire source of the page, then "convert" it into text when you could tap into an Internet Explorer COM object directly and just read it right off the page? No need to dump out your entire toolbox just to hammer a nail.

*Edit: Removed a rude comment, replaced it with some more nagging dialog.

Edited by Saunders

Saunders,

First, I didn't notice (and anyway don't care) that the thread is four months old. Would you stop checking this forum after 4 months?

I just noticed it was in the SECOND page of the forum and therefore still actual.

The code posted by SvenP is not VBS and even translated to vbs, it doesn't work. Anyway, his code, if it worked, would download exactly the entire source of the page, and you would still need my "huge" code to do something with it.

My code is by no means huge, considering what it does.

The 1st script just pops up a msgbox, and it's no longer than SvenP's.

The 2nd script downloads the source to a text file in an easy-to-edit format and WITHOUT ERROR.

The 3rd script sorts out text and links from the HTML source (here saved as text).

If you join code 2 and code 3, you practically have a text-based web browser. Please try to do it smaller if you can.


"Please try to do it smaller if you can."

Okay.

$o_object = ObjCreate("InternetExplorer.Application")
If IsObj($o_object) Then
    $o_object.visible = 0
    $o_object.navigate ("http://www.google.com/")
    While $o_object.busy
        Sleep(100)
    WEnd
    $s_Text = $o_object.document.body.innerText
    $o_object.quit()
EndIf

MsgBox(0, 'Page Text', $s_Text)

What really bothered me was not that you replied to a dead topic, was not that you provided large functions, but that you replied in an AutoIt forum with strictly VBS code.

"The code posted by SvenP is not VBS and even translated to vbs, it doesn't work."

Wow, it's not VBS? Perhaps it's AutoIt code. Who would have guessed that would show up in an AutoIt forum?

"First, I didn't notice (and anyway don't care) that the thread is four months old. Would you stop checking this forum after 4 months?"

"I just noticed it was in the SECOND page of the forum and therefore still actual."

Uh yeah, maybe it was on the second page, after we both replied to it... But I'll admit to reviving dead threads on occasion, it happens, no biggie. Like I said, it was the content of your post that bothered me more than the fact you brought this back up.

Anyway, I'll admit, maybe I flew off the handle a little bit, but I'd just finished reading several other stupid posts (which probably more deserved my flaming) and was in a grumpy sleep deprived mood when I hit yours and kind of just snapped.

Btw: The code I got from above was all taken from Dale's IE automation UDF set (I just took out the bits I needed for the example).

*Edit: Lots of rewording.

Edited by Saunders

I'm posting in the forum for AutoItX which is a dll for VBS implementation, at least that's the way I use it.

I assume that if you want to talk strictly autoit code, you would post in another section of the forum.

About your code, I must admit it's shorter, but I know from experience that it's much slower than the .responseText method with the MSXML2.XMLHTTP object, especially when downloading a large number of pages.

It also doesn't allow you to manipulate the source code and extract data other than text.

Edited by Fredledingue
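
For what it's worth, here is a small AutoIt sketch of pulling something other than text out of page source, assuming the source is already held in a string (for example from .responseText). The sample HTML and the regular expression are illustrative only; a pattern like this will miss unquoted or single-quoted attributes.

```autoit
; Sketch only: collect href values from page source held in a string.
; The sample markup below stands in for downloaded source.
Local $sSource = '<a href="http://www.autoitscript.com">AutoIt</a> <a href="/forum">Forum</a>'

; Flag 3 returns an array with every match of the capture group.
Local $aLinks = StringRegExp($sSource, '(?i)href="([^"]*)"', 3)

If @error Then
    MsgBox(0, "Links", "No links found.")
Else
    Local $sList = ""
    For $i = 0 To UBound($aLinks) - 1
        $sList &= $aLinks[$i] & @CRLF
    Next
    MsgBox(0, "Links", $sList)
EndIf
```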

"I'm posting in the forum for AutoItX which is a dll for VBS implementation, at least that's the way I use it."

Perhaps, but it still seemed odd to me that you didn't even use or mention the AutoItX.dll. I just would have mentioned in my post that you could do it exclusively in VB. Also, the .dll can be used from many languages, not just VB.

Anyway, I apologize for being rude. It was unbecoming of me. I'd had a bad couple of days and while that doesn't excuse my behaviour, I hope it provides a little understanding.


I tried my own version combining InetGet and IE automation. I get an error at $iE.document, and if I try to set it with $oDoc = $iE.document I still get an error, something like "can't do that operation on object" :/ Here's my code.

This should work; I don't know why it doesn't :)

func getTextToIE($strURL)
Dim $objError = @Error
Dim $oDoc
    Dim $strResult;
    
        
       ; Create the WinHTTPRequest ActiveX Object and IEObject.
        dim $WinHttpReq = ObjCreate("WinHttp.WinHttpRequest.5.1")
        dim $iE = ObjCreate("InternetExplorer.Application")
       ;  Create an HTTP request.
        Dim $temp = $WinHttpReq.Open("GET", $strURL, false)

       ;  Send the HTTP request.
        $WinHttpReq.Send()
        
       ;  Retrieve the response text.
        $strResult = $WinHttpReq.ResponseText
   
    
   ; Return the response text To Ie.
return $strResult
$iE.RegisterAsBrowser = 1 
;get to a blank page 
$iE.navigate("about:blank")
$iE.Visible = 1
;make a Doc Object for automation ?
$iE.document = $oDoc
$oDoc.body.insertAdjacentHtml(0, $strResult)

 
   If @Error = 1 then 
            $strResult = $objError 
        $strResult = $strResult & "WinHTTP returned error: " +_ 
            ($objError.number & 0xFFFF).toString()  
        $strResult = $strResult & $objError.description
    MsgBox(2, "Error Raised", $strResult)       
Endif
EndFunc
dim $objText = getTextToIE("http://www.google.com")
Edited by WSCPorts

You've got some funky syntax in here and I'm really not clear on what you are trying to do.

But you know what, I really wouldn't pile onto this post and keep it alive with your reply if I were you... too much water over the dam in this one. If you want some discussion on your code I'd suggest a new thread.

Dale
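
For reference, a rough AutoIt sketch of what the function above appears to be aiming for: fetch the source with WinHttp, then show it in a blank IE window. This is a guess at the intent, not a fix endorsed by anyone in the thread. The function name _ShowSourceInIE is made up for the example, it sets innerHTML instead of calling insertAdjacentHTML, and it installs no COM error handler, so a failed request would still stop the script.

```autoit
; Sketch only: fetch a page with WinHttp, then show the markup in a blank IE window.
Func _ShowSourceInIE($sURL)
    Local $oHttp = ObjCreate("WinHttp.WinHttpRequest.5.1")
    If Not IsObj($oHttp) Then Return SetError(1, 0, "")

    $oHttp.Open("GET", $sURL, False)
    $oHttp.Send()
    Local $sHtml = $oHttp.ResponseText

    Local $oIE = ObjCreate("InternetExplorer.Application")
    If Not IsObj($oIE) Then Return SetError(2, 0, $sHtml)

    $oIE.Visible = 1
    $oIE.Navigate("about:blank")
    While $oIE.Busy
        Sleep(100)
    WEnd

    ; Read the document object from IE; the posted code assigned in the
    ; other direction ($iE.document = $oDoc).
    Local $oDoc = $oIE.document
    $oDoc.body.innerHTML = $sHtml

    ; Return only after the IE work is done; the posted code returned
    ; before ever reaching it.
    Return $sHtml
EndFunc

Local $sSource = _ShowSourceInIE("http://www.google.com")
```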
