
_ScreenScrape UDF

22 posts in this topic

Posted (edited)

This is my first UDF, so go easy on me =) It depends on Larry's awesome RealFileReading functions, which I hope become part of the standard distro. If you don't have those yet, just paste them at the bottom of a script running this function. (Update: Scrape.au3 is attached with all the needed code. Just drop it in your include dir!)

The syntax looks like this:

_ScreenScrape( 'URL', 'String before', 'String after', [..., 'before', 'after', 'before', 'after',...] )

Let's jump straight to the examples so you can see how easy it is. (If you're not sure what screen scraping is, try reading this article.)

Examples

;;;scrape google for the number of web pages they index

$google = _ScreenScrape('http://www.google.com','Searching ',' web pages')
MsgBox(1,'',$google)

;;;scrape microsoft for the last time they updated their homepage

$microsoft = _ScreenScrape('http://www.microsoft.com','Last Updated: ',' Pacific Time')
MsgBox(1,'',$microsoft)

;;;scrape wikipedia for the total number of articles

$wikipedia = _ScreenScrape('http://en.wikipedia.org/','Statistics">','</a> articles.')
MsgBox(1,'',$wikipedia)

;;;advanced: scrape the wikipedia statistics page for six things at once!!
#include <array.au3>

Global $wikipediaStatistics = _ScreenScrape('http://en.wikipedia.org/wiki/Special:Statistics', _ 
                                           'Wikipedia currently has <b>','</b> <a href="/wiki/Wikipedia:What_is_an_article"', _
                                           'Including these, we have <b>','</b> pages.</p>', _ 
                                           '<p>Users have made <b>', '</b> edits since July 2002', _ 
                                           'an average of <b>','</b> edits per page.</p>', _ 
                                           '<p>We have <b>','</b> registered users', _ 
                                           'of which <b>','</b> are <a hr')
_ArrayDisplay($wikipediaStatistics,'')

_ScreenScrape:

;===============================================================================
;
; Function Name:    _ScreenScrape
; Description:      Easily screen scrape any web page for the text you want
; Parameter(s):     $ss_URL - The website to scrape
;                   $ss_1   - The string occurring before the text you want
;                   $ss_2   - The string occurring after the text you want
;                   ...
;                   $ss_19  - The string occurring before the text you want
;                   $ss_20  - The string occurring after the text you want
; Requirement(s):   _UnFormat, _RealFileClose, _RealFileRead, _RealFileOpen
; Return Value(s):  A string if there is only one result; an array if there
;                   is more than one
; Author(s):        Alterego http://www.br1an.net
; Note(s):          Woot!
;
;===============================================================================

Func _ScreenScrape($ssURL, $ss_1, $ss_2, $ss_3 = 0, $ss_4 = 0, $ss_5 = 0, $ss_6 = 0, $ss_7 = 0, $ss_8 = 0, $ss_9 = 0, $ss_10 = 0, $ss_11 = 0, $ss_12 = 0, $ss_13 = 0, $ss_14 = 0, $ss_15 = 0, $ss_16 = 0, $ss_17 = 0, $ss_18 = 0, $ss_19 = 0, $ss_20 = 0)
    Local $ss_NumParam = @NumParams
    Local $ss_CountOdd = 1
    Local $ss_CountEven = 2
    Local $ss_Half = $ss_NumParam / 2
    Local $ss_Data[$ss_NumParam + 1]
    Local $ss_Return[$ss_Half]
    For $ss_Primer = 1 To $ss_NumParam - 1 ; start at 1: index 0 (the URL) is never read, and $ss_0 does not exist
        $ss_Data[$ss_Primer] = _UnFormat (Eval('ss_' & String($ss_Primer)))
    Next
    Local $file = @TempDir & "\" & Random(500000, 1000000, 1) & ".scrape"
    InetGet($ssURL, $file, 1, 0)
    Local $ss_Handle = _RealFileOpen ($file)
    Local $ss_ReadOnce = _RealFileRead ($ss_Handle, FileGetSize($file))
    Local $ss_PermanentStore = _UnFormat ($ss_ReadOnce[0])
    For $ss_Scrape = 0 to ($ss_NumParam - 2) / 2
        $ss_TemporaryStore = $ss_PermanentStore
        $ss_TemporaryStore = StringTrimLeft($ss_TemporaryStore, StringInStr($ss_TemporaryStore, $ss_Data[$ss_CountOdd], 1, 1) + StringLen($ss_Data[$ss_CountOdd]) - 1)
        $ss_TemporaryStore = StringTrimRight($ss_TemporaryStore, StringLen($ss_TemporaryStore) - StringInStr($ss_TemporaryStore, $ss_Data[$ss_CountEven]) + 1)
        $ss_CountOdd = $ss_CountOdd + 2
        $ss_CountEven = $ss_CountEven + 2
        $ss_Return[$ss_Scrape] = $ss_TemporaryStore
    Next
    _RealFileClose ($ss_Handle)
    FileDelete($file)   
    If UBound($ss_Return) = 1 Then 
        Return $ss_Return[0]
    Else
        Return $ss_Return
    EndIf
EndFunc
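The heart of the UDF is simple: for each before/after pair it searches the page from the start, finds the first occurrence of the "before" marker, then keeps everything up to the next "after" marker. As a rough illustration (not part of the UDF, and Python rather than AutoIt), that matching logic looks like this:

```python
def scrape_between(html, pairs):
    """Illustrative sketch of the _ScreenScrape matching logic.

    For each (before, after) pair, search the document from the start,
    find the first `before`, then the first `after` following it, and
    keep the text in between. Returns a string for a single pair or a
    list for several, mirroring the UDF's string-or-array return value.
    """
    results = []
    for before, after in pairs:
        start = html.find(before)
        if start == -1:
            results.append("")  # "before" marker not found: empty result
            continue
        start += len(before)
        end = html.find(after, start)
        results.append(html[start:end] if end != -1 else "")
    return results[0] if len(results) == 1 else results
```

Note that each pair is matched against the full page again, which is why the same page can be scraped for several things at once.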

scrape.au3

Edited by Alterego




Posted

Interesting. But using code tags would enable people to copy the code correctly.


Posted

I don't know why the codebox makes it all one line; that is so dumb.


Posted

Multiple lines here, about 15.


Posted (edited)

With this update (see the original post) you can scrape the same page for several things at once, still in only one line of code! The fastest way to test it is to download scrape.au3 to your include dir and use that.

PS: even with this complete rewrite, all the old syntax still works. Backwards compatibility, baby :)

Changelog

22 March 05: Complete rewrite allowing one to scrape the same page for multiple strings
19 March 05: Minor fixes
Edited by Alterego


Posted

I've received several PMs asking for examples of this, and also about generating RSS feeds. AutoIt is powering this page, so that should help.


Posted

Hmm... Maybe I'm missing the obvious here, or I'm jumping ahead because I'm excited that this could be a big time-saver for me, so I'm overlooking the details, but I am missing the _ArrayDisplay function...

Am I doing something stupid here, or is there something else that should be included? :)


Posted

The _ArrayDisplay() UDF is part of the array.au3 file.

To use the UDF functions, you have to #include it in your script.

Example:

#include <array.au3>

$array = StringSplit("foo,bar", ",")
_ArrayDisplay($array, "test")


Posted

My apologies, I added that to the example. My test script environment has all the includes in by default, so I overlooked it.


Posted


No problem, I just commented that stuff out and tried the rest as it was. Thanks for the reply SteveR. :D

I played with this a little bit, but most of what I work with is web pages, and it's not really in my best interest to have markup mixed into my scrape results, such as line breaks and text formatting. It would be really cool to have a script remove all of the code from a document before/after scraping. Maybe something that finds the first < then the next > and counts the spaces in between, then trims the middle out to remove all of the obvious/standard bits of HTML.

I will try to play with this a little, but if someone beats me to it I won't be upset. :)

Excellent work so far! I am glad someone else is working on this!


Posted (edited)

Func _html2txt($html)
    $Html2TxT = StringRegExpReplace($html, "<.[^>]*>" , "")
    Return $Html2TxT
EndFunc;==>html2txt

Written by supergg02; I use it quite often and it works well. You must be using the latest beta for StringRegExpReplace.

I also strip all @CR, @LF, and @CRLF, both from your input and from the document, to make matching easier.
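The tag-stripping pattern isn't AutoIt-specific; as an illustration, the same regex in Python (using the standard `re` module) behaves the same way:

```python
import re

def html2txt(html):
    # Same pattern as the AutoIt _html2txt UDF above: delete anything
    # that looks like an HTML tag. It's a blunt instrument, not a real
    # parser -- tags containing ">" inside attribute values will trip it up.
    return re.sub(r"<.[^>]*>", "", html)
```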

Edited by Alterego


Posted

Func _html2txt($html)
    Return StringRegExpReplace($html, "<.[^>]*>" , "")
EndFunc;==>html2txt


Thanks Alterego!

I would also like to thank Larry for acting as the "Mr Clean"-inspired image would suggest and cleaning the code up. You're like the wise code janitor picking up after all of us. We appreciate it! :)


Posted (edited)

Could this be used to parse the *entire* contents of a page?

For example, I've used a VB and now a VB.NET application which reads in the entire web page, then parses it out one (1) character at a time and sends the pure text as an SMS message (to GSM-enabled phones only).

Could I process, for example, http://www.cnn.com/ with this?

(BTW, I downloaded the latest beta and html2txt throws an error, unknown function name??)

Edited by AutoIt


Posted

Not sure why regexes don't work for you... this is the best you're gonna get without doing some serious keyword processing to filter out JavaScript:

$file = @HomeDrive & '\cnn.txt' 
InetGet('http://www.cnn.com', $file, 1)
$text = StringStripWS(StringStripWS(StringStripWS(StringRegExpReplace(StringStripCR(StringReplace(StringReplace(FileRead($file, FileGetSize($file)), @CRLF, '', 0, 0), @LF, '', 0, 0)), "<.[^>]*>", ""),1),2),4)
ClipPut($text)

The best solution is to use the text browser lynx, which returns not-bad-at-all output:

$file = @HomeDrive & '\cnn.txt'
RunWait(@ComSpec & ' /c lynx -dump --accept_all_cookies -nolist http://www.cnn.com > ' & $file)
$text = StringStripWS(StringStripWS(StringStripWS(StringStripCR(StringReplace(StringReplace(FileRead($file, FileGetSize($file)), @CRLF, '', 0, 0), @LF, '', 0, 0)),1),2),4)
ClipPut($text)

You'll notice the final example does not contain a regex, so you can use it as soon as you download lynx and put it in \Windows\.
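The chained StringStripWS calls above are just whitespace cleanup: flags 1, 2 and 4 strip leading, trailing and repeated internal whitespace. As a rough sketch of that whole cleanup step in Python (an illustration, not a drop-in replacement):

```python
import re

def clean_page_text(raw):
    # Rough equivalent of the chained AutoIt calls above: drop CR/LF,
    # strip HTML tags, then trim the ends and collapse runs of
    # whitespace (what StringStripWS flags 1, 2 and 4 do).
    text = raw.replace("\r\n", "").replace("\n", "").replace("\r", "")
    text = re.sub(r"<.[^>]*>", "", text)
    return re.sub(r"\s{2,}", " ", text).strip()
```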


Posted

----Off Topic----

there is a lynx for windows :)

----Off Topic----


Posted

Thanks for the sample. I tried that and it throws an "unknown function" error.

If I comment out your last sample, the original search works fine, but the moment I try either html2txt or the cnn.com sample you posted, bam! "Unknown function", and it points to the line where the cnn.com parsing starts, or in the case of html2txt, the line where that function starts.

I downloaded the beta 3.1.1 from http://www.autoitscript.com/autoit3/files/beta/autoit/

Uninstalled the previous version, installed 3.1.1, and still no joy.


Posted


Sure, you can either run Cygwin or download a lynx compiled for Windows.


Right, I don't know why that is, but you should be able to get along fine using the last code example I provided, as it only uses functions in the last stable distribution.


Posted (edited)

The error points to line 3:

StringRegExpReplace

Apparently that is an "unknown function".

Hmm... I've now tried this with versions 3.1.0 and 3.1.1 (the latest beta); then I read this page:

http://www.autoitscript.com/forum/index.ph...&st=0&p=68496

Edited by AutoIt


Posted

I'm also having some trouble; permit me to post a couple of questions and comments.

1. I installed version 3.1.1 from the public beta download

2. The original sample from Alterego works (i.e. getting the number from google.com)

3. html2txt and the www.cnn.com sample do not work; they show an "unknown function" error

4. I use an include to have all of Larry's excellent file handling routines

This piece of code does not work because StringRegExpReplace is an unknown function:

#include <E:\Program Files\AutoIt3\Examples\English\FileInclude.au3>

$file = @HomeDrive & '\cnn.txt'
InetGet('http://www.cnn.com', $file, 1)
$text = StringStripWS(StringStripWS(StringStripWS(StringRegExpReplace(StringStripCR(StringReplace(StringReplace(FileRead($file, FileGetSize($file)), @CRLF, '', 0, 0), @LF, '', 0, 0)), "<.[^>]*>", ""),1),2),4)
ClipPut($text)


Posted

They removed StringRegExpReplace from all releases, I believe, so if you don't have it now you aren't going to get it. I'm not sure why this choice was made.

An alternative is to use the lynx example I provided; it returns better output anyway. Lynx requires no installation, just download it from somewhere and drop it in \Windows\.

