
_ScreenScrape UDF

22 posts in this topic

Posted (edited)

This is my first UDF, so go easy on me =) It depends on Larry's awesome RealFileReading functions, which I hope become part of the standard distro. If you don't have those yet, just paste them at the bottom of a script running this function. (Update: Scrape.au3 is attached with all the needed code. Just drop it in your include dir!)

The syntax looks like this:

_ScreenScrape( 'URL', 'String before', 'String after', [..., 'before', 'after', 'before', 'after',...] )

Let's jump straight to the examples so you can see how easy it is. (If you're not sure what screen scraping is, try reading this article.)

Examples

;;;scrape google for the number of web pages they index

$google = _ScreenScrape('http://www.google.com','Searching ',' web pages')
MsgBox(1,'',$google)

;;;scrape microsoft for the last time they updated their homepage

$microsoft = _ScreenScrape('http://www.microsoft.com','Last Updated: ',' Pacific Time')
MsgBox(1,'',$microsoft)

;;;scrape wikipedia for the total number of articles

$wikipedia = _ScreenScrape('http://en.wikipedia.org/','Statistics">','</a> articles.')
MsgBox(1,'',$wikipedia)

;;;advanced: scrape the wikipedia statistics page for six things at once!!
#include <array.au3>

Global $wikipediaStatistics = _ScreenScrape('http://en.wikipedia.org/wiki/Special:Statistics', _ 
                                           'Wikipedia currently has <b>','</b> <a href="/wiki/Wikipedia:What_is_an_article"', _
                                           'Including these, we have <b>','</b> pages.</p>', _ 
                                           '<p>Users have made <b>', '</b> edits since July 2002', _ 
                                           'an average of <b>','</b> edits per page.</p>', _ 
                                           '<p>We have <b>','</b> registered users', _ 
                                           'of which <b>','</b> are <a hr')
_ArrayDisplay($wikipediaStatistics,'')

_ScreenScrape:

;===============================================================================
;
; Function Name:    _ScreenScrape
; Description:      Easily screen scrape any web page for the text you want
; Parameter(s):     $ss_URL - The website to scrape
;                   $ss_1   - The string occurring before the text you want
;                   $ss_2   - The string occurring after the text you want
;                   ...
;                   $ss_19  - The string occurring before the text you want
;                   $ss_20  - The string occurring after the text you want
; Requirement(s):   _UnFormat, _RealFileClose, _RealFileRead, _RealFileOpen
; Return Value(s):  A string if there is only one result; an array if there
;                   is more than one
; Author(s):        Alterego http://www.br1an.net
; Note(s):          Woot!
;
;===============================================================================

Func _ScreenScrape($ssURL, $ss_1, $ss_2, $ss_3 = 0, $ss_4 = 0, $ss_5 = 0, $ss_6 = 0, $ss_7 = 0, $ss_8 = 0, $ss_9 = 0, $ss_10 = 0, $ss_11 = 0, $ss_12 = 0, $ss_13 = 0, $ss_14 = 0, $ss_15 = 0, $ss_16 = 0, $ss_17 = 0, $ss_18 = 0, $ss_19 = 0, $ss_20 = 0)
    Local $ss_NumParam = @NumParams
    Local $ss_CountOdd = 1
    Local $ss_CountEven = 2
    Local $ss_Half = $ss_NumParam / 2
    Local $ss_Data[$ss_NumParam + 1]
    Local $ss_Return[$ss_Half]
    For $ss_Primer = 1 To $ss_NumParam - 1 ; start at 1: index 0 (the URL) is never read, and $ss_0 does not exist
        $ss_Data[$ss_Primer] = _UnFormat (Eval('ss_' & String($ss_Primer)))
    Next
    Local $file = @TempDir & "\" & Random(500000, 1000000, 1) & ".scrape"
    InetGet($ssURL, $file, 1, 0)
    Local $ss_Handle = _RealFileOpen ($file)
    Local $ss_ReadOnce = _RealFileRead ($ss_Handle, FileGetSize($file))
    Local $ss_PermanentStore = _UnFormat ($ss_ReadOnce[0])
    For $ss_Scrape = 0 to ($ss_NumParam - 2) / 2
        $ss_TemporaryStore = $ss_PermanentStore
        $ss_TemporaryStore = StringTrimLeft($ss_TemporaryStore, StringInStr($ss_TemporaryStore, $ss_Data[$ss_CountOdd], 1, 1) + StringLen($ss_Data[$ss_CountOdd]) - 1)
        $ss_TemporaryStore = StringTrimRight($ss_TemporaryStore, StringLen($ss_TemporaryStore) - StringInStr($ss_TemporaryStore, $ss_Data[$ss_CountEven]) + 1)
        $ss_CountOdd = $ss_CountOdd + 2
        $ss_CountEven = $ss_CountEven + 2
        $ss_Return[$ss_Scrape] = $ss_TemporaryStore
    Next
    _RealFileClose ($ss_Handle)
    FileDelete($file)   
    If UBound($ss_Return) = 1 Then 
        Return $ss_Return[0]
    Else
        Return $ss_Return
    EndIf
EndFunc
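The heart of the UDF is simple: for each before/after pair it searches the page from the start, finds the first occurrence of the "before" marker, then keeps everything up to the next "after" marker. As a rough illustration (not part of the UDF, and Python rather than AutoIt), that matching logic looks like this:

```python
def scrape_between(html, pairs):
    """Illustrative sketch of the _ScreenScrape matching logic.

    For each (before, after) pair, search the document from the start,
    find the first `before`, then the first `after` following it, and
    keep the text in between. Returns a string for a single pair or a
    list for several, mirroring the UDF's string-or-array return value.
    """
    results = []
    for before, after in pairs:
        start = html.find(before)
        if start == -1:
            results.append("")  # "before" marker not found: empty result
            continue
        start += len(before)
        end = html.find(after, start)
        results.append(html[start:end] if end != -1 else "")
    return results[0] if len(results) == 1 else results
```

Note that each pair is matched against the full page again, which is why the same page can be scraped for several things at once.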

scrape.au3

Edited by Alterego




Posted

Interesting. But using code tags would enable people to copy the code correctly.


Posted

I don't know why the codebox makes it all one line; that is so dumb.


Posted

Multiple lines here, about 15.


Posted (edited)

With this update (see the original post) you can scrape the same page for several things at once, still in only one line of code! The fastest way to test it is to download scrape.au3 to your include dir and use that.

PS: even with this complete rewrite, all the old syntax still works. Backwards compatibility, baby :)

Changelog

22 March 05: Complete rewrite allowing one to scrape the same page for multiple strings
19 March 05: Minor fixes
Edited by Alterego


Posted

I've received several PMs asking for examples of this, and also about generating RSS feeds. AutoIt is powering this page, so that should help.


Posted

Hmm... Maybe I'm missing the obvious here, or I'm jumping ahead because I'm excited that this could be a big time-saver for me, so I'm overlooking the details, but I am missing the _ArrayDisplay function...

Am I doing something stupid here, or is there something else that should be included? :)


Posted

The _ArrayDisplay() UDF is part of the array.au3 file.

To use the UDF functions, you have to #include it in your script.

Example:

#include <array.au3>

$array = StringSplit("foo,bar", ",")
_ArrayDisplay($array, "test")


Posted

My apologies, I added that to the example. My test script environment has all the includes in by default, so I overlooked it.


Posted


No problem, I just commented that stuff out and tried the rest as it was. Thanks for the reply SteveR. :D

I played with this a little bit, but most of what I work with is web pages, and it's not really in my best interest to have markup mixed into my scrape results, such as line breaks and text formatting. It would be really cool to have a script remove all of the code from a document before/after scraping. Maybe something that finds the first < then the next > and counts the spaces in between, then trims the middle out to remove all of the obvious/standard bits of HTML.

I will try to play with this a little, but if someone beats me to it I won't be upset. :)

Excellent work so far! I am glad someone else is working on this!


Posted (edited)

Func _html2txt($html)
    $Html2TxT = StringRegExpReplace($html, "<.[^>]*>" , "")
    Return $Html2TxT
EndFunc;==>html2txt

Written by supergg02; I use it quite often and it works well. You must be using the latest beta for StringRegExpReplace.

I also strip all @CR, @LF, and @CRLF, both from your input and from the document, to make matching easier.
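The tag-stripping pattern isn't AutoIt-specific; as an illustration, the same regex in Python (using the standard `re` module) behaves the same way:

```python
import re

def html2txt(html):
    # Same pattern as the AutoIt _html2txt UDF above: delete anything
    # that looks like an HTML tag. It's a blunt instrument, not a real
    # parser -- tags containing ">" inside attribute values will trip it up.
    return re.sub(r"<.[^>]*>", "", html)
```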

Edited by Alterego


Posted

Func _html2txt($html)
    Return StringRegExpReplace($html, "<.[^>]*>" , "")
EndFunc;==>html2txt


Thanks Alterego!

I would also like to thank Larry for acting as the "Mr Clean"-inspired image would suggest and cleaning the code up. You're like the wise code janitor picking up after all of us. We appreciate it! :)


Posted (edited)

Could this be used to parse the *entire* contents of a page?

For example, I've used a VB and now a VB.NET application which reads in the entire web page, then parses it out one (1) character at a time and sends the pure text as an SMS message (to GSM-enabled phones only).

Could I process, for example, http://www.cnn.com/ with this?

(BTW, I downloaded the latest beta and html2txt throws an error, unknown function name??)

Edited by AutoIt


Posted

Not sure why regexes don't work for you... this is the best you're gonna get without doing some serious keyword processing to filter out JavaScript:

$file = @HomeDrive & '\cnn.txt' 
InetGet('http://www.cnn.com', $file, 1)
$text = StringStripWS(StringStripWS(StringStripWS(StringRegExpReplace(StringStripCR(StringReplace(StringReplace(FileRead($file, FileGetSize($file)), @CRLF, '', 0, 0), @LF, '', 0, 0)), "<.[^>]*>", ""),1),2),4)
ClipPut($text)

The best solution is to use the text browser lynx, which returns not-bad-at-all output:

$file = @HomeDrive & '\cnn.txt'
RunWait(@ComSpec & ' /c lynx -dump --accept_all_cookies -nolist http://www.cnn.com > ' & $file)
$text = StringStripWS(StringStripWS(StringStripWS(StringStripCR(StringReplace(StringReplace(FileRead($file, FileGetSize($file)), @CRLF, '', 0, 0), @LF, '', 0, 0)),1),2),4)
ClipPut($text)

You'll notice the final example does not contain a regex, so you can use it as soon as you download lynx and put it in \Windows\.
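The chained StringStripWS calls above are just whitespace cleanup: flags 1, 2 and 4 strip leading, trailing and repeated internal whitespace. As a rough sketch of that whole cleanup step in Python (an illustration, not a drop-in replacement):

```python
import re

def clean_page_text(raw):
    # Rough equivalent of the chained AutoIt calls above: drop CR/LF,
    # strip HTML tags, then trim the ends and collapse runs of
    # whitespace (what StringStripWS flags 1, 2 and 4 do).
    text = raw.replace("\r\n", "").replace("\n", "").replace("\r", "")
    text = re.sub(r"<.[^>]*>", "", text)
    return re.sub(r"\s{2,}", " ", text).strip()
```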


Posted

----Off Topic----

there is a lynx for windows :)

----Off Topic----


Posted

Thanks for the sample. I tried that and it throws an "unknown function" error.

If I comment out your last sample, the original search works fine, but the moment I try either html2txt or the cnn.com sample you posted, bam! "Unknown function", and it points to the line where the cnn.com parsing starts, or in the case of html2txt, the line where that function starts.

I downloaded the beta 3.1.1 from http://www.autoitscript.com/autoit3/files/beta/autoit/

Uninstalled the previous version, installed 3.1.1, and still no joy.


Posted


Sure, you can either run Cygwin or download a lynx compiled for Windows.


Right, I don't know why that is, but you should be able to get along fine using the last code example I provided, as it only uses functions in the last stable distribution.


Posted (edited)

The error points to line 3:

StringRegExpReplace

Apparently that is an "unknown function".

Hmm... I've now tried this with versions 3.1.0 and 3.1.1 (the latest beta); then I read this page:

http://www.autoitscript.com/forum/index.ph...&st=0&p=68496

Edited by AutoIt


Posted

I'm also having some trouble; permit me to post a couple of questions and comments.

1. I installed version 3.1.1 from the public beta download

2. The original sample from Alterego works (i.e. getting the number from google.com)

3. html2txt and the www.cnn.com sample do not work; they show an "unknown function" error

4. I use an include to have all of Larry's excellent file handling routines

This piece of code does not work because StringRegExpReplace is an unknown function:

#include <E:\Program Files\AutoIt3\Examples\English\FileInclude.au3>

$file = @HomeDrive & '\cnn.txt'
InetGet('http://www.cnn.com', $file, 1)
$text = StringStripWS(StringStripWS(StringStripWS(StringRegExpReplace(StringStripCR(StringReplace(StringReplace(FileRead($file, FileGetSize($file)), @CRLF, '', 0, 0), @LF, '', 0, 0)), "<.[^>]*>", ""),1),2),4)
ClipPut($text)


Posted

They removed StringRegExpReplace from all releases, I believe, so if you don't have it now you aren't going to get it. I'm not sure why this choice was made.

An alternative is to use the lynx example I provided; it returns better output anyway. Lynx requires no installation, just download it from somewhere and drop it in \Windows\.

