
_ScreenScrape UDF


21 replies to this topic

#1 Alterego

    Polymath

  • Active Members
  • 200 posts
Posted 19 March 2005 - 06:52 AM

This is my first UDF, so go easy on me =) This has the sub-requirement of Larry's awesome RealFileReading functions, which I hope become part of the standard distro. If you don't have those yet, just paste them at the bottom of a script running this function. (Update: Scrape.au3 attached with all needed code. Just drop it in your include dir!)

The syntax looks like this:
_ScreenScrape( 'URL', 'String before', 'String after', [..., 'before', 'after', 'before', 'after',...] )


Let's jump straight to the examples so you can see how easy it is (if you're not sure what screen scraping is, try reading this article):

Examples
;;; scrape google for the number of web pages they index
$google = _ScreenScrape('http://www.google.com', 'Searching ', ' web pages')
MsgBox(1, '', $google)

;;; scrape microsoft for the last time they updated their homepage
$microsoft = _ScreenScrape('http://www.microsoft.com', 'Last Updated: ', ' Pacific Time')
MsgBox(1, '', $microsoft)

;;; scrape wikipedia for the total number of articles
$wikipedia = _ScreenScrape('http://en.wikipedia.org/', 'Statistics">', '</a> articles.')
MsgBox(1, '', $wikipedia)

;;; advanced: scrape the wikipedia statistics page for six things at once!
#include <array.au3>
Global $wikipediaStatistics = _ScreenScrape('http://en.wikipedia.org/wiki/Special:Statistics', _
        'Wikipedia currently has <b>', '</b> <a href="/wiki/Wikipedia:What_is_an_article"', _
        'Including these, we have <b>', '</b> pages.</p>', _
        '<p>Users have made <b>', '</b> edits since July 2002', _
        'an average of <b>', '</b> edits per page.</p>', _
        '<p>We have <b>', '</b> registered users', _
        'of which <b>', '</b> are <a hr')
_ArrayDisplay($wikipediaStatistics, '')


_ScreenScrape:
;===============================================================================
;
; Function Name:    _ScreenScrape
; Description:      Easily screen scrape any web page for the text you want
; Parameter(s):     $ss_URL - The website to scrape
;                   $ss_1   - The string occurring before the text you want
;                   $ss_2   - The string occurring after the text you want
;                   ...
;                   $ss_19  - The string occurring before the text you want
;                   $ss_20  - The string occurring after the text you want
; Requirement(s):   _UnFormat, _RealFileClose, _RealFileRead, _RealFileOpen
; Return Value(s):  If only one result, returns a string. If more than one
;                   result, returns an array
; Author(s):        Alterego http://www.br1an.net
; Note(s):          Woot!
;
;===============================================================================
Func _ScreenScrape($ss_URL, $ss_1, $ss_2, $ss_3 = 0, $ss_4 = 0, $ss_5 = 0, $ss_6 = 0, $ss_7 = 0, $ss_8 = 0, $ss_9 = 0, $ss_10 = 0, _
        $ss_11 = 0, $ss_12 = 0, $ss_13 = 0, $ss_14 = 0, $ss_15 = 0, $ss_16 = 0, $ss_17 = 0, $ss_18 = 0, $ss_19 = 0, $ss_20 = 0)
    Local $ss_NumParam = @NumParams
    Local $ss_CountOdd = 1
    Local $ss_CountEven = 2
    Local $ss_Half = $ss_NumParam / 2
    Local $ss_Data[$ss_NumParam + 1]
    Local $ss_Return[$ss_Half]
    For $ss_Primer = 1 To $ss_NumParam - 1
        $ss_Data[$ss_Primer] = _UnFormat(Eval('ss_' & String($ss_Primer)))
    Next
    Local $file = @TempDir & "\" & Random(500000, 1000000, 1) & ".scrape"
    InetGet($ss_URL, $file, 1, 0)
    Local $ss_Handle = _RealFileOpen($file)
    Local $ss_ReadOnce = _RealFileRead($ss_Handle, FileGetSize($file))
    Local $ss_PermanentStore = _UnFormat($ss_ReadOnce[0])
    For $ss_Scrape = 0 To ($ss_NumParam - 2) / 2
        $ss_TemporaryStore = $ss_PermanentStore
        $ss_TemporaryStore = StringTrimLeft($ss_TemporaryStore, StringInStr($ss_TemporaryStore, $ss_Data[$ss_CountOdd], 1, 1) + StringLen($ss_Data[$ss_CountOdd]) - 1)
        $ss_TemporaryStore = StringTrimRight($ss_TemporaryStore, StringLen($ss_TemporaryStore) - StringInStr($ss_TemporaryStore, $ss_Data[$ss_CountEven]) + 1)
        $ss_CountOdd = $ss_CountOdd + 2
        $ss_CountEven = $ss_CountEven + 2
        $ss_Return[$ss_Scrape] = $ss_TemporaryStore
    Next
    _RealFileClose($ss_Handle)
    FileDelete($file)
    If UBound($ss_Return) = 1 Then
        Return $ss_Return[0]
    Else
        Return $ss_Return
    EndIf
EndFunc   ;==>_ScreenScrape
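For readers who want the gist without AutoIt: the marker-pair extraction _ScreenScrape performs can be sketched in Python. This is an illustrative rough equivalent, not the UDF itself; the page text is passed in directly rather than fetched with InetGet, and the function name is my own.

```python
# Rough Python sketch of the _ScreenScrape idea: for each (before, after)
# marker pair, return the text that sits between the two markers.
# Illustration only -- not part of the original AutoIt UDF.

def screen_scrape(page_text, *markers):
    """markers is a flat sequence: before1, after1, before2, after2, ..."""
    if len(markers) % 2 != 0:
        raise ValueError("markers must come in before/after pairs")
    results = []
    for i in range(0, len(markers), 2):
        before, after = markers[i], markers[i + 1]
        start = page_text.find(before)
        if start == -1:
            results.append("")        # 'before' marker not found
            continue
        start += len(before)
        end = page_text.find(after, start)
        results.append(page_text[start:end] if end != -1 else "")
    # Mirror the UDF: a single pair returns a string, several return a list
    return results[0] if len(results) == 1 else results

html = 'Wikipedia currently has <b>500,000</b> articles and <b>40,000</b> users'
print(screen_scrape(html, 'has <b>', '</b> articles'))           # 500,000
print(screen_scrape(html, 'has <b>', '</b>', 'and <b>', '</b>'))
```

As in the UDF, markers are matched literally, so the "before" string must include any surrounding markup exactly as it appears in the page source.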

Attached Files


Edited by Alterego, 30 March 2005 - 01:26 AM.








#2 MHz

    Just simple

  • MVPs
  • 5,724 posts

Posted 19 March 2005 - 07:48 AM

Interesting. But using code tags would let people copy the code correctly.

#3 steveR

    Computo, ergo sum.

  • Active Members
  • 353 posts

Posted 19 March 2005 - 07:50 AM

I don't know why the codebox makes it all one line, that is so dumb.
AutoIt3 online docs Use it... Know it... Live it...MSDN libraryglobal Help and SupportWindows: Just another pane in the glass.

#4 Insolence

    Not distastefully arrogant

  • Active Members
  • 1,304 posts

Posted 19 March 2005 - 08:00 AM

Multiple lines here, about 15.
"I thoroughly disapprove of duels. If a man should challenge me, I would take him kindly and forgivingly by the hand and lead him to a quiet place and kill him." - Mark TwainPatient: "It hurts when I do $var_"Doctor: "Don't do $var_" - Lar.

#5 Alterego

    Polymath

  • Active Members
  • 200 posts

Posted 22 March 2005 - 10:02 AM

With this update (see original post) you can scrape the same page for several things at once, still in only one line of code! The fastest way to test it is to download scrape.au3 to your include dir and use that.

PS: even with this complete rewrite, all old syntax still works. Backwards compatibility, baby :)

Changelog
22 March 05: Complete rewrite allowing one to scrape the same page for multiple strings
19 March 05: Minor fixes

Edited by Alterego, 22 March 2005 - 06:51 PM.


#6 Alterego

    Polymath

  • Active Members
  • 200 posts

Posted 25 March 2005 - 08:01 PM

I've received several PMs asking for examples from this and also generating RSS feeds. AutoIt is powering this page, so that should help.

#7 cybie

    Seeker

  • Active Members
  • 18 posts

Posted 29 March 2005 - 05:08 PM

Hmm... Maybe I'm missing the obvious here, or I'm jumping ahead because I'm excited that this could be a big time-saver for me, so I'm overlooking the details, but I am missing the _ArrayDisplay function...

Am I doing something stupid here, or is there something else that should be included? :)
Writing damaged code since 1996.

#8 steveR

    Computo, ergo sum.

  • Active Members
  • 353 posts

Posted 29 March 2005 - 07:23 PM

The _ArrayDisplay() UDF is part of the array.au3 file.

To use the udf functions, you have to #Include it in your script.

Example:

#include <array.au3>
$array = StringSplit("foo,bar", ",")
_ArrayDisplay($array, "test")


#9 Alterego

    Polymath

  • Active Members
  • 200 posts

Posted 30 March 2005 - 01:25 AM

My apologies. I added that to the example. My test script environment has all the includes in by default, so I overlooked it.

#10 cybie

    Seeker

  • Active Members
  • 18 posts

Posted 30 March 2005 - 01:49 PM

my apologies. i added that to the example.  my test script environment has all the includes in by default so i overlooked it


No problem, I just commented that stuff out and tried the rest as it was. Thanks for the reply SteveR. :D

I played with this a little bit, but most of what I work with is web pages, and it's not really in my best interest to have markup interjected into my scrape results, such as line breaks and text formatting. It would be really cool to have a script remove all of the code from a document before/after scraping. Maybe something that finds the first <, then the next >, counts the span in between, then trims the middle out to remove all of the obvious/standard bits of HTML.

I will try to play with this a little, but if someone beats me to it I won't be upset. :)

Excellent work so far! I am glad someone else is working on this!

#11 Alterego

    Polymath

  • Active Members
  • 200 posts

Posted 30 March 2005 - 04:37 PM

Func _html2txt($html)
    $Html2TxT = StringRegExpReplace($html, "<.[^>]*>", "")
    Return $Html2TxT
EndFunc   ;==>_html2txt


Written by supergg02. I use it quite often and it works well. You must be using the latest beta for StringRegExpReplace.

I also strip all @CR, @LF, and @CRLF characters, both from your input and from the document, to make matching easier.
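The same tag-stripping regex translates almost verbatim to Python's re module. A sketch of the idea only, not the forum's AutoIt code:

```python
import re

def html2txt(html):
    # Remove anything that looks like an HTML tag: '<', one character,
    # a run of non-'>' characters, then '>' -- the same "<.[^>]*>"
    # pattern used in the AutoIt StringRegExpReplace call above.
    return re.sub(r"<.[^>]*>", "", html)

print(html2txt("<p>We have <b>42</b> users</p>"))  # We have 42 users
```

Like the original pattern, this is a blunt instrument: it leaves the contents of script and style blocks behind, which is why the thread later turns to keyword filtering and lynx.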

Edited by Alterego, 30 March 2005 - 04:48 PM.


#12 cybie

    Seeker

  • Active Members
  • 18 posts

Posted 30 March 2005 - 07:56 PM

Func _html2txt($html)
    Return StringRegExpReplace($html, "<.[^>]*>", "")
EndFunc   ;==>_html2txt


Thanks Alterego!

I would also like to thank Larry for acting as the "Mr Clean"-inspired image would suggest and cleaning the code up. You're like the wise code janitor picking up after all of us. We appreciate it! :)

#13 AutoIt

    Seeker

  • Active Members
  • 25 posts

Posted 31 March 2005 - 12:22 AM

could this be used to parse the *entire* contents of a page?

for example, I've used a VB and now a vb.net application which inputs the entire web page and then proceeds to parse out one(1) character at a time and sends the pure text as an SMS (only to GSM enabled) phone message

could I process with this, for example http://www.cnn.com/ ?

(BTW, I downloaded the latest beta and the html2txt throws an error, unknown function name ??)

Edited by AutoIt, 31 March 2005 - 12:38 AM.


#14 Alterego

    Polymath

  • Active Members
  • 200 posts

Posted 31 March 2005 - 01:08 AM

Not sure why RegExes don't work for you... this is the best you're going to get without doing some serious keyword processing to filter JavaScript:

$file = @HomeDrive & '\cnn.txt'
InetGet('http://www.cnn.com', $file, 1)
$text = StringStripWS(StringStripWS(StringStripWS(StringRegExpReplace(StringStripCR(StringReplace(StringReplace(FileRead($file, FileGetSize($file)), @CRLF, '', 0, 0), @LF, '', 0, 0)), "<.[^>]*>", ""), 1), 2), 4)
ClipPut($text)
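The nested StringReplace/StringStripWS chain above does just two things: strip anything tag-like, then collapse whitespace. The cleanup step can be sketched in Python like so (an illustrative rewrite, not the original code; the page fetch is left out so the cleanup stands alone):

```python
import re

def strip_page(html):
    # Same two steps as the nested AutoIt calls above: drop anything
    # tag-like, then collapse runs of whitespace (CR, LF, spaces) into
    # single spaces. Illustrative only -- not the original AutoIt code.
    text = re.sub(r"<.[^>]*>", "", html)      # strip HTML-like tags
    return re.sub(r"\s+", " ", text).strip()  # collapse whitespace

print(strip_page("<html>\r\n  <body>Top\r\nStories</body>\r\n</html>"))  # Top Stories
```

Collapsing whitespace after tag removal matters: removed block tags leave stray line breaks behind, which is exactly what the stacked StringStripWS calls were cleaning up.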


The best solution is to use the browser lynx, which returns not-bad-at-all output

$file = @HomeDrive & '\cnn.txt'
RunWait(@ComSpec & ' /c lynx -dump --accept_all_cookies -nolist http://www.cnn.com > ' & $file)
$text = StringStripWS(StringStripWS(StringStripWS(StringStripCR(StringReplace(StringReplace(FileRead($file, FileGetSize($file)), @CRLF, '', 0, 0), @LF, '', 0, 0)), 1), 2), 4)
ClipPut($text)


You'll notice the final example does not contain a regex, so you can use it as soon as you download lynx and put it in \Windows\.

#15 zcoacoaz

    Forever writing useless yet interesting scripts

  • Active Members
  • 805 posts

Posted 31 March 2005 - 02:01 AM

----Off Topic----
there is a lynx for windows :)
----Off Topic----
If anyone remembers me, I am back. Maybe to stay, maybe not.----------------------------------------------------------------------------------------------------------Things I am proud of: Pong! in AutoIt | SearchbarMy website: F.R.I.E.S.A little website that is trying to get started: http://thepiratelounge.net/ (not mine) ----------------------------------------------------------------------------------------------------------The newbies need to stop stealing avatars!!! It is confusing!!

#16 AutoIt

    Seeker

  • Active Members
  • 25 posts

Posted 31 March 2005 - 04:13 AM

thanks for the sample, tried that and it throws an "unknown function error"
if I comment out your last sample, the original search works fine but the moment I try either html2txt or the sample for cnn.com you posted, bam! "unknown function" and it points to the line where the cnn.com parsing starts or in the case of html2txt the line where that function starts

I downloaded the beta 3.1.1 from http://www.autoitscript.com/autoit3/files/beta/autoit/

uninstalled previous version, installed the 3.1.1 and still no joy

#17 Alterego

    Polymath

  • Active Members
  • 200 posts

Posted 31 March 2005 - 04:41 AM

----Off Topic----
there is a lynx for windows  :)
----Off Topic----


Sure. You can either run Cygwin or download a lynx binary compiled for Windows.

thanks for the sample, tried that and it throws an "unknown function error"
if I comment out your last sample, the original search works fine but the moment I try either html2txt or the sample for cnn.com you posted, bam! "unknown function" and it points to the line where the cnn.com parsing starts or in the case of html2txt the line where that function starts

I downloaded the beta 3.1.1 from http://www.autoitscript.com/autoit3/files/beta/autoit/

uninstalled previous version, installed the 3.1.1 and still no joy


Right, I don't know why that is, but you should be able to get along fine using the last code example I provided, as it only uses functions in the last stable distribution.

#18 AutoIt

    Seeker

  • Active Members
  • 25 posts

Posted 31 March 2005 - 05:33 AM

the error points to line 3
StringRegExpReplace

apparently that is an "unknown function"

hmm... I've now tried this with version 3.1.0 and 3.1.1 (latest beta), then I read this page
http://www.autoitscript.com/forum/index.ph...&st=0&p=68496&#

Edited by AutoIt, 31 March 2005 - 05:48 AM.


#19 Guest_Nina_*

  • Guests

Posted 31 March 2005 - 05:12 PM

I'm also having some trouble; permit me to post a couple of questions and comments.

1. I installed version 3.1.1 from the public beta download
2. The original sample from Alterego works (ie. get number from google.com)
3. html2txt and the www.cnn.com sample does not work, it shows "unknown function error"

4. I use an include to have all of larry's excellent file handling routines

This piece of code does not work because "StringRegExpReplace" is an unknown function:

#include <E:\Program Files\AutoIt3\Examples\English\FileInclude.au3>

$file = @HomeDrive & '\cnn.txt'
InetGet('http://www.cnn.com', $file, 1)
$text = StringStripWS(StringStripWS(StringStripWS(StringRegExpReplace(StringStripCR(StringReplace(StringReplace(FileRead($file, FileGetSize($file)), @CRLF, '', 0, 0), @LF, '', 0, 0)), "<.[^>]*>", ""), 1), 2), 4)
ClipPut($text)

#20 Alterego

    Polymath

  • Active Members
  • 200 posts

Posted 31 March 2005 - 11:22 PM

They removed StringRegExpReplace from all releases, I believe, so if you don't have it now you aren't going to get it. I'm not sure why this choice was made.

The alternative is to use the lynx example I provided; it returns better output anyway. lynx requires no installation: just download it from somewhere and drop it in \Windows\.



