Screen scrape ---> RSS


Alterego

This is my [current] method of downloading a web page and publishing an RSS feed for it. A lot of it is reusable, and the reusable parts are mostly the things that were tedious and took the longest, for example, the layout of the RSS feed. I used the Harvard copy of the spec.
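For reference, the only channel elements that the Harvard copy of the RSS 2.0 spec actually requires are title, link, and description; everything else in the feed built below is optional. The bare minimum looks like this:

<?xml version="1.0"?>
<rss version="2.0">
   <channel>
     <title>...</title>
     <link>...</link>
     <description>...</description>
   </channel>
</rss>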

It seems to me that with a GUI it would be possible to have AutoIt go in and try to automatically generate RSS and Atom feeds from a given web page by searching for overall patterns. There could be a text input box for each field (see the tags below) where the user could copy/paste "hints" for the program to base its pattern search on; a rough sketch of that idea follows. I'll ruminate on this for a while. Any pointers?
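Something like this is what I have in mind for the pattern search, as a rough sketch: take a hint the user pasted, generalize the variable-looking parts (here, just runs of digits) into capture groups, and collect every match from the page. _GuessItems() is a made-up name and the digit heuristic is only a placeholder; a real version would also escape regex metacharacters in the hint first.

#include <Array.au3>

Func _GuessItems($sHtml, $sHint)
    ;;;turn runs of digits in the hint into capture groups, so a hint of
    ;;;"Last dump made: 2005" matches any "Last dump made: <number>"
    Local $sPattern = StringRegExpReplace($sHint, "\d+", "(\\d+)")
    ;;;flag 3 returns an array holding every match found in the page
    Return StringRegExp($sHtml, $sPattern, 3)
EndFunc

;;;example: show the user the candidate matches so they can confirm the pattern
$aHits = _GuessItems(FileRead(@TempDir & "\index.html"), "Last dump made: 2005")
If IsArray($aHits) Then _ArrayDisplay($aHits, "Candidate fields")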

;;;suggest viewing in AutoIt3 so it looks pretty :)

#include <File.au3>
#include <Array.au3>
#include <Date.au3>
#NoTrayIcon

Do 

;;;how many days had passed since the sql dump the last time we checked?
;;;reads the last line of the log file, which has the format
;;; xx MO-DA-YEAR HOUR:MIN:SEC  where xx is the number of days since the last dump
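;;;(a made-up example line: 3 05-12-2005 14:02:33)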
$log = FileOpen(@HomeDrive & "\Qwikly\wikipediadownload.txt", 1) ;;;append mode creates the log on the first run
FileClose($log)
$lines = _FileCountLines(@HomeDrive & "\Qwikly\wikipediadownload.txt")
$lastRead = FileReadLine(@HomeDrive & "\Qwikly\wikipediadownload.txt", $lines)
$lastReportedDaysSinceUpdate = StringLeft($lastRead, StringInStr($lastRead, " ", 0, 1) - 1)

;;;how many days since sql dump now?
FileDelete(@TempDir & "\index.html") ;;;delete the copy from last time
InetGet("http://download.wikimedia.org/",@TempDir & "\index.html",1,0)

$a = ""
;;;reads the downloaded page into an array, then sticks it on the clipboard for easy processing
_FileReadToArray(@TempDir & "\index.html", $a)
_ArrayToClip($a)

;;;searches the clipboard for the occurrence of "Last dump made: " and grabs
;;;the ten-character date (YYYY-MM-DD) that follows it
$h = StringMid(ClipGet(), StringInStr(ClipGet(), "Last dump made: ", 1) + 16, 10)

;;;converts the date to the YYYY/MM/DD format _DateDiff() expects and counts
;;;the days between the dump date and today
$lastDump = _DateDiff('D', StringReplace($h, "-", "/"), @YEAR & "/" & @MON & "/" & @MDAY)
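;;;(_NowCalcDate() from Date.au3 would also give today's date in YYYY/MM/DD form)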

;;;writes to our log file xx MO-DA-YEAR HOUR:MIN:SEC
FileWrite(@HomeDrive & "\Qwikly\wikipediadownload.txt",$lastDump & " " & @MON & "-" & @MDAY & "-" & @YEAR & " " & @HOUR & ":"  & @MIN & ":" & @SEC & @CRLF)

;;;compares the number of days since a dump found in our log file with the
;;;current number found on the web page. if the web-page number is smaller
;;;than the log-file number, a new dump has appeared, so publish a new rss feed
If $lastDump < $lastReportedDaysSinceUpdate Then
    $lastBuildDate = StringTrimLeft($lastRead, StringInStr($lastRead, " ", 0, 1)) ;;;the MO-DA-YEAR HOUR:MIN:SEC part of the log line
    $rss = FileOpen(@HomeDrive & "\Qwikly\wikipediadownload.rss", 2) ;;;mode 2 erases the previous feed
    FileWrite($rss, _
        '<?xml version="1.0"?>' & @CRLF & _ 
        '<rss version="2.0">' & @CRLF & _ 
        '   <channel>' & @CRLF & _ 
        '     <title>Wikipedia database download</title>' & @CRLF & _ 
        '     <link>http://download.wikimedia.org</link>' & @CRLF & _ 
        '     <description>SQL database dumps on download.wikimedia.org have historically updated approximately twice weekly, but updates are currently biweekly to monthly.</description>' & @CRLF & _ 
        '     <language>en-us</language>' & @CRLF &  _ 
        '     <copyright>All text is available under the terms of the GNU Free Documentation License</copyright>' & @CRLF & _ 
        '     <ttl>150</ttl>' & @CRLF & _ 
        '     <pubDate>' & @YEAR & "/" & @MON & "/" & @MDAY & " " & @HOUR & ":" & @MIN & ":" & @SEC & " Mountain Time" & '</pubDate>' & @CRLF & _ 
        '     <lastBuildDate>' & $lastBuildDate & '</lastBuildDate>' & @CRLF & _ 
        '     <docs>http://www.wikipedia.org/wiki/Wikipedia:Database_download</docs>' & @CRLF & _ 
        '     <generator>Qwikly.com</generator>' & @CRLF &  _ 
        '     <managingEditor>reflection+qwikly@gmail.com</managingEditor>' & @CRLF & _ 
        '     <webMaster>simple@qwikly.com</webMaster>' & @CRLF & _ 
        '     <item>' & @CRLF & _ 
        '        <title>New SQL dump detected at ' &  @HOUR & ":"  & @MIN & " on " & @MON & "-" & @MDAY & "-" & @YEAR & '</title>' & @CRLF & _ 
        '        <link>http://download.wikimedia.org</link>' & @CRLF & _ 
        '        <description>All original textual content is licensed under the GNU Free Documentation License. Text written by some authors may be released under additional licenses or into the public domain. Some text (including quotations) may be used under fair use, usually where it is believed that the use will also be fair dealing outside the USA. Note that material used as "fair use" under United States law may not be legal to reproduce outside the US. See Fair use for more information.</description>' & @CRLF & _ 
        '        <pubDate>' & @YEAR & "/" & @MON & "/" & @MDAY & " " & @HOUR & ":" & @MIN & ":" & @SEC & " Mountain Time" & '</pubDate>' & @CRLF & _ 
        '     </item>' & @CRLF & _ 
        '   </channel>' & @CRLF & _ 
        '</rss>')
    FileClose($rss) ;;;close the feed before curl tries to read it
    ;;;curl seems to be a good option for uploading via ftp...
    RunWait(@ComSpec & ' /c curl -T "' & @HomeDrive & '\Qwikly\wikipediadownload.rss" -u user:pass ftp://blah.com/public_html/', @HomeDrive & '\Qwikly\', @SW_HIDE)
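    ;;;(user:pass and ftp://blah.com/ are placeholders; this assumes curl is on the PATH)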
EndIf
;;;sleep for 2 1/2 hours before we download the web page and calculate again
Sleep(9000000)
Until 1 = 0

Great script, and thanks for mentioning me :lmao:

"I thoroughly disapprove of duels. If a man should challenge me, I would take him kindly and forgivingly by the hand and lead him to a quiet place and kill him." - Mark TwainPatient: "It hurts when I do $var_"Doctor: "Don't do $var_" - Lar.