Wikipedia in the palm of your hand


Alterego
A friend and I are launching a site soon that will automatically convert the Wikipedia database into TomeRaider format, in many languages and for many platforms. The problem is that for the bigger Wikipedias, English for example, the database is 1.7GB, not counting a 10GB tarball of images, and the whole process takes 3 days on a decent computer. So yeah, I'm fixing that problem :idiot: Here's what I've got so far. Nothing magical, mind you; I'm really just starting with this stuff. I hope to make this a complete package that will download and install TomeRaider if you don't have it, and grab all of these utilities for you. I'm having some trouble talking AutoIt into downloading things.

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;;                                                                ;;
;;                                                                ;;
;;            Automate the processing of Wikipedia              ;;
;;          TomeRaider 3 files in multiple languages.            ;;
;;      Huzzah!  v.1  by Brian @ http://www.br1an.net              ;;
;;                                                                ;;
;;                                                                ;;
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

;;Wikipedia in the palm of your hand! If you run this script as is,
;;it will take DAYS to finish! Don't worry, some friends
;;and I are starting a service soon that will alleviate this :p  If you
;;speak Esperanto you're lucky - the entire process for just that language
;;takes about 30 minutes.

;The program will process an entire file from start to finish - 
;one at a time for now. I'll do more than one thing at once in the future.
;No big deal since we have an entire week in between database dumps. If
;you just want to run your language on your platform, skip down to the
;very end of this script and comment out everything but yours.

;This program assumes you have Bunzip2, wget, and WikiToTome.pl, and
;that the directories they are stored in are on the PATH environment variable.
;To edit PATH press Windows + Pause/Break > Advanced >
;Environment Variables > Path > Edit > put a semicolon at the very end >
;add the directories these tools can be found in.

;Bunzip2
;;;http://members.ams.chello.nl/epzachte/Wikipedia/bunzip2.exe
;Wget
;;;http://www.interlog.com/~tcharron/wgetwin-1_5_3_1-binary.zip
;WikiToTome
;;;http://eza.gemm.nl/Wikipedia/TomeRaider/WikiToTomeExe.zip
;You will need a Perl interpreter. I use ActivePerl
;;;http://downloads.activestate.com/ActivePerl/Windows/5.8/ActivePerl-5.8.6.811-MSWin32-x86-122208.msi

;let's define our variables
Dim $lang = 0;what language? we got six to choose from
             ;EN (English), DE (German), FR (French), PL (Polish), NL (Dutch), EO (Esperanto) 
Dim $plat = 0;are we on PALM or (P)PC? choose (P)PC to run on TR3 for Windows
dim $bz = "bunzip2 -dkv cur_table.sql.bz2";bunzip2 DOS command
dim $db = "C:\Wikipedia\";directory where our databases are stored
dim $dl = "http://download.wikimedia.org/archives/" 
dim $sql = "cur_table.sql.bz2"
dim $img = "NOIMG";no images for now. we'll get them soon.
dim $litmus = 2

;Many thanks to MHz for helping with this!  It finds out when TomeRaider is done processing.
Func CheckControl()
If not WinActive( "File Import", "" ) Then WinActivate( "File Import", "" )
WinWaitActive( "File Import", "" )
    $a = ControlGetText('File Import', '', 'Button19')
    If $a = 'Stop' Then
        Do
            Sleep(1000)
            $a = ControlGetText('File Import', '', 'Button19')
        Until $a = 'Compile'
        ProcessClose ( "TomeRaider.exe" )       
    EndIf
EndFunc

;This stops us from downloading the file if we already did today. 
;Wikipedia updates the database weekly. See http://download.wikimedia.org
Func download()
    $timeDl = FileGetTime($db & $lang & "\" & $sql , 1)
    If NOT IsArray($timeDl) OR $timeDl[2] <> @MDAY Then
    RunWait(@ComSpec & " /c " & "cd /d " & $db & $lang & " && wget " & $dl & $lang & "/" & $sql & " && " & $bz & " && ren *.sql cur_table_" & $lang & ".sql", "")
    EndIf
EndFunc

;main function. 
Func main()
download()

;perl conversion script written by Erik Zachte @ http://members.chello.nl/epzachte/Wikipedia
;converts the raw Wikipedia SQL dump into TomeRaider 3 format.
RunWait(@ComSpec & " /c " & "cd /d " & $db & $lang & " && wikitotome " & $lang & " " & $plat & " " & $img, "")

;start TomeRaider and tell it to get ready to import
Run( "C:\Program Files\TomeRaider3\TomeRaider.exe", "C:\Program Files\TomeRaider3\", @SW_MAXIMIZE )
WinWait( "TomeRaider", "Categories" )
If not WinActive( "TomeRaider", "Categories" ) Then WinActivate( "TomeRaider", "Categories" )
WinWaitActive ( "TomeRaider", "Categories" )
Send( "{ALTDOWN}{ALTUP}{DOWN}{DOWN}{DOWN}{ENTER}" ) 

;changing settings, pushing buttons
ControlClick("File Import", "", "Button3" )
ControlClick("File Import", "", "Button9" )
ControlClick("File Import", "", "Button5")
ControlClick("File Import", "", "Button12")
ControlSetText("File Import", "", "Edit1", "896");works, but TR3 doesn't seem to accept it. hrm.
ControlSetText("File Import", "", "Edit3", $db & $lang & "\WP_" & $lang & "_" & $plat & "_TXT.txt")
ControlClick("File Import", "", "Button19" );import!
CheckControl()
;I FTP my files off site when i'm finished. 
RunWait(@ComSpec & " /c " & "ftp -s:C:\Wikipedia\ftp\" & "ftp_" & $lang & "_" & $plat & ".txt", "")
EndFunc

;;Well, we haven't ACTUALLY done anything yet. Here we go!
;;Quick run on the Esperanto DB for PPC. It's the smallest at 9MB.bz2
$lang = "eo"
$plat = "(P)PC"
main()
;;Esperanto PALM
$plat = "PALM"
main()
;;;;;;;;;;;;;;;;;;;;;
;;Dutch PPC
$lang = "nl"
$plat = "(P)PC"
main()
;;Dutch PALM
$plat = "PALM"
main()
;;;;;;;;;;;;;;;;;;;;;
;;Polish PPC
$lang = "pl"
$plat = "(P)PC"
main()
;;Polish PALM
$plat = "PALM"
main()
;;;;;;;;;;;;;;;;;;;;;
;French PPC
$lang = "fr"
$plat = "(P)PC"
main()
;;French PALM
$plat = "PALM"
main()
;;;;;;;;;;;;;;;;;;;;
;English PPC
$lang = "en"
$plat = "(P)PC"
main()
;;English PALM
$plat = "PALM"
main()
;;;;;;;;;;;;;;;;;;;;
;German PPC
$lang = "de"
$plat = "(P)PC"
main()
;;German PALM
$plat = "PALM"
main()
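
CheckControl() polls Button19 once a second and only stops when the caption flips to 'Compile', so if TomeRaider crashes or its window is closed the script waits forever. Here is a minimal sketch of the same polling with a timeout. It assumes the window title and control ID used in the script; WaitForImport and its timeout parameter are made-up names for illustration, not part of the original.

```autoit
; Hypothetical helper: True once the import button reads "Compile",
; False if the window disappears or the timeout (in milliseconds) runs out.
Func WaitForImport($iTimeoutMs)
    Local $hTimer = TimerInit()
    While TimerDiff($hTimer) < $iTimeoutMs
        If Not WinExists("File Import") Then Return False ; window gone, nothing left to poll
        If ControlGetText("File Import", "", "Button19") = "Compile" Then Return True
        Sleep(1000) ; same one-second poll as CheckControl()
    WEnd
    Return False
EndFunc

; Give the import up to three days, the figure quoted for the English database.
If WaitForImport(3 * 24 * 60 * 60 * 1000) Then
    ProcessClose("TomeRaider.exe")
Else
    ConsoleWrite("TomeRaider import did not finish" & @CRLF)
EndIf
```

Returning a flag instead of closing the process inside the helper lets the caller decide whether to FTP the result or skip it.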
Edited by Alterego

Updated the download function to something that actually works (I didn't get to test it yesterday since the servers were kaput):

Func download()
$timeDl = FileGetTime( $db & "\" & $lang & "\" & $sql , 1 )
If NOT IsArray($timeDl) OR $timeDl[2] <> @MDAY Then 
    RunWait(@ComSpec & " /k " & "cd /d " & $db & "\" & $lang & " && wget " & $dl & "/" & $lang & "/" & $sql, "")
EndIf
EndFunc
Edited by Alterego

Updated your download function:

Func download()
    $timeDl = FileGetTime( $db & "\" & $lang & "\" & $sql , 1 )
    If NOT IsArray($timeDl) OR $timeDl[2] <> @MDAY Then
        URLDownloadToFile($dl & "/" & $lang & "/" & $sql, $db & "\" & $lang & "\" & $sql);the second argument has to be a file name, not a folder
    EndIf
EndFunc
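
One thing to watch: `$timeDl[2] <> @MDAY` only compares the day of the month, so a file fetched on the 5th of last month still looks fresh on the 5th of this month. A rough sketch of an age check on the full date follows. It assumes a current AutoIt v3 release, where InetGet does the job of URLDownloadToFile, and the standard Date.au3 include. IsStale and DownloadIfStale are hypothetical names, and the check uses the last-modified time rather than the creation time used above.

```autoit
#include <Date.au3>

; Hypothetical helper: True if $sPath is missing or was last modified $iMaxDays or more days ago.
Func IsStale($sPath, $iMaxDays)
    Local $s = FileGetTime($sPath, 0, 1) ; 0 = last-modified time, 1 = return a "YYYYMMDDHHMMSS" string
    If @error Then Return True ; no such file, so it has to be fetched
    ; _DateDiff expects "YYYY/MM/DD HH:MM:SS"
    Local $sWhen = StringMid($s, 1, 4) & "/" & StringMid($s, 5, 2) & "/" & StringMid($s, 7, 2) & _
            " " & StringMid($s, 9, 2) & ":" & StringMid($s, 11, 2) & ":" & StringMid($s, 13, 2)
    Return _DateDiff("D", $sWhen, _NowCalc()) >= $iMaxDays
EndFunc

; Hypothetical stand-in for download(); takes the script's globals as parameters.
Func DownloadIfStale($db, $dl, $lang, $sql)
    Local $sTarget = $db & $lang & "\" & $sql ; full file name, not just the folder
    If Not IsStale($sTarget, 7) Then Return True ; dumps are weekly, so under 7 days is current
    DirCreate($db & $lang)
    InetGet($dl & $lang & "/" & $sql, $sTarget) ; blocks until the transfer ends
    Return @error = 0
EndFunc

DownloadIfStale("C:\Wikipedia\", "http://download.wikimedia.org/archives/", "eo", "cur_table.sql.bz2")
```

The 7-day threshold comes from the weekly dump schedule mentioned in the script comments; adjust it if the dumps slip.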

  • 3 weeks later...

I completely rewrote this program. It's now a loop with 103 cycles and completely dynamic. Here is an .rtf version that's easier to read. Unfortunately I can't test it right now (Wikipedia is doing a major database conversion and hasn't updated the SQL dump in 20 days). I am really unsure whether I am using my StringSplits and arrays correctly. Could someone check those out for me? Specifically, is it OK for me to shove all those variables inside arrays? Also, any suggestions for making this more dynamic would be greatly appreciated. This thing is just going to have more and more asked of it in the future.

;Lookup tables. StringSplit returns a 1-based array with the element count in [0],
;so these only need to be built once, outside the loops.
Dim $dir = StringSplit('C:\Wikipedia\,C:\Wiktionary\,C:\Wikiquote\', ',');work directory
Dim $url = StringSplit('archives/,archives_wiktionary/,archives_wikiquote/', ',');url location
Dim $wikiFull = StringSplit('Wikipedia,Wiktionary,Wikiquote', ',');full names for renaming the files later
Dim $plat = StringSplit('(P)PC,PALM', ',');platform to run for
Dim $platFull = StringSplit('Pocket_PC,PALM', ',');full names for renaming the files later
Dim $lang = StringSplit('en,de,fr,pl,nl,eo', ',');languages
Dim $langFull = StringSplit('English,German,French,Polish,Dutch,Esperanto', ',');full names for renaming the files later
Dim $opFull = StringSplit('wget,bzip,wikitotome', ',');used for status reporting
Dim $types[4], $op[4];filled in per pass; [0] is unused so the indexes match the StringSplit arrays
Dim $work, $jack

For $cliff = 1 To $dir[0];wikipedia, wiktionary, wikiquote rotation
    For $alice = 1 To $plat[0];platform loop
        For $bob = 1 To $lang[0];languages loop
            $work = $dir[$cliff] & $lang[$bob] & '\';per-language work folder
            ;the different naming conventions to work with
            $types[1] = 'cur_table.sql.bz2'
            $types[2] = 'cur_table.sql'
            $types[3] = 'WP_' & $lang[$bob] & '_' & $plat[$alice] & '_TXT.txt'
            ;dos operations; the per-language URL folder is assumed from the first version of the script
            $op[1] = 'wget http://download.wikimedia.org/' & $url[$cliff] & $lang[$bob] & '/' & $types[1]
            $op[2] = 'bunzip2 -dkv ' & $types[1]
            $op[3] = 'wikitotome ' & $lang[$bob] & ' ' & $plat[$alice] & ' NOIMG'
            For $mike = 1 To 3;file stages
                $jack = FileGetTime($work & $types[$mike])
                ;$jack[2] is the day of the month, so ">= 8" is not a real "older than a week" test
                If Not IsArray($jack) Or $jack[2] >= 8 Then FileMove($work & $types[$mike], $dir[$cliff] & $types[$mike], 1)
                status()
                RunWait(@ComSpec & ' /c cd /d ' & $work & ' && ' & $op[$mike])
                status()
                FileSetTime($work & $types[$mike], @YEAR & @MON & @MDAY & @HOUR & @MIN & @SEC, 1)
            Next
            tr3();final processing stage, and the sole reason i have to do this on windows with autoit =)
        Next
    Next
Next

Func tr3();final stage
    
$file = FileOpen("C:\status.txt", 1);error reporting for tr3 phase
FileWrite($file, 'tr3 start' & ',' & $wikiFull[$cliff] & ',' & $langFull[$bob] &  ',' & @error & ',' & @WDAY & @HOUR & @MIN & @SEC & @CRLF);one status record per line
FileClose($file)

Run( "C:\Program Files\TomeRaider3\TomeRaider.exe", "C:\Program Files\TomeRaider3\", @SW_MAXIMIZE )
WinWait( "TomeRaider", "Categories" )
If not WinActive( "TomeRaider", "Categories" ) Then WinActivate( "TomeRaider", "Categories" )
WinWaitActive ( "TomeRaider", "Categories" )
Send( "{ALTDOWN}{ALTUP}{DOWN}{DOWN}{DOWN}{ENTER}" ) 

;changing settings, pushing buttons
ControlClick("File Import", "", "Button3" );segmentation fixed value radio dial
ControlClick("File Import", "", "Button9" );skip pictures check box.
ControlClick("File Import", "", "Button5");produce log file
ControlClick("File Import", "", "Button12");automatically sort unsorted entries radio dial
ControlSetText("File Import", "", "Edit1", "2000");2k segmentation blocks
ControlSetText("File Import", "", "Edit3", $dir[$cliff] & $lang[$bob] & '\WP_' & $lang[$bob] & '_' & $plat[$alice] & '_TXT.txt');file location box
ControlClick("File Import", "", "Button10");save
ControlClick("File Import", "", "Button19" );import

;following monitors tomeraider to find out when it's finished
$a = ControlGetText('File Import', '', 'Button19')
    If $a = 'Stop' Then
        Do          
            Sleep(1000)
            $a = ControlGetText('File Import', '', 'Button19')
        Until $a = 'Compile'
        ProcessClose ( "TomeRaider.exe" )   
    EndIf
            
;$mike is already past the end of $types once the stage loop has finished, so name the text file explicitly
FileSetTime($dir[$cliff] & $lang[$bob] & '\WP_' & $lang[$bob] & '_' & $plat[$alice] & '_TXT.txt', @YEAR & @MON & @MDAY & @HOUR & @MIN & @SEC, 1)
FileMove($dir[$cliff] & $lang[$bob] & '\WP_' & $lang[$bob] & '_' & $plat[$alice] & '_TXT.tr3', $dir[$cliff] & 'final\' & $wikiFull[$cliff] & '-' & $platFull[$alice] & '-' & $langFull[$bob] & '.tr3')
$file = FileOpen("C:\status.txt", 1);error reporting for tr3 phase
FileWrite($file, 'tr3 stop' & ',' & $wikiFull[$cliff] & ',' & $langFull[$bob] &  ',' & @error & ',' & @WDAY & @HOUR & @MIN & @SEC & @CRLF);one status record per line
FileClose($file)
EndFunc

Func status() ;used to calculate phase status, duration, and length in a web application - won't work for tr3()
$file = FileOpen("C:\status.txt", 1)
FileWrite($file, $opFull[$mike] & ',' & $wikiFull[$cliff] & ',' & $langFull[$bob] &  ',' & @error & ',' & @WDAY & @HOUR & @MIN & @SEC & @CRLF);one status record per line
FileClose($file)
EndFunc
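
On the question of putting the values into arrays: StringSplit is fine for that, as long as you remember that element [0] holds the count and the data starts at [1]. The brackets in a Dim, though, are dimension sizes rather than values, so a list of strings cannot go there. A small sketch of both forms, assuming a current AutoIt v3 release that supports array initializers; the variable names and sample values are only for illustration.

```autoit
; StringSplit returns a 1-based array; element [0] holds the number of items.
Local $aLang = StringSplit("en,de,fr,pl,nl,eo", ",")
For $i = 1 To $aLang[0]
    ConsoleWrite($i & ": " & $aLang[$i] & @CRLF)
Next

; A literal array is 0-based; the size goes in the brackets and the values after "=".
Local $sLang = "eo", $sPlat = "PALM" ; sample values for illustration
Local $aTypes[3] = ["cur_table.sql.bz2", "cur_table.sql", "WP_" & $sLang & "_" & $sPlat & "_TXT.txt"]
For $i = 0 To UBound($aTypes) - 1
    ConsoleWrite($i & ": " & $aTypes[$i] & @CRLF)
Next
```

Looping to `$aLang[0]` or `UBound()` instead of a hard-coded number means adding a language later is a one-line change.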