Alterego Posted January 9, 2005 (edited)

A friend and I are launching a site soon that will automagically convert the Wikipedia database into TomeRaider format, in many languages and for many platforms. The problem is that for the bigger Wikipedias, English for example, the database is 1.7GB, not counting a 10GB tarball of images. That whole process takes 3 days on a decent computer. So yeah, I'm fixing that problem.

Here's what I've got so far. Nothing magical, mind you, but I'm really just starting with this stuff. I hope to make this a complete package that will download and install TomeRaider if you don't have it, and grab all of these utilities for you. I'm having some trouble talking AutoIt into downloading stuff.

    ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
    ;;                                                  ;;
    ;;  Automate the processing of Wikipedia            ;;
    ;;  TomeRaider 3 files in multiple languages.       ;;
    ;;  Huzzah! v.1 by Brian @ http://www.br1an.net     ;;
    ;;                                                  ;;
    ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

    ;;Wikipedia in the palm of your hand! If you run this script as is,
    ;;it will take DAYS to finish! Don't worry, some friends
    ;;and I are starting a service soon that will alleviate this :p If you
    ;;speak Esperanto you're lucky - the entire process for just that language
    ;;takes about 30 minutes

    ;The program will process an entire file from start to finish -
    ;one at a time for now. I'll do more than one thing at once in the future.
    ;No big deal since we have an entire week in between database dumps. If
    ;you just want to run your language on your platform, skip down to the
    ;very end of this script and comment out everything but you.

    ;This program assumes you have Bunzip2, wget, and WikiToTome.pl, and
    ;that the directories they are stored in are listed in the PATH environment variable.
    ;To edit PATH press window + pause break > advanced >
    ;environment variables > path > edit > put a semicolon at the very end >
    ;add the directories these tools can be found in.

    ;Bunzip2
    ;;;http://members.ams.chello.nl/epzachte/Wikipedia/bunzip2.exe
    ;Wget
    ;;;http://www.interlog.com/~tcharron/wgetwin-1_5_3_1-binary.zip
    ;WikiToTome
    ;;;http://eza.gemm.nl/Wikipedia/TomeRaider/WikiToTomeExe.zip
    ;you will need a Perl interpreter. I use ActivePerl
    ;;;http://downloads.activestate.com/ActivePerl/Windows/5.8/ActivePerl-5.8.6.811-MSWin32-x86-122208.msi

    ;let's define our variables
    Dim $lang = 0 ;what language? we got six to choose from
    ;EN (English), DE (German), FR (French), PL (Polish), NL (Dutch), EO (Esperanto)
    Dim $plat = 0 ;are we on PALM or (P)PC? choose (P)PC to run on TR3 for Windows
    Dim $bz = "bunzip2 -dkv cur_table.sql.bz2" ;bunzip2 DOS command
    Dim $db = "C:\Wikipedia\" ;directory where our databases are stored
    Dim $dl = "http://download.wikimedia.org/archives/"
    Dim $sql = "cur_table.sql.bz2"
    Dim $img = "NOIMG" ;no images for now. we'll get them soon.
    Dim $litmus = 2

    ;Many thanks to MHz for helping with this! It finds out when TomeRaider is done processing.
    Func CheckControl()
        If Not WinActive("File Import", "") Then WinActivate("File Import", "")
        WinWaitActive("File Import", "")
        $a = ControlGetText('File Import', '', 'Button19')
        If $a = 'Stop' Then
            Do
                Sleep(1000)
                $a = ControlGetText('File Import', '', 'Button19')
            Until $a = 'Compile'
            ProcessClose("TomeRaider.exe")
        EndIf
    EndFunc

    ;This stops us from downloading the file if we already did today.
    ;Wikipedia updates the database weekly. See http://download.mediawiki.org
    Func download()
        $timeDl = FileGetTime($db & $lang & "\" & $sql, 1)
        If Not IsArray($timeDl) Or $timeDl[2] <> @MDAY Then
            RunWait(@ComSpec & " /c " & "cd /d " & $db & $lang & " && wget " & $dl & $lang & "/" & $sql & " && " & $bz & " && ren *.sql cur_table_" & $lang & ".sql", "")
        EndIf
    EndFunc

    ;main function.
    Func main()
        download()
        ;perl conversion script written by Erik Zachte @ http://members.chello.nl/epzachte/Wikipedia
        ;converts the raw Wikipedia SQL dump into TomeRaider 3 format.
        RunWait(@ComSpec & " /c " & "cd /d " & $db & $lang & " && wikitotome " & $lang & " " & $plat & " " & $img, "")
        ;start TomeRaider and tell it to get ready to import
        Run("C:\Program Files\TomeRaider3\TomeRaider.exe", "C:\Program Files\TomeRaider3\", @SW_MAXIMIZE)
        WinWait("TomeRaider", "Categories")
        If Not WinActive("TomeRaider", "Categories") Then WinActivate("TomeRaider", "Categories")
        WinWaitActive("TomeRaider", "Categories")
        Send("{ALTDOWN}{ALTUP}{DOWN}{DOWN}{DOWN}{ENTER}")
        ;changing settings, pushing buttons
        ControlClick("File Import", "", "Button3")
        ControlClick("File Import", "", "Button9")
        ControlClick("File Import", "", "Button5")
        ControlClick("File Import", "", "Button12")
        ControlSetText("File Import", "", "Edit1", "896") ;works, but TR3 doesn't seem to accept it. hrm.
        ControlSetText("File Import", "", "Edit3", $db & $lang & "\WP_" & $lang & "_" & $plat & "_TXT.txt")
        ControlClick("File Import", "", "Button19") ;import!
        CheckControl()
        ;I FTP my files off site when I'm finished.
        RunWait(@ComSpec & " /c " & "ftp -s:C:\Wikipedia\ftp\" & "ftp_" & $lang & "_" & $plat & ".txt", "")
    EndFunc

    ;;Well, we haven't ACTUALLY done anything yet. Here we go!

    ;;Quick run on the Esperanto DB for PPC. It's the smallest at 9MB.bz2
    $lang = "eo"
    $plat = "(P)PC"
    main()
    ;;Esperanto PALM
    $plat = "PALM"
    main()
    ;;;;;;;;;;;;;;;;;;;;;
    ;;Dutch PPC
    $lang = "nl"
    $plat = "(P)PC"
    main()
    ;;Dutch PALM
    $plat = "PALM"
    main()
    ;;;;;;;;;;;;;;;;;;;;;
    ;;Polish PPC
    $lang = "pl"
    $plat = "(P)PC"
    main()
    ;;Polish PALM
    $plat = "PALM"
    main()
    ;;;;;;;;;;;;;;;;;;;;;
    ;;French PPC
    $lang = "fr"
    $plat = "(P)PC"
    main()
    ;;French PALM
    $plat = "PALM"
    main()
    ;;;;;;;;;;;;;;;;;;;;;
    ;;English PPC
    $lang = "en"
    $plat = "(P)PC"
    main()
    ;;English PALM
    $plat = "PALM"
    main()
    ;;;;;;;;;;;;;;;;;;;;;
    ;;German PPC
    $lang = "de"
    $plat = "(P)PC"
    main()
    ;;German PALM
    $plat = "PALM"
    main()

Edited January 10, 2005 by Alterego

This dynamic web page is powered by AutoIt 3.
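A note on CheckControl(): as posted, it polls the import button until its text reads 'Compile', so it would presumably wait forever if TomeRaider crashed or the dialog closed early. Below is a minimal sketch of the same polling with a timeout. The function name WaitForCompile and its $iTimeoutSec parameter are hypothetical, not part of the original script; the window title and 'Button19' are taken from the post above.

```autoit
; Hypothetical variant of the polling loop with a timeout (in seconds).
; Returns 1 once the button text reads 'Compile', 0 if the timeout expires.
Func WaitForCompile($iTimeoutSec)
    Local $hTimer = TimerInit() ; start a timer
    While TimerDiff($hTimer) < $iTimeoutSec * 1000 ; TimerDiff is in milliseconds
        If ControlGetText('File Import', '', 'Button19') = 'Compile' Then Return 1
        Sleep(1000) ; poll once per second, as CheckControl() does
    WEnd
    Return 0
EndFunc
```

A caller could then close TomeRaider only when the function returns 1, and log a failure otherwise, rather than blocking the whole multi-language run on one stuck import.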
Alterego (Author) Posted January 9, 2005 (edited)

Updated the download function to something that actually works (I didn't get to test it yesterday since the servers were kaput).

    Func download()
        $timeDl = FileGetTime($db & "\" & $lang & "\" & $sql, 1)
        If Not IsArray($timeDl) Or $timeDl[2] <> @MDAY Then
            RunWait(@ComSpec & " /k " & "cd /d " & $db & "\" & $lang & " && wget " & $dl & "/" & $lang & "/" & $sql, "")
        EndIf
    EndFunc

Edited January 9, 2005 by Alterego
SlimShady Posted January 9, 2005

Updated your download function:

    Func download()
        $timeDl = FileGetTime($db & "\" & $lang & "\" & $sql, 1)
        If Not IsArray($timeDl) Or $timeDl[2] <> @MDAY Then
            ;the second argument has to be the target file name, not just the directory
            URLDownloadToFile($dl & "/" & $lang & "/" & $sql, $db & "\" & $lang & "\" & $sql)
        EndIf
    EndFunc
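One caveat with the freshness test used in these download functions: it compares only the day of the month, so a file created on the 9th of last month would still look current on the 9th of this month. Here is a minimal sketch that compares year, month and day instead. The helper name NeedsDownload is hypothetical, and the path in the usage lines is only an example built from the $db, $lang and $sql values in the first post.

```autoit
; Hypothetical helper: returns 1 when $sFile is missing or was not
; created today, i.e. when a fresh download is needed; otherwise 0.
Func NeedsDownload($sFile)
    Local $aTime = FileGetTime($sFile, 1) ; 1 = creation time
    If Not IsArray($aTime) Then Return 1 ; file not found
    ; FileGetTime returns [0] = year, [1] = month, [2] = day, ...
    If $aTime[0] = @YEAR And $aTime[1] = @MON And $aTime[2] = @MDAY Then Return 0
    Return 1
EndFunc

; Example usage with an assumed path
If NeedsDownload("C:\Wikipedia\eo\cur_table.sql.bz2") Then
    ConsoleWrite("download needed" & @CRLF)
EndIf
```

Since the dumps are described as weekly, a check based on the file's age in days might fit even better, but that goes beyond what the posted code does.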
Alterego (Author) Posted January 25, 2005

I completely rewrote this program. It's now a loop with 103 cycles and is completely dynamic. Here is an .rtf version that's easier to read. Unfortunately I can't test it right now (Wikipedia is doing a major database conversion and hasn't updated the SQL dump in 20 days).

I am really unsure whether I am using my StringSplits and arrays correctly. Could someone check those out for me? Specifically, is it OK for me to shove all those variables inside arrays? Also, any suggestions for making this more dynamic would be greatly appreciated. This thing is just going to have more and more asked of it in the future.

    For $cliff = 1 To 3 ;wikipedia, wiktionary, wikiquote rotation
        Dim $dir = StringSplit('C:\Wikipedia\,C:\Wiktionary\,C:\Wikiquote\', ',') ;work directory
        Dim $url = StringSplit('archives/,archives_wiktionary/,archives_wikiquote/', ',') ;url location
        Dim $wikiFull = StringSplit('Wikipedia,Wiktionary,Wikiquote', ',') ;full names for renaming the files later
        For $alice = 1 To 2 ;platform loop
            Dim $plat = StringSplit('(P)PC,PALM', ',') ;platform to run for
            Dim $platFull = StringSplit('Pocket_PC,PALM', ',') ;full names for renaming the files later
            For $bob = 1 To 6 ;languages loop
                Dim $lang = StringSplit('en,de,fr,pl,nl,eo', ',') ;languages
                Dim $langFull = StringSplit('English,German,French,Polish,Dutch,Esperanto', ',') ;full names for renaming the files later
                For $mike = 1 To 3 ;file stages
                    ;the different naming conventions to work with
                    ;(index 0 is left unused so the indexes match the 1-based StringSplit arrays)
                    Dim $types[4]
                    $types[1] = 'cur_table.sql.bz2'
                    $types[2] = 'cur_table.sql'
                    $types[3] = 'WP_' & $lang[$bob] & '_' & $plat[$alice] & '_TXT.txt'
                    ;dos operations
                    Dim $op[4]
                    $op[1] = ' & wget http://download.wikimedia.org/' & $url[$cliff] & $types[1]
                    $op[2] = ' & bunzip2 -dkv ' & $types[1]
                    $op[3] = ' & wikitotome ' & $lang[$bob] & ' ' & $plat[$alice] & ' NOIMG'
                    Dim $opFull = StringSplit('wget,bzip,wikitotome', ',') ;used for status reporting
                    Dim $jack = FileGetTime($dir[$cliff] & $lang[$bob] & '\' & $types[$mike])
                    If Not IsArray($jack) Or $jack[2] >= 8 Then FileMove($dir[$cliff] & $lang[$bob] & '\' & $types[$mike], $dir[$cliff] & $types[$mike], 1)
                    status()
                    RunWait(@ComSpec & ' /c ' & 'cd /d ' & $dir[$cliff] & $lang[$bob] & $op[$mike])
                    status()
                    ;FileSetTime expects YYYYMMDDHHMMSS, so use @MDAY (day of month), not @WDAY
                    FileSetTime($dir[$cliff] & $lang[$bob] & '\' & $types[$mike], @YEAR & @MON & @MDAY & @HOUR & @MIN & @SEC, 1)
                Next
                tr3() ;final processing stage, and the sole reason I have to do this on Windows with AutoIt =)
            Next
        Next
    Next

    Func tr3() ;final stage
        $file = FileOpen("C:\status.txt", 1) ;error reporting for tr3 phase
        FileWrite($file, 'tr3 start' & ',' & $wikiFull[$cliff] & ',' & $langFull[$bob] & ',' & @error & ',' & @WDAY & @HOUR & @MIN & @SEC & @CRLF)
        FileClose($file)
        Run('C:\Program Files\TomeRaider3\TomeRaider.exe', 'C:\Program Files\TomeRaider3\', @SW_MAXIMIZE)
        WinWait("TomeRaider", "Categories")
        If Not WinActive("TomeRaider", "Categories") Then WinActivate("TomeRaider", "Categories")
        WinWaitActive("TomeRaider", "Categories")
        Send("{ALTDOWN}{ALTUP}{DOWN}{DOWN}{DOWN}{ENTER}")
        ;changing settings, pushing buttons
        ControlClick("File Import", "", "Button3") ;segmentation fixed value radio dial
        ControlClick("File Import", "", "Button9") ;skip pictures check box.
        ControlClick("File Import", "", "Button5") ;produce log file
        ControlClick("File Import", "", "Button12") ;automatically sort unsorted entries radio dial
        ControlSetText("File Import", "", "Edit1", "2000") ;2k segmentation blocks
        ControlSetText("File Import", "", "Edit3", $dir[$cliff] & $lang[$bob] & '\WP_' & $lang[$bob] & '_' & $plat[$alice] & '_TXT.txt') ;file location box
        ControlClick("File Import", "", "Button10") ;save
        ControlClick("File Import", "", "Button19") ;import
        ;following monitors tomeraider to find out when it's finished
        $a = ControlGetText('File Import', '', 'Button19')
        If $a = 'Stop' Then
            Do
                Sleep(1000)
                $a = ControlGetText('File Import', '', 'Button19')
            Until $a = 'Compile'
            ProcessClose("TomeRaider.exe")
        EndIf
        ;$mike is 4 once the file-stage loop has finished, so name the TXT stage explicitly
        FileSetTime($dir[$cliff] & $lang[$bob] & '\' & $types[3], @YEAR & @MON & @MDAY & @HOUR & @MIN & @SEC, 1)
        FileMove($dir[$cliff] & $lang[$bob] & '\WP_' & $lang[$bob] & '_' & $plat[$alice] & '_TXT.tr3', $dir[$cliff] & 'final\' & $wikiFull[$cliff] & '-' & $platFull[$alice] & '-' & $langFull[$bob] & '.tr3')
        $file = FileOpen("C:\status.txt", 1) ;error reporting for tr3 phase
        FileWrite($file, 'tr3 stop' & ',' & $wikiFull[$cliff] & ',' & $langFull[$bob] & ',' & @error & ',' & @WDAY & @HOUR & @MIN & @SEC & @CRLF)
        FileClose($file)
    EndFunc

    Func status()
        ;used to calculate phase status, duration, and length in a web application - won't work for tr3()
        $file = FileOpen("C:\status.txt", 1)
        FileWrite($file, $opFull[$mike] & ',' & $wikiFull[$cliff] & ',' & $langFull[$bob] & ',' & @error & ',' & @WDAY & @HOUR & @MIN & @SEC & @CRLF)
        FileClose($file)
    EndFunc
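On the StringSplit question raised in this post: StringSplit returns a 1-based array whose element [0] holds the number of items, so looping from 1 works, and the fixed lists can be split once rather than on every pass. The sketch below is illustrative only; it reuses the $lang and $langFull lists from the post, and the loop variable $i is an assumption.

```autoit
; Split the fixed lists once, before any loop.
; StringSplit puts the item count in element [0]; the items start at [1].
Dim $lang = StringSplit('en,de,fr,pl,nl,eo', ',')
Dim $langFull = StringSplit('English,German,French,Polish,Dutch,Esperanto', ',')

; Loop to $lang[0] instead of a hard-coded 6, so adding a language
; only means extending the two strings above.
For $i = 1 To $lang[0]
    ConsoleWrite($lang[$i] & ' = ' & $langFull[$i] & @CRLF)
Next
```

Values that depend on the loop variables, such as the per-language file names, cannot come from a single StringSplit call; for those, an array declared with a fixed size and filled element by element is the usual approach.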