HTML Text Extraction and Replacement


I am relatively new to AutoIt and I want to know if anyone knows how to extract text from HTML files (lots of them), analyse the string and then replace it with an appropriate alternative?

Essentially I am trying to typeset translated text into English HTML files. The translations are supplied in MS Excel spreadsheets, with the translated text in a cell immediately to the right of the English text.

I want to do the following:

- Read through the HTMLs and extract the text from between the tags.

- Determine what the appropriate translation is by comparing the extracted text with the English source contained in the Excel sheet.

- Replace the original HTML text with the translation whilst retaining all the formatting.

If the script could automatically open, save and close files too that'd be even better.

Can anyone help?

Andy

Welcome to the forums!

This illustrates one approach that may work for the HTML file reading and modification part of your task. Play around with it and see if you can get it to work for you.

This example only handles one phrase and its conversion at a time. To loop through all of the phrases you want to convert, look at the _Excel* functions in the Help File to read your Excel doc, or convert it to a .csv file and read it in using _FileReadToArray.

#include <Array.au3>
#include <File.au3>

$varExampleStringToLookFor = "¿Cómo te llamas?"
$varExampleStringToConvertTo = "What is your name?"

$aFiles = _FileListToArray (@ScriptDir, "*.html")
If Not @error Then
    _ArrayDisplay ($aFiles) ; Just so you can see that we found files

    For $x = 1 To $aFiles[0]
        $var = FileRead (@ScriptDir & "\" & $aFiles[$x])
        $var = StringReplace ($var, $varExampleStringToLookFor, $varExampleStringToConvertTo)
        FileWrite(@ScriptDir & "\Modified-" & $aFiles[$x], $var)
    Next
Else
    MsgBox (0, "No HTML Files!", "There aren't any HTML files in this folder!")
EndIf
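
To extend the example above to every phrase, here is a minimal sketch. It assumes the spreadsheet has been exported to a tab-delimited text file; the file name phrases.txt and its "English<TAB>Translation" layout are hypothetical, so adjust them to match your export.

```autoit
#include <File.au3>

; Assumption: "phrases.txt" is a hypothetical tab-delimited export of the
; spreadsheet, with one "English<TAB>Translation" pair per line.
Local $aPairs
If Not _FileReadToArray(@ScriptDir & "\phrases.txt", $aPairs) Then
    MsgBox(0, "Error", "Could not read phrases.txt")
    Exit
EndIf

Local $aFiles = _FileListToArray(@ScriptDir, "*.html", 1) ; 1 = return files only
If @error Then
    MsgBox(0, "No HTML Files!", "There aren't any HTML files in this folder!")
    Exit
EndIf

For $x = 1 To $aFiles[0]
    Local $sHtml = FileRead(@ScriptDir & "\" & $aFiles[$x])
    For $i = 1 To $aPairs[0]
        Local $aPair = StringSplit($aPairs[$i], @TAB)
        ; Skip lines that do not contain both an English and a translated phrase
        If $aPair[0] < 2 Or $aPair[1] = "" Then ContinueLoop
        ; 0 = replace every occurrence, 1 = case-sensitive match
        $sHtml = StringReplace($sHtml, $aPair[1], $aPair[2], 0, 1)
    Next
    FileWrite(@ScriptDir & "\Modified-" & $aFiles[$x], $sHtml)
Next
```

Two caveats: a short phrase that also appears inside a longer one will be replaced there too, so it may help to sort the list so longer phrases come first. Also, FileWrite appends if the Modified- file already exists, so delete old output before re-running.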
Edited by exodius

Thanks for that! I've got it working to some degree, but one thing is causing trouble. I'm trying to use the '_clipboard*' functions to copy and paste from an ordered text file, but for some reason I can't get the text into the clipboard. When I paste the clipboard contents, it just adds a blank line. Any idea what I might be doing wrong?
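
Without seeing the script it is hard to say for certain, but one common cause of this symptom is expecting ClipGet() to paste. In AutoIt, ClipPut() places text on the clipboard and ClipGet() only returns the clipboard text to the script; pasting into another window still needs a keystroke such as Ctrl+V. A minimal sketch, where "Foo" is just placeholder text:

```autoit
; ClipPut() places text on the clipboard; ClipGet() only returns it to the script.
Local $sText = "Foo" ; placeholder text for illustration
If Not ClipPut($sText) Then
    MsgBox(0, "Clipboard", "ClipPut failed")
    Exit
EndIf

; Confirm what actually landed on the clipboard
ConsoleWrite("Clipboard now holds: " & ClipGet() & @CRLF)

; Paste into whichever control currently has keyboard focus
Send("^v")
```

Note that Send("^v") pastes into whatever window is active when the script runs, so make sure the target field has focus first.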


Here's my code - any help you can give me to get the clipboard function to work would be much appreciated.

Also, as this will inevitably be a very slow process when used for large amounts of text and documents (I'm talking 800 to 1000 lines of text and 50-60 documents), any advice you can give me to speed it up would be great!

Thanks!

#cs================================= Description ======================================================
This script is intended to identify the contents of a directory and list the .doc files to an array.
The .doc files are then opened one at a time (I've not included the save and close bits yet cos I'm still
struggling with the core functionality) and the Find/Replace dialog opened.
Two external text files (English Source & Foreign Translation) with ordered lists of phrases are read to
arrays, which are then used as the source for filling the Find/Replace fields.
This loops until all the source array items have been read and used - then the next .doc file is opened
and the process repeated.
Please be gentle when criticizing my code. I'm only a newbie and don't know any better!
=======================================================================================================
#ce

#Include <Clipboard.au3>
#include <Array.au3>
#Include <File.au3>

dim $i, $x, $ENG, $Tran, $Index_New, $Index_Orig, $Source, $Translation, $var, $To, $From, $File

$aFiles = _FileListToArray (@ScriptDir, "*.doc") ; Lists .doc files in script folder to array
$Source = FileOpenDialog("Select English Source", @DesktopDir, "(*.txt)") ; Locate English source text file
$Translation = FileOpenDialog("Select Translation Source", @DesktopDir, "(*.txt)") ; Locate Translated source text file

_FileReadToArray($Source, $ENG) ; Read English source file to array
_FileReadToArray($Translation, $Tran) ; Read translation source file to array

_ArrayDisplay($ENG) ; Display English Array
_ArrayDisplay($Tran) ; Display Translated Array

;~ For $x = 1 To 2
For $x = 1 To $aFiles[0] ; Repeat for each .doc file in folder
    ShellExecute($aFiles[$x]) ; open .doc file
    WinWait($aFiles & " - Microsoft Word","") ; wait for Word to load
    If Not WinActive($aFiles & " - Microsoft Word","") Then WinActivate($aFiles & " - Microsoft Word","") ; check if Word is active
    WinWaitActive($aFiles & " - Microsoft Word","") ; Wait until Word is active before continuing
    send("^h") ; open Find/Replace dialog
    WinWaitActive("Find and Replace","") ; wait for dialog to be active

    for $i = 1 to $ENG[0] ; repeat for all elements in source array
        If Not WinActive("Find and Replace","") Then WinWaitActive("Find and Replace","") ; check if Find/Replace is active
        WinWaitActive("Find and Replace","") ; wait for Find/Replace to be active before continuing
        $length1 = StringLen($From) ; Check length of English source string
        $Length2 = StringLen($To) ; Check length of Foreign source string

        if $length1 > 254 Then ; If English string >254 characters then ...
            filewriteline(@ScriptDir & "\Errors.txt", "Line No - " & $i & " - Too Long - " & $From) ; ...list string in error report file
        ElseIf $length1 < 254 Then ; If English string <254 ...
            If $Length2 > 254 Then ; ... but foreign string is > 254 then ...
                filewriteline(@ScriptDir & "\Errors.txt", "Line No - " & $i & " - Too Long - " & $From) ; ...list string in error report file
            Else ; Otherwise ...
                _ArrayToClip($ENG[$i]) ; ... copy contents of English Source Array line to clipboard ...
                sleep(200)
                send("!n") ; ... Select the 'Find' field in the Find / Replace Dialog ...
                sleep(100)
                ;~ send("Foo") ; (Test Expression)
                ClipGet() ; ... Paste clipboard contents into field
                sleep(200)
                _ArrayToClip($Tran[$i]) ; Then copy equivalent data from Translation Source Array ...
                Sleep(200)
                Send("!i") ; ... Select the 'Replace' field in the Find / Replace Dialog ...
                sleep(100)
                ;~ send("Bar") ; (Test Expression)
                ClipGet() ; ... Paste clipboard contents into field
                sleep(200)
                send("!f") ; Find text
                sleep(200)
                send("!r") ; Replace text
                sleep(1000)
                send("{ENTER}") ; Close 'Completed' dialog box
                sleep(500)
            EndIf
        EndIf
    Next
Next
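
On the speed question, one option worth trying is to skip the Find/Replace dialog and the clipboard altogether and drive Word through COM. The sketch below is an untested outline under several assumptions: the file names english.txt and translation.txt are hypothetical stand-ins for the two files chosen with FileOpenDialog, the constants are the numeric values of Word's wdFindContinue and wdReplaceAll, and the 254-character cut-off is simply carried over from the script above.

```autoit
#include <File.au3>

Local Const $WD_FIND_CONTINUE = 1 ; numeric value of Word's wdFindContinue
Local Const $WD_REPLACE_ALL = 2   ; numeric value of Word's wdReplaceAll

; Assumption: hypothetical file names; both lists hold the phrases in the same order.
Local $aEng, $aTran
If Not _FileReadToArray(@ScriptDir & "\english.txt", $aEng) Then Exit
If Not _FileReadToArray(@ScriptDir & "\translation.txt", $aTran) Then Exit
If $aEng[0] <> $aTran[0] Then
    MsgBox(0, "Error", "The two phrase lists have different line counts")
    Exit
EndIf

Local $aDocs = _FileListToArray(@ScriptDir, "*.doc", 1) ; 1 = return files only
If @error Then Exit

Local $oWord = ObjCreate("Word.Application")
If Not IsObj($oWord) Then Exit
$oWord.Visible = False ; no window to wait for or activate

For $x = 1 To $aDocs[0]
    Local $oDoc = $oWord.Documents.Open(@ScriptDir & "\" & $aDocs[$x])
    For $i = 1 To $aEng[0]
        If $aEng[$i] = "" Then ContinueLoop ; nothing to search for
        ; Same length cut-off as the original script: log and skip long strings
        If StringLen($aEng[$i]) > 254 Or StringLen($aTran[$i]) > 254 Then
            FileWriteLine(@ScriptDir & "\Errors.txt", $aDocs[$x] & " - Line No - " & $i & " - Too Long")
            ContinueLoop
        EndIf
        ; Arguments in order: FindText, MatchCase, MatchWholeWord, MatchWildcards,
        ; MatchSoundsLike, MatchAllWordForms, Forward, Wrap, Format, ReplaceWith, Replace
        $oDoc.Content.Find.Execute($aEng[$i], True, False, False, False, False, _
                True, $WD_FIND_CONTINUE, False, $aTran[$i], $WD_REPLACE_ALL)
    Next
    $oDoc.Save()
    $oDoc.Close()
Next
$oWord.Quit()
```

Because nothing is typed into a dialog, the Sleep calls and window waits are no longer needed, which is where most of the time goes in the GUI-driven version. Try it on copies of the documents first, since it saves each file in place.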
