alanstone

using AI3 to drive the RegExp search/replace feature of TextPad

alanstone

I managed to code the MS Word *.doc -> *.txt (plain text) conversion part of my application. Now I need to:

- replace, amongst others, ascii character 156 with oe

(re. http://www.killersites.com/webDesignersHandbook/ascii_page3.htm )

- strip all leading blanks and tabs

- delete all empty lines

and thought of using TextPad's RegExp search/replace for that.

How do you do that ?

Thanks in advance,

Alan

WXP Home SP3

AI v3.3.4.0

jchd

Welcome to the AutoIt forum.

Why do you want to use regexp from another executable while AutoIt has its own v8.0 PCRE routines? StringTrimWS, StringReplace and StringRegexpReplace can do a lot for you, without leaving your AutoIt home.

AutoIt help file is your best friend.
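As a rough illustration of the native route suggested above, here is a minimal sketch. The sample string is made up, and the flag values are the ones the help file lists for StringStripWS (1 = leading, 2 = trailing) and for StringReplace; check them against your installed version.

```autoit
; Sketch only: the sample text is hypothetical and built in memory instead of read from a file.
Local $sText = "   c" & ChrW(0x0153) & "ur   " & @CRLF & @TAB & "second line   "

; StringStripWS works on the string as a whole: 1 + 2 = strip leading and trailing white space.
Local $sTrimmed = StringStripWS($sText, 3)

; StringReplace performs a literal (non-regexp) substitution.
Local $sNoLigature = StringReplace($sTrimmed, ChrW(0x0153), "oe")

; StringRegExpReplace handles the per-line work: (?m) lets ^ match at the start of every line.
Local $sResult = StringRegExpReplace($sNoLigature, "(?m)^\h+", "")

ConsoleWrite($sResult & @CRLF)
```

StringStripWS only touches the two ends of the whole string, which is why the per-line stripping is left to the regular expression.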


This wonderful site allows debugging and testing regular expressions (many flavors available). An absolute must have in your bookmarks.
Another excellent RegExp tutorial. Don't forget downloading your copy of up-to-date pcretest.exe and pcregrep.exe here
RegExp tutorial: enough to get started
PCRE v8.33 regexp documentation latest available release and currently implemented in AutoIt beta.

SQLitespeed is another feature-rich premier SQLite manager (includes import/export). Well worth a try.
SQLite Expert (freeware Personal Edition or payware Pro version) is a very useful SQLite database manager.
An excellent eBook covering almost every aspect of SQLite3: a must-read for anyone doing serious work.
SQL tutorial (covers "generic" SQL, but most of it applies to SQLite as well)
A work-in-progress SQLite3 tutorial. Don't miss other LxyzTHW pages!
SQLite official website with full documentation (may be newer than the SQLite library that comes standard with AutoIt)

Malkey

This example is only using the StringRegExpReplace function. But as jchd suggested, other appropriate AutoIt string functions could be used.

Except for StringTrimWS, it should be StringStripWS.

Local $sFileName = "ExDoc.txt"
Local $sStr = FileRead($sFileName)

Local $sStrModified = StringRegExpReplace($sStr, "\x9c", "oe")                      ; replace ascii character 156 with oe.
$sStrModified = StringRegExpReplace($sStrModified, "(?s)^(\x09+)|(\h+)|(\v+\z)", ""); strip all leading tabs and blanks and last @CRLF.
$sStrModified = StringRegExpReplace($sStrModified, "\v+", @CRLF)                    ; delete all empty lines.

ConsoleWrite($sStrModified & @CRLF)

#cs ; Comment this line out to save file.
Local $hFile = FileOpen($sFileName, 2) ; 2 = write mode, erases the previous contents
FileWrite($hFile, $sStrModified)
FileClose($hFile)
#ce ; And comment this line out to save file.

jchd

Except for StringTrimWS, it should be StringStripWS.

Oops, but after all it was the result of all those useful String* functions getting out of the help file by themselves all at once, jumping on the keyboard and shouting together "Tell him I can do the job, or at least a good part of it!".

Thanks for your auto-completion correction!


alanstone

Thanks for your kind reply.

$sStrModified = StringRegExpReplace($sStr, "\x9c", "oe") ; replace ascii character 156 with oe.

unfortunately, this doesn't work

$sStrModified = StringRegExpReplace($sStrModified, "(?s)^(\x09+)|(\h+)|(\v+\z)", ""); strip all leading tabs and blanks and last @CRLF.

this deletes ALL blanks, even spaces between words

$sStrModified = StringRegExpReplace($sStrModified, "\v+", @CRLF); delete all empty lines

this does nothing at all

test file in attachment

regex_test.txt

Malkey

Maybe this will work.

Local $sFileName = "regex_test.txt"
Local $sStr = FileRead($sFileName)

Local $sStrModified = StringReplace($sStr, Chr(156), "oe") ; replace ascii character 156 with oe.
$sStrModified = StringRegExpReplace($sStrModified, "(?m)(^\x09+)|(^\h+)|(\v+\z)", ""); strip all leading tabs and blanks and last @CRLF.
$sStrModified = StringRegExpReplace($sStrModified, "\v+", @CRLF) ; delete all empty lines.
MsgBox(0, "", $sStrModified)

#cs ; Comment this line out to save file.
    Local $hFile = FileOpen($sFileName, 2) ; 2 = write mode, erases the previous contents
    FileWrite($hFile, $sStrModified)
    FileClose($hFile)
#ce ; And comment this line out to save file.
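For anyone wondering why the earlier pattern removed every blank: in a regular expression the `|` alternation has the lowest precedence, so the `^` in `^(\x09+)|(\h+)` anchors only the first alternative and `(\h+)` is free to match anywhere. A small sketch with a made-up sample string shows the difference:

```autoit
; Sketch only: the sample text is hypothetical.
Local $sSample = @TAB & "  first line" & @CRLF & "   second line with  inner  spaces"

; Earlier pattern: only the tab alternative is anchored, (\h+) matches white space anywhere.
Local $sUnanchored = StringRegExpReplace($sSample, "(?s)^(\x09+)|(\h+)|(\v+\z)", "")

; Anchored version: (?m) makes ^ match at each line start, so only leading white space goes.
Local $sAnchored = StringRegExpReplace($sSample, "(?m)^\h+", "")

ConsoleWrite("Unanchored:" & @CRLF & $sUnanchored & @CRLF & "----" & @CRLF)
ConsoleWrite("Anchored:" & @CRLF & $sAnchored & @CRLF)
```

Since `\h` already covers the tab character, the separate `\x09` alternative is not needed in the anchored version.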

alanstone

That's much better indeed.

How do you replace multiple blank lines with one blank line ?

jchd

How do you replace multiple blank lines with one blank line ?

Try this. I saw that Malkey (Hi!) was answering while I was preparing this one.

;; Warning, this is an UTF-8 + BOM file!

;; make it easier to test
Local $sStr = " Cependant il y a toujours un grand fossé entre les sons enregistrés par les instruments pointus et la limite où l'esprit de l'homme, raisonnant par analogie, place la frontière entre les niveaux de sons et d'autres formes de vibrations. " & @CRLF & _
"   " & @CRLF & _
"   Il y a des niveaux de lumières que l'œil humain ne voit pas, dont certains peuvent être détectés par des instruments plus pointus, mais il y a beaucoup d'autres niveaux si fins qu'aucun instrument n'a encore été inventé pour les détecter, bien que des progrès soient faits chaque année et que le champ de l'inconnu se réduise petit à petit. " & @CRLF & _
"       " & @CRLF & _
"   Alors que de nouveaux instruments sont inventés, de nouvelles vibrations sont enregistrées par ces instruments - et pourtant ces vibrations étaient autant réelles avant l'invention de l'instrument qu'après. " & @CRLF & _
"" & @CRLF & _
" " & ChrW(0x0153) & ChrW(0x0152) & ChrW(0x00e6) & ChrW(0x00c6) & ChrW(0xfb00) & ChrW(0xfb01) & ChrW(0xfb02) & ChrW(0xfb03) & ChrW(0xfb04) & @CRLF & _
"" & @CRLF & _
"" & @CRLF & _
""

;~ Local $sFileName = "ExDoc.txt"
;~ Local $sStr = FileRead($sFileName)       ; using the current version, FileRead will have auto-detected that the file was ANSI
;~                                          ; and convert its contents into the character set used by AutoIt, which is Unicode.

; now deal with ligatures, at least those in usage in Sarkozie
$sStr = StringReplace($sStr, ChrW(0x0153), "oe", 0, 1)          ; replace _Unicode_ character 0x0153 with oe.
$sStr = StringReplace($sStr, ChrW(0x0152), "OE", 0, 1)          ; don't forget uppercase version! 0x0152 --> 'OE'
$sStr = StringReplace($sStr, ChrW(0x00E6), "ae", 0, 1)          ; do the same for 0xE6 --> 'ae'
$sStr = StringReplace($sStr, ChrW(0x00C6), "AE", 0, 1)          ; don't forget uppercase version! 0xC6 --> 'AE'
$sStr = StringReplace($sStr, ChrW(0xFB00), "ff")            ; do the same for 0xFB00 --> 'ff'
$sStr = StringReplace($sStr, ChrW(0xFB01), "fi")            ; do the same for 0xFB01 --> 'fi'
$sStr = StringReplace($sStr, ChrW(0xFB02), "fl")            ; do the same for 0xFB02 --> 'fl'
$sStr = StringReplace($sStr, ChrW(0xFB03), "ffi")           ; do the same for 0xFB03 --> 'ffi'
$sStr = StringReplace($sStr, ChrW(0xFB04), "ffl")           ; do the same for 0xFB04 --> 'ffl'

$sStr = StringRegExpReplace($sStr, "(?m)(\A\h+)|(^\h+)|(\h+$)|(\h+\z)|(\s+\z)", "") ; strip leading and trailing whitespaces
$sStr = StringRegExpReplace($sStr, "\h{2,}", " ")                           ; collapse runs of horizontal whitespace into one space
$sStr = StringReplace($sStr, @CRLF & @CRLF, @CRLF)                          ; delete all empty lines.
$sStr = StringReplace($sStr, @CR & @CR, @CR)                                ; whatever the control char used
$sStr = StringReplace($sStr, @LF & @LF, @LF)                                ; even those ones

ConsoleWrite($sStr)
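The run of StringReplace calls above could also be driven from a table. This is only a sketch of that idea, not a replacement for the code above: the array name and the sample string are made up, and the code points are the same nine used in the post.

```autoit
; Sketch only: $aLigatures and the sample text are hypothetical names and data.
; Each row holds a Unicode code point and its plain-letter replacement.
Local $aLigatures[9][2] = [[0x0153, "oe"], [0x0152, "OE"], [0x00E6, "ae"], [0x00C6, "AE"], _
        [0xFB00, "ff"], [0xFB01, "fi"], [0xFB02, "fl"], [0xFB03, "ffi"], [0xFB04, "ffl"]]

Local $sStr = "c" & ChrW(0x0153) & "ur, " & ChrW(0x0152) & "uvre, " & ChrW(0xFB01) & "n"

For $i = 0 To UBound($aLigatures) - 1
    ; Last parameter 1 = case-sensitive, so the lowercase and uppercase ligatures stay distinct.
    $sStr = StringReplace($sStr, ChrW($aLigatures[$i][0]), $aLigatures[$i][1], 0, 1)
Next

ConsoleWrite($sStr & @CRLF)
```

Adding another ligature then means adding one row and adjusting the first array dimension.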

Is there a good reason you get the data to file and not directly from Word?


alanstone

Is there a good reason you get the data to file and not directly from Word?

Thanks for the code, I'll try it out - as well as to understand it :D

The good reason: storage and archiving.

alanstone

;; Warning, this is an UTF-8 + BOM file!
Excuse my ignorance, do you mean it's better to convert *.doc -> *.txt UTF-8, instead of *.txt ANSI?

jchd

Excuse me my ignorance, do you mean it's better to convert *.doc -> *.txt UTF-8, instead of *.txt ANSI ?

Certainement mon cher !

Oops, this is an English forum!

Yes, of course it would be much simpler to open the Word doc and grab all or part of its contents into a variable. Both applications are Unicode, so there is no conversion issue at this step.
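A minimal sketch of grabbing the text straight from Word through COM, assuming Word is installed on the machine. The document path is hypothetical, error handling is kept to a bare minimum, and the members used (Documents.Open, Content.Text, Close, Quit) belong to Word's own object model, not to AutoIt.

```autoit
; Sketch only: the path is hypothetical and Word must be installed.
Local $sDocPath = @ScriptDir & "\sample.doc"

Local $oWord = ObjCreate("Word.Application")
If Not IsObj($oWord) Then
    ConsoleWrite("Word does not seem to be available." & @CRLF)
    Exit
EndIf

; Documents.Open(FileName, ConfirmConversions, ReadOnly): open read-only without prompting.
Local $oDoc = $oWord.Documents.Open($sDocPath, False, True)
Local $sText = $oDoc.Content.Text ; the whole document body as a Unicode string

$oDoc.Close(0) ; 0 = wdDoNotSaveChanges
$oWord.Quit()

ConsoleWrite(StringLen($sText) & " characters read" & @CRLF)
```

Word separates paragraphs with a lone carriage return in this text, so any later line-based clean-up should allow for @CR as well as @CRLF.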

You may need to switch the thing to ANSI for storage into a simple text file, but then it's possible to specify which encoding the text file should use (see options of FileOpen). If your text uses only a subset of Unicode that your user locale knows how to convert into the user ANSI codepage in force, then things should be correct. Double check this for ligatures, I'm not 100% certain they get converted correctly. If not, you've been shown how to deal with that.

We could possibly help you better if you explained your actual needs and constraints a bit more. That would let us focus on the simplest and most efficient way to achieve your aims.
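To illustrate the FileOpen encoding options mentioned above, here is a minimal sketch that writes a string out as UTF-8 with a BOM. The output path and sample text are made up; the mode values (2 = overwrite, 128 = UTF-8 with BOM) are the ones given in the FileOpen help page, so verify them for your AutoIt version.

```autoit
; Sketch only: the output path and the text are hypothetical.
Local $sOutFile = @ScriptDir & "\converted.txt"
Local $sText = "c" & ChrW(0x0153) & "ur" & @CRLF

; 2 = write mode (erase previous contents), 128 = UTF-8 with BOM.
Local $hFile = FileOpen($sOutFile, 2 + 128)
If $hFile = -1 Then
    ConsoleWrite("Could not open " & $sOutFile & @CRLF)
Else
    FileWrite($hFile, $sText)
    FileClose($hFile)
EndIf
```

Writing UTF-8 avoids depending on what the user's ANSI codepage can represent, which is the concern raised above about ligatures.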


GEOSoft

How do you replace multiple blank lines with one blank line ?

$sStr = StringRegExpReplace($sStr, "\v{3,}", @CRLF)

George

Question about decompiling code? Read the decompiling FAQ and don't bother posting the question in the forums.

Be sure to read and follow the forum rules. -AKA the AutoIt Reading and Comprehension Skills test.***

The PCRE (Regular Expression) ToolKit for AutoIT - (Updated Oct 20, 2011 ver:3.0.1.13) - Please update your current version before filing any bug reports. The installer now includes both 32 and 64 bit versions. No change in version number.

Visit my Blog .. currently not active but it will soon be resplendent with news and views. Also please remove any links you may have to my website. it is soon to be closed and replaced with something else.

"Old age and treachery will always overcome youth and skill!"

Malkey

How do you replace multiple blank lines with one blank line ?

What are you calling a blank line?

@CRLF is commonly used in Windows to end a line and start another line. It is made up of two vertical white spaces, the carriage return (@CR, chr(13), chr(0x0D), \r) and the linefeed (@LF, chr(10), chr(0x0A), \n).

These produce a single blank line, "@LF & @LF", or "@CRLF & @LF", or "@CRLF & @CRLF", which total two, three or four vertical white spaces, "\v", respectively.

Referring to this line in the script:-

$sStrModified = StringRegExpReplace($sStrModified, "\v+", @CRLF) ; delete all empty lines.

The regular expression pattern is "\v+".

From the help file under the function reference to StringRegExp:-

"\v" is used to signify any vertical white-space character, eg. CR or LF (similar to a horizontal white-space that is made by the space bar, only vertically).

"+" Repeat the previous character, set or group one or more times. Equivalent to {1,}

So this StringRegExpReplace function matches @CR, or @LF, or @CRLF, or "@CRLF & @CRLF & @CRLF & @CRLF", and replaces any of those matches with @CRLF.

"@LF & @LF" is rarely used in Windows to produce a blank line. As such, GEOSoft's example should work 99.9% of the time.

;"@LF & @LF", or "@CRLF & @LF", or "@CRLF & @CRLF" produce a single blank line. 
MsgBox(0, "Blank lines", "a" & @LF & @LF & "b" & @CRLF & _
                        "c" & @CRLF & @LF & "d" & @CRLF & _
                        "e" & @CRLF & @CRLF & "f")
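The character counts above can be checked with a short sketch that counts `\v` matches in the same three combinations. Flag 3 of StringRegExp returns an array of all matches; the variable names are made up for the illustration.

```autoit
; Sketch only: the three samples mirror the combinations listed above.
Local $aSamples[3]
$aSamples[0] = "a" & @LF & @LF & "b"
$aSamples[1] = "c" & @CRLF & @LF & "d"
$aSamples[2] = "e" & @CRLF & @CRLF & "f"

For $i = 0 To UBound($aSamples) - 1
    ; Flag 3 = return an array holding every match of the pattern.
    Local $aHits = StringRegExp($aSamples[$i], "\v", 3)
    ConsoleWrite("Sample " & $i & ": " & UBound($aHits) & " vertical white-space characters" & @CRLF)
Next
```

Because `\v` counts @CR and @LF separately, a quantifier such as `\v{3,}` means different numbers of lines depending on the line-ending style of the file.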

GEOSoft

$sStrModified = StringRegExpReplace($sStrModified, "\v+", @CRLF) ; delete all empty lines.


That will remove the @CRLF at the end of every line too, and he wants to replace multiple empty lines with a single empty line.

I showed it as \v{3,} to avoid that. It was from memory, but it runs in my mind that I actually had to adjust that to 4 or 6 to get what I wanted.
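One way to avoid counting individual CR and LF characters is PCRE's `\R`, which matches a complete line break (@CRLF, @CR or @LF) as a single unit. The following is a sketch only, assuming the PCRE build in your AutoIt version supports `\R` and `\h`; the sample text is made up.

```autoit
; Sketch only: hypothetical sample with three empty lines (one holding blanks) and then one empty line.
Local $sStr = "first" & @CRLF & @CRLF & @CRLF & "   " & @CRLF & "second" & @CRLF & @CRLF & "third" & @CRLF & "fourth"

; \R ends the current line; (?:\h*\R){2,} then requires at least two more empty
; (or blank-only) lines. The whole run is replaced by exactly one empty line.
$sStr = StringRegExpReplace($sStr, "\R(?:\h*\R){2,}", @CRLF & @CRLF)

ConsoleWrite($sStr & @CRLF)
```

A single existing blank line does not reach the `{2,}` threshold, so it is left alone. Note that the replacement normalises the matched line breaks to @CRLF.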


George

