Fetch English words only


Greetings,

I have a string containing English and non-English (Spanish, German, etc.) words.

Is there any way to extract only the English words and delete all the rest?

Please, help. I am stuck.

Thanks,


As soon as you come up with an unambiguous (I mean algorithmically unambiguous) definition of an English word vs. a non-English word, chime in again and we should be able to come up with a workable script.

Edited by jchd

This wonderful site allows debugging and testing regular expressions (many flavors available). An absolute must have in your bookmarks.
Another excellent RegExp tutorial. Don't forget downloading your copy of up-to-date pcretest.exe and pcregrep.exe here
RegExp tutorial: enough to get started
PCRE v8.33 regexp documentation latest available release and currently implemented in AutoIt beta.

SQLitespeed is another feature-rich premier SQLite manager (includes import/export). Well worth a try.
SQLite Expert (freeware Personal Edition or payware Pro version) is a very useful SQLite database manager.
An excellent eBook covering almost every aspect of SQLite3: a must-read for anyone doing serious work.
SQL tutorial (covers "generic" SQL, but most of it applies to SQLite as well)
A work-in-progress SQLite3 tutorial. Don't miss other LxyzTHW pages!
SQLite official website with full documentation (may be newer than the SQLite library that comes standard with AutoIt)


French/German/Italian/Spanish... words all contain "English" letters, so that's not a very valid criterion. By the way, they're considered Latin letters, not "English" letters.

If I posted any code, assume that code was written using the latest release version unless stated otherwise. Also, if it doesn't work on XP I can't help with that because I don't have access to XP, and I'm not going to.
Give a programmer the correct code and he can do his work for a day. Teach a programmer to debug and he can do his work for a lifetime - by Chirag Gude
How to ask questions the smart way!

I hereby grant any person the right to use any code I post, that I am the original author of, on the autoitscript.com forums, unless I've specifically stated otherwise in the code or the thread post. If you do use my code all I ask, as a courtesy, is to make note of where you got it from.

Back up and restore Windows user files _Array.au3 - Modified array functions that include support for 2D arrays.  -  ColorChooser - An add-on for SciTE that pops up a color dialog so you can select and paste a color code into a script.  -  Customizable Splashscreen GUI w/Progress Bar - Create a custom "splash screen" GUI with a progress bar and custom label.  -  _FileGetProperty - Retrieve the properties of a file  -  SciTE Toolbar - A toolbar demo for use with the SciTE editor  -  GUIRegisterMsg demo - Demo script to show how to use the Windows messages to interact with controls and your GUI.  -   Latin Square password generator


Je produis un parfait exemple de phrase utilisant des mots non anglais et en employant uniquement des lettres latines sans diacritiques.

Does this count as a valid sequence of English words?

Hint: Google translate that from French into English!


If your input is bilingual on a base-2 "basis", then the problem is entirely different.

Use step 2 with the For loop processing your input.


Make a text file containing every English word in a dictionary.

http://wordlist.sourceforge.net/

http://www.mieliestronk.com/wordlist.html

Load that text file into however many arrays you'll need to fit them.

Then compare your word list with the dictionary arrays, and if you don't have any results, dump the word.

Efficient? No.

Will it work? I have no clue.
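For what it's worth, the lookup described above might be sketched like this in AutoIt. It is only an illustration, not anyone's tested solution: _BuildWordSet and _FilterKnownWords are hypothetical helper names, the word list is assumed to be plain text with one word per line, and a Scripting.Dictionary COM object stands in for the "arrays" so that each lookup does not have to scan the whole list. Words that exist in several languages will still pass the filter.

```autoit
; Sketch only: _BuildWordSet and _FilterKnownWords are hypothetical helper names.

; Build a case-insensitive lookup table from word-list text (one word per line).
Func _BuildWordSet($sWordlist)
    Local $oWords = ObjCreate("Scripting.Dictionary")
    $oWords.CompareMode = 1 ; 1 = text compare, so lookups ignore case
    Local $aLines = StringSplit(StringStripCR($sWordlist), @LF, 2) ; 2 = no count element
    For $i = 0 To UBound($aLines) - 1
        Local $sWord = StringStripWS($aLines[$i], 3) ; 3 = trim leading and trailing whitespace
        If $sWord <> "" And Not $oWords.Exists($sWord) Then $oWords.Add($sWord, True)
    Next
    Return $oWords
EndFunc

; Return the tokens of $sText that appear in the word set, separated by spaces.
Func _FilterKnownWords($sText, $oWords)
    Local $aTokens = StringRegExp($sText, "[^\s.,!?;:]+", 3) ; 3 = array of all matches
    If @error Then Return "" ; no tokens at all
    Local $sKept = ""
    For $i = 0 To UBound($aTokens) - 1
        If $oWords.Exists($aTokens[$i]) Then $sKept &= $aTokens[$i] & " "
    Next
    Return StringTrimRight($sKept, 1) ; drop the trailing space
EndFunc

; Tiny inline word list for demonstration; a real run would pass FileRead() of the word-list file.
Local $oDemo = _BuildWordSet("this" & @CRLF & "is" & @CRLF & "some" & @CRLF & "string")
ConsoleWrite(_FilterKnownWords("This is some string enthält sowohl", $oDemo) & @CRLF)
```

Building the table once and reusing it keeps the per-word check cheap, which matters when the list holds a whole dictionary.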

Edited by PowerCat

PowerCat,

You seem to believe that the set of all possible words in a given language doesn't intersect with any other.

Not only does that not hold water, it also ignores that in some countries more than one language is widely used (easy examples: Canada, Belgium, ...).


Even if an English word is of French origin, it should still appear in an English dictionary.

I'm not quite sure I understand what you mean.

If a word is not found in an English dictionary, isn't it a word in another language?


Edited by PowerCat

But when a word is found in both French and English dictionaries, then what is it actually?

Anyway, the OP found that the sentences were alternating, so the point is moot.


#include <Array.au3> ; For _ArrayDisplay purposes only
$sFile = @ScriptDir & "\wordlist.txt" ; File attached
$sWordlist = FileRead($sFile)
If @error Then
    MsgBox(0, "Error", "Unable to read the word list.")
    Exit
EndIf
$sStr = "This is some string enthält sowohl englischen und deutschen wörtern et quelques mots français followed by a spellink mistake."
$aStr = StringRegExp($sStr, "\S+", 3) ; Change this to StringRegExp($sStr, "\S{2,}", 3) to ignore single-letter words like "a"
If Not @error Then
    _ArrayDisplay($aStr, "Full String")
    $sValid = ""
    For $i = 0 To UBound($aStr) - 1
        $aStr[$i] = StringRegExpReplace($aStr[$i], "[.!?,]", "") ; Just in case we pick up punctuation
        If $aStr[$i] = "" Then ContinueLoop ; The token was punctuation only
        ; \Q...\E makes the word match literally even if it contains regex metacharacters
        If StringRegExp($sWordlist, "(?i)(?m:^)\Q" & $aStr[$i] & "\E(?:\s|$)+", 0) Then $sValid &= $aStr[$i] & "|"
    Next
    $aStr = StringSplit(StringTrimRight($sValid, 1), "|", 2)
    _ArrayDisplay($aStr, "English Words")
EndIf

Someone will complain about me not allowing for punctuation in the initial array but it was done for a reason. One of those Bindar Dundat© things.

Edit:

You could also read the wordlist file into an SQLite database and then query that for the result.
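A rough sketch of that SQLite idea, using the SQLite.au3 UDF that ships with AutoIt. Treat it as an illustration under assumptions rather than tested code: _LoadWordsIntoDb and _IsKnownWord are hypothetical helper names, the table name and layout are invented for the example, the word list is assumed to hold one word per line, and sqlite3.dll must be available for _SQLite_Startup to succeed.

```autoit
#include <SQLite.au3>

; Sketch only: _LoadWordsIntoDb and _IsKnownWord are hypothetical helper names.

; Fill a "words" table from word-list text (one word per line).
Func _LoadWordsIntoDb($hDb, $sWordlist)
    _SQLite_Exec($hDb, "CREATE TABLE IF NOT EXISTS words (word TEXT PRIMARY KEY COLLATE NOCASE);")
    Local $aLines = StringSplit(StringStripCR($sWordlist), @LF, 2) ; 2 = no count element
    _SQLite_Exec($hDb, "BEGIN;") ; a single transaction keeps the bulk insert fast
    For $i = 0 To UBound($aLines) - 1
        Local $sWord = StringStripWS($aLines[$i], 3) ; 3 = trim leading and trailing whitespace
        If $sWord = "" Then ContinueLoop
        _SQLite_Exec($hDb, "INSERT OR IGNORE INTO words VALUES (" & _SQLite_FastEscape($sWord) & ");")
    Next
    _SQLite_Exec($hDb, "COMMIT;")
EndFunc

; Return True if $sWord is in the table (the NOCASE collation ignores ASCII case).
Func _IsKnownWord($hDb, $sWord)
    Local $aRow
    Local $iRet = _SQLite_QuerySingleRow($hDb, "SELECT count(*) FROM words WHERE word = " & _SQLite_FastEscape($sWord) & ";", $aRow)
    If $iRet <> $SQLITE_OK Then Return False
    Return Number($aRow[0]) > 0
EndFunc

_SQLite_Startup() ; loads sqlite3.dll, which must be available
If @error Then Exit MsgBox(16, "Error", "Could not load the SQLite DLL.")
Local $hDb = _SQLite_Open() ; no file name = in-memory database
; Tiny inline word list for demonstration; a real run would pass FileRead() of the word-list file.
_LoadWordsIntoDb($hDb, "this" & @CRLF & "is" & @CRLF & "some" & @CRLF & "string")
ConsoleWrite(_IsKnownWord($hDb, "String") & @CRLF)
ConsoleWrite(_IsKnownWord($hDb, "enthält") & @CRLF)
_SQLite_Close($hDb)
_SQLite_Shutdown()
```

Opening a database file instead of the in-memory one would mean the word list only has to be imported once.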

EDIT 2

The wordlist can be found here

Edited by GEOSoft

George

Question about decompiling code? Read the decompiling FAQ and don't bother posting the question in the forums.

Be sure to read and follow the forum rules. -AKA the AutoIt Reading and Comprehension Skills test.***

The PCRE (Regular Expression) ToolKit for AutoIT - (Updated Oct 20, 2011 ver:3.0.1.13) - Please update your current version before filing any bug reports. The installer now includes both 32 and 64 bit versions. No change in version number.

Visit my Blog .. currently not active but it will soon be resplendent with news and views. Also please remove any links you may have to my website. it is soon to be closed and replaced with something else.

"Old age and treachery will always overcome youth and skill!"


Hey guys, can't we use the _StringExplode function to break the string into separate words and look up each word in the English word collection? If it is there, save it to a temporary file, then show everything collected in the temporary file as the output. :unsure:

Common sense plays a role in the basics of understanding AutoIt... If you're lacking in that, do us all a favor, and step away from the computer.

Why? _StringExplode just returns an array of the contents of whatever string you send to it. That's easily done anyway. The trick is to get all the words into an array, which can be accomplished with

$aStr = StringRegExp($sStr, "\S+", 3)

Then you compare each of those to the text file that contains all the English words and put the matches into a string.

If StringRegExp($sWordlist, "(?i)(?m:^)" & $aStr[$i] & "(?:\s|$)+", 0) Then $sValid &= $aStr[$i] & "|"

After that you use StringSplit to change that string into an array if you want to.

George


Hey this is what I have made ! :> Check this out....I think this is what you were looking for ! Do reply about its usefulness and guys check for bugs and tell me please ! :unsure:

English Words Filter.rar


You still made it far more complicated than was required. Why would I want 26 files when I can do it in one? _StringExplode() is getting a little outdated; several other ways to skin that cat have appeared since it was written. Besides, why include that whole file for the sake of one function?

Speaking of outdated; the 1990s called and they want their file archiver back.

George


Speaking of outdated; the 1990s called and they want their file archiver back.

Agree. Compression of GEOSoft's file wordlist.txt:

Size on Disk Original (958,464 bytes):

.rar = 266,240 bytes

.zip = 245,760 bytes

.7z = 200,704 bytes

Even though 7-Zip compresses the most, why use a third-party compression utility when Windows can already handle .zip files? The only reason I have the others installed is to decompress archives from people who use them, although 7-Zip is my choice to handle them all.

The only thing I have found WinRAR useful for is automatically running a program after self-extracting, kind of like an installer.

Edited by rogue5099