face

extract words from text files

12 posts in this topic

#1 ·  Posted (edited)

I have an AutoIt program that extracts the words from all text files in a folder and saves them to a word-list text file.

I need to add an ignore option, i.e. a blacklist of words or single characters. I'm also not sure whether the program detects word boundaries and spacing in Chinese text; it has to, so that it doesn't extract entire phrases.

Here's the code:

#include <File.au3>
#include <Array.au3>
#include <MsgBoxConstants.au3>

Local $oDictionary = ObjCreate("Scripting.Dictionary")
Local $mypath = @ScriptDir
Local $aFiles = _FileListToArray($mypath, "*.txt", 1, 1)
If @error Then
    MsgBox($MB_SYSTEMMODAL, "Error", "No files found")
    Exit
Else
    MsgBox($MB_SYSTEMMODAL, "Found", $aFiles[0] & " files")
EndIf

Local $aWords
For $i = 1 To $aFiles[0]
    $aWords = StringRegExp(FileRead($aFiles[$i]), "[^\s]+", 3)      ; change pattern to fit your definition of "word"
    Local $iError = @error
    If $iError = 0 Then
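        For $Word In $aWords
            ; Add each word only once: .Add raises a COM error if the key already exists
            If Not $oDictionary.Exists($Word) Then $oDictionary.Add($Word, $Word)
        Next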
    Else
        MsgBox($MB_SYSTEMMODAL, "Error", $aFiles[$i] & " - " & $i & @CRLF & "error: " & $iError)
    EndIf
Next

$aWords = $oDictionary.Items
FileWrite("saved/words.txt", _ArrayToString($aWords, @CRLF)) 

any suggestions?


#3 ·  Posted (edited)

This is just a snippet of your code, but I think you'll see where I'm coming from. Now, I haven't tested this, and I'm sure you'll notice that I didn't close any of the three loops I started.

 

Local $aFiles = _FileListToArray($mypath, "*.txt", 1, 1)
Local $bFiles = _FileListToArray($mypath, "blacklist.txt", 1, 1)
Local $aWords

For $i = 1 To $aFiles[0]
    For $j = 1 To $bFiles[0]
        If $bFiles[$j] <> $aFiles[$i] Then $aWords = StringRegExp(FileRead($aFiles[$i]), "[^\s]+", 3)
Edit: Actually, the If statement doesn't need to be closed as I wrote it above, since it is a single-line If with only one statement to execute, but change it as you need to.
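
If the ignore list is meant at the word level rather than the file level, one possible reading is sketched below. It is untested and rests on assumptions that are not in the original script: blacklist.txt sits next to the script and holds one word or character per line, and the helper name _CollectWords is made up for illustration.

```autoit
#include <Array.au3>

; Assumption: blacklist.txt sits next to the script, one ignored word or character per line
Local $sBlacklistFile = @ScriptDir & "\blacklist.txt"

; Load the ignore list into a dictionary so lookups stay cheap
Local $oIgnore = ObjCreate("Scripting.Dictionary")
Local $aIgnore = FileReadToArray($sBlacklistFile)
If Not @error Then
    For $k = 0 To UBound($aIgnore) - 1
        Local $sEntry = StringStripWS($aIgnore[$k], 3) ; trim leading and trailing whitespace
        If $sEntry <> "" And Not $oIgnore.Exists($sEntry) Then $oIgnore.Add($sEntry, 1)
    Next
EndIf

; Collect words from a sample string, skipping anything on the ignore list
Local $oWords = ObjCreate("Scripting.Dictionary")
_CollectWords($oWords, $oIgnore, "the quick brown fox jumps over the lazy dog")
Local $aKeys = $oWords.Keys
_ArrayDisplay($aKeys)

; Hypothetical helper: adds every new, non-ignored word of $sText to $oWords
; and returns how many words were added
Func _CollectWords($oWords, $oIgnore, $sText)
    Local $aFound = StringRegExp($sText, "[^\s]+", 3)
    If @error Then Return 0 ; no words in this text
    Local $iAdded = 0
    For $sWord In $aFound
        If $oIgnore.Exists($sWord) Then ContinueLoop ; blacklisted
        If $oWords.Exists($sWord) Then ContinueLoop ; already collected
        $oWords.Add($sWord, $sWord)
        $iAdded += 1
    Next
    Return $iAdded
EndFunc   ;==>_CollectWords
```

In the full script, _CollectWords would be called once per file with the result of FileRead. Scripting.Dictionary compares keys case-sensitively by default, so "The" and "the" count as different entries unless CompareMode is changed before the first Add.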

- Bruce /*somdcomputerguy */  If you change the way you look at things, the things you look at change.


#4 ·  Posted (edited)

Can you please post a significant sample of text including Chinese text?

Remember that AutoIt's implementation of PCRE (the regexp engine) is Unicode-aware, but you need the (*UCP) option to correctly recognize non-ANSI codepoints. Also, \s is probably not the condition you need.

Start by experimenting with this to grab "words" of length > 1 (I also allowed digits, but this may be something you don't want; remove \d in that case):

Local $sText = "A computer is a general purpose device that can be programmed to carry out a set of arithmetic or logical operations automatically" & _
    " কম্পিউটাৰক অসমীয়াত পৰিকলন যন্ত্ৰ বুলিও কোৱা হয়৷ ইংৰাজী কম্পিউটাৰ শব্দটো আহিছে লেটিন ভাষাৰ 'কম্পিউটে' শব্দৰ পৰা যাৰ অৰ্থ হৈছে গণনা৷" & _
    " Сучасны камп'ютар складаецца з абсталявання, якое ўяўляе фізічныя часткі камп'ютара (працэсар, клавіятура, манітор і г.д.)" & _
    " კომპიუტერი (ინგლ. computer) ინგლისური ზიტყვა რე დო გჷშმაკოროცხალს შანენს. თენა რე ელექტრონული გჷშმაკოროცხალი მანქანა" & _
    " 電腦或計算機係一台揸得指令(程式)操作資料嗰機器。" & _
    " '太字'コンピュータ(英: computer)は、自動計算機、とくに計算開始後は人手を介さずに計算終了まで動作する電子式汎用計算機。" & _
    " محتویات این مقاله ممکن است غیر قابل اعتماد و نادرست یا جانبدارانه باشد یا قوانین حقوق پدیدآورندگان را نقض کرده باشد. "
Local $res = StringRegExp($sText, "(*UCP)\b[\pL\d]{2,}", 3)
_ArrayDisplay($res)

The \pL part means "any Unicode letter" (in any language). It is a Unicode character property. See the PCRE reference document (link in my signature) for more details about \p and friends.
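
To see what (*UCP) changes, a small comparison along these lines may help. It is only a sketch: the sample string is invented, and the script file has to be saved in a Unicode encoding (for example UTF-8 with BOM) so that the non-ASCII literals survive.

```autoit
#include <Array.au3>

; Invented sample mixing ASCII, accented Latin, Cyrillic and Han characters
Local $sSample = "naïve café компьютер 電腦 abc123"

; Default behaviour: \w is limited to ASCII letters, digits and underscore
Local $aDefault = StringRegExp($sSample, "\w+", 3)
_ArrayDisplay($aDefault, "\w+ without (*UCP)")

; With (*UCP), \w (and \d, \b, ...) follow Unicode properties instead
Local $aUcp = StringRegExp($sSample, "(*UCP)\w+", 3)
_ArrayDisplay($aUcp, "\w+ with (*UCP)")

; \pL restricts matches to letters, so the digits in "abc123" are left out
Local $aLetters = StringRegExp($sSample, "(*UCP)\pL+", 3)
_ArrayDisplay($aLetters, "\pL+ with (*UCP)")
```

Comparing the three arrays side by side shows which pattern matches your own definition of a "word" best.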

I'm not knowledgeable about Asian languages or the spacing that has to be considered, so this naïve attempt is certainly far from the real thing.
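
Since Chinese text is normally written without spaces between words, a regexp cannot find real word boundaries by itself. As a rough fallback, and purely as an assumption about what might be acceptable here, the \p{Han} script property can pull out either unbroken runs of Han characters or each character individually. Proper word segmentation would need a dictionary-based segmenter, which this sketch does not attempt.

```autoit
#include <Array.au3>

; Sample sentence taken from the test string above
Local $sChinese = "電腦或計算機係一台揸得指令(程式)操作資料嗰機器。"

; Runs of consecutive Han characters: brackets and the full stop end a run,
; so this still returns whole phrases, not words
Local $aRuns = StringRegExp($sChinese, "(*UCP)\p{Han}+", 3)
_ArrayDisplay($aRuns, "Han runs")

; Every Han character on its own: no phrases, but no multi-character words either
Local $aChars = StringRegExp($sChinese, "(*UCP)\p{Han}", 3)
_ArrayDisplay($aChars, "Single Han characters")
```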

Also, you need to ensure that the input text is Unicode and not one of the many multi-byte character encodings widely used in East Asia, such as Big5 and countless others.
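
One way to check what AutoIt makes of a given file, and to read it with an explicit encoding rather than relying on detection, is sketched here. The file name sample.txt is a placeholder, and the sketch assumes the file really is UTF-8; it does not convert Big5 or other legacy encodings.

```autoit
#include <FileConstants.au3>
#include <MsgBoxConstants.au3>

; Assumption: sample.txt is a placeholder for one of the input files
Local $sFile = @ScriptDir & "\sample.txt"

; Ask AutoIt which encoding it detects; the result uses the FileOpen() encoding flag values
Local $iEncoding = FileGetEncoding($sFile)
If @error Then
    MsgBox($MB_SYSTEMMODAL, "Error", "Cannot inspect " & $sFile)
    Exit
EndIf
ConsoleWrite("Detected encoding flag: " & $iEncoding & @CRLF)

; Open the file explicitly as UTF-8 without BOM, then read it through the handle
Local $hFile = FileOpen($sFile, $FO_READ + $FO_UTF8_NOBOM)
If $hFile = -1 Then
    MsgBox($MB_SYSTEMMODAL, "Error", "Cannot open " & $sFile)
    Exit
EndIf
Local $sText = FileRead($hFile)
FileClose($hFile)
ConsoleWrite("Read " & StringLen($sText) & " characters" & @CRLF)
```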

Lastly, I need to remind you that AutoIt currently uses the UCS-2 subset of Unicode, which limits it to plane 0 (the so-called BMP). If your input contains codepoints from higher Unicode planes, then converting the input to UTF-16 LE first might work, but I'm unsure of that. You need to try that possibility.


This wonderful site allows debugging and testing regular expressions (many flavors available). An absolute must have in your bookmarks.
Another excellent RegExp tutorial. Don't forget downloading your copy of up-to-date pcretest.exe and pcregrep.exe here
RegExp tutorial: enough to get started
PCRE v8.33 regexp documentation latest available release and currently implemented in AutoIt beta.

SQLitespeed is another feature-rich premier SQLite manager (includes import/export). Well worth a try.
SQLite Expert (freeware Personal Edition or payware Pro version) is a very useful SQLite database manager.
An excellent eBook covering almost every aspect of SQLite3: a must-read for anyone doing serious work.
SQL tutorial (covers "generic" SQL, but most of it applies to SQLite as well)
A work-in-progress SQLite3 tutorial. Don't miss other LxyzTHW pages!
SQLite official website with full documentation (may be newer than the SQLite library that comes standard with AutoIt)


I tried that snippet and get this error message (screenshot: 2m5b6mo.png).

The code looks like this:

#include <File.au3>
#include <Array.au3>
#include <MsgBoxConstants.au3>

Local $oDictionary = ObjCreate("Scripting.Dictionary")
Local $mypath = @ScriptDir
Local $aFiles = _FileListToArray($mypath, "*.txt", 1, 1)
Local $bFiles = _FileListToArray($mypath, "blacklist.txt", 1, 1)
Local $aWords

For $i = 1 To $aFiles[0]
    For $j = 1 To $bFiles[0]
        If $bFiles[$j] <> $aFiles[$i] Then $aWords = StringRegExp(FileRead($aFiles[$i]), "[^\s]+", 3)
If @error Then
    MsgBox($MB_SYSTEMMODAL, "Error", "No files found")
    Exit
Else
    MsgBox($MB_SYSTEMMODAL, "Found", $aFiles[0] & " files")
EndIf

Local $aWords
For $i = 1 To $aFiles[0]
    $aWords = StringRegExp(FileRead($aFiles[$i]), "[^\s]+", 3)      ; change pattern to fit your definition of "word"
    Local $iError = @error
    If $iError = 0 Then
        For $Word In $aWords
            $oDictionary.ADD($Word, $Word)
        Next
    Else
        MsgBox($MB_SYSTEMMODAL, "Error", $aFiles[$i] & " - " & $i & @CRLF & "error: " & $iError)
    EndIf
Next

$aWords = $oDictionary.Items
FileWrite("saved/words.txt", _ArrayToString($aWords, @CRLF))


Press Ctrl + T in SciTE and it will tell you how many Next statements are missing; the code you posted is missing two.


... I'm sure you'll notice that I didn't close any of the three loops I started.

 

There are 2 'Next' missing in your code


There are 2 'Next' missing in your code

This is just a snippet of your code, but I think you'll see where I'm coming from. Now, I haven't tested this, and I'm sure you'll notice that I didn't close any of the loops I started.



somdcomputerguy,

Obviously I meant 'There are 2 'Next' missing in face's code'  :)


#10 ·  Posted

Ah. A misunderstanding then.. :)



#11 ·  Posted

Now it works perfectly, but it doesn't search subfolders.

How can I make it find all text files in all subfolders?

#include <File.au3>
#include <Array.au3>
#include <MsgBoxConstants.au3>

Local $oDictionary = ObjCreate("Scripting.Dictionary")
Local $mypath = @ScriptDir
Local $aFiles = _FileListToArray($mypath, "*.txt", 1, 1)
; Check @error right after the call, before any other statement runs
If @error Then
    MsgBox($MB_SYSTEMMODAL, "Error", "No files found")
    Exit
Else
    MsgBox($MB_SYSTEMMODAL, "Found", $aFiles[0] & " files")
EndIf

Local $aWords
For $i = 1 To $aFiles[0]
    $aWords = StringRegExp(FileRead($aFiles[$i]), "[^\s]+", 3)      ; change pattern to fit your definition of "word"
    Local $iError = @error
    If $iError = 0 Then
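        For $Word In $aWords
            ; Add each word only once: .Add raises a COM error if the key already exists
            If Not $oDictionary.Exists($Word) Then $oDictionary.Add($Word, $Word)
        Next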
    Else
        MsgBox($MB_SYSTEMMODAL, "Error", $aFiles[$i] & " - " & $i & @CRLF & "error: " & $iError)
    EndIf
Next

$aWords = $oDictionary.Items
FileWrite("saved/words.txt", _ArrayToString($aWords, @CRLF))


#12 ·  Posted

_FileListToArrayRec
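
A minimal sketch of how that function could replace the _FileListToArray call in the script above. The parameter choices are assumptions about what is wanted here (files only, full recursion, no sorting, full paths), and the exclusion of words.txt in the mask is my own addition so that the output file is not read back in on a later run.

```autoit
#include <File.au3>
#include <MsgBoxConstants.au3>

; Mask format is "Include|Exclude": take every *.txt file except words.txt
Local $aFiles = _FileListToArrayRec(@ScriptDir, "*.txt|words.txt", $FLTAR_FILES, $FLTAR_RECUR, $FLTAR_NOSORT, $FLTAR_FULLPATH)
Local $iError = @error, $iExtended = @extended
If $iError Then
    MsgBox($MB_SYSTEMMODAL, "Error", "No files found (@error " & $iError & ", @extended " & $iExtended & ")")
    Exit
EndIf

; Same layout as _FileListToArray: element 0 holds the count, the rest are paths
MsgBox($MB_SYSTEMMODAL, "Found", $aFiles[0] & " files")
For $i = 1 To $aFiles[0]
    ConsoleWrite($aFiles[$i] & @CRLF)
Next
```

Because the returned array has the same shape as before, the existing For loop over $aFiles should work unchanged.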


If I posted any code, assume that code was written using the latest release version unless stated otherwise. Also, if it doesn't work on XP I can't help with that because I don't have access to XP, and I'm not going to.
Give a programmer the correct code and he can do his work for a day. Teach a programmer to debug and he can do his work for a lifetime - by Chirag Gude
How to ask questions the smart way!

I hereby grant any person the right to use any code I post, that I am the original author of, on the autoitscript.com forums, unless I've specifically stated otherwise in the code or the thread post. If you do use my code all I ask, as a courtesy, is to make note of where you got it from.

Back up and restore Windows user files _Array.au3 - Modified array functions that include support for 2D arrays.  -  ColorChooser - An add-on for SciTE that pops up a color dialog so you can select and paste a color code into a script.  -  Customizable Splashscreen GUI w/Progress Bar - Create a custom "splash screen" GUI with a progress bar and custom label.  -  _FileGetProperty - Retrieve the properties of a file  -  SciTE Toolbar - A toolbar demo for use with the SciTE editor  -  GUIRegisterMsg demo - Demo script to show how to use the Windows messages to interact with controls and your GUI.  -   Latin Square password generator
