extract words from text files


face

I have an AutoIt program that extracts the words from all text files in a folder and saves them to a word-list text file.

I need to add an "ignore" option: a blacklist of words or single characters that should be skipped. Also, I'm not sure whether the program handles word boundaries and spacing in Chinese text. It has to detect where words are separated in Chinese text so that it doesn't extract entire phrases.

Here's the code:

#include <File.au3>
#include <Array.au3>
#include <MsgBoxConstants.au3>

Local $oDictionary = ObjCreate("Scripting.Dictionary")
Local $mypath = @ScriptDir
Local $aFiles = _FileListToArray($mypath, "*.txt", 1, 1)
If @error Then
    MsgBox($MB_SYSTEMMODAL, "Error", "No files found")
    Exit
Else
    MsgBox($MB_SYSTEMMODAL, "Found", $aFiles[0] & " files")
EndIf

Local $aWords
For $i = 1 To $aFiles[0]
    $aWords = StringRegExp(FileRead($aFiles[$i]), "[^\s]+", 3)      ; change pattern to fit your definition of "word"
    Local $iError = @error
    If $iError = 0 Then
        For $Word In $aWords
            $oDictionary.ADD($Word, $Word)
        Next
    Else
        MsgBox($MB_SYSTEMMODAL, "Error", $aFiles[$i] & " - " & $i & @CRLF & "error: " & $iError)
    EndIf
Next

$aWords = $oDictionary.Items
FileWrite("saved/words.txt", _ArrayToString($aWords, @CRLF)) 
Edited by face

This is just a snippet of your code, but I think you'll see where I'm coming from. Now, I haven't tested this, and I'm sure you'll notice that I didn't close any of the three loops I started.

 

Local $aFiles = _FileListToArray($mypath, "*.txt", 1, 1)
Local $bFiles = _FileListToArray($mypath, "blacklist.txt", 1, 1)
Local $aWords

For $i = 1 To $aFiles[0]
    For $j = 1 To $bFiles[0]
        If $bFiles[$j] <> $aFiles[$i] Then $aWords = StringRegExp(FileRead($aFiles[$i]), "[^\s]+", 3)
edit: Actually, the If statement doesn't need to be closed as I wrote it above, since there is only one line to execute, but change it as you need to.

Edited by somdcomputerguy

- Bruce /*somdcomputerguy */  If you change the way you look at things, the things you look at change.


Can you please post a significant sample of text including Chinese text?

Remember that AutoIt's implementation of PCRE (the regexp engine) is Unicode-aware, but you need to use the (*UCP) option to correctly recognize non-ANSI codepoints. Also, \s is probably not the condition you need.

Start by experimenting with this to grab "words" having length > 1 (I also allowed digits, but this may be something you don't want; remove \d in that case):

#include <Array.au3> ; needed for _ArrayDisplay

Local $sText = "A computer is a general purpose device that can be programmed to carry out a set of arithmetic or logical operations automatically" & _
    " কম্পিউটাৰক অসমীয়াত পৰিকলন যন্ত্ৰ বুলিও কোৱা হয়৷ ইংৰাজী কম্পিউটাৰ শব্দটো আহিছে লেটিন ভাষাৰ 'কম্পিউটে' শব্দৰ পৰা যাৰ অৰ্থ হৈছে গণনা৷" & _
    " Сучасны камп'ютар складаецца з абсталявання, якое ўяўляе фізічныя часткі камп'ютара (працэсар, клавіятура, манітор і г.д.)" & _
    " კომპიუტერი (ინგლ. computer) ინგლისური ზიტყვა რე დო გჷშმაკოროცხალს შანენს. თენა რე ელექტრონული გჷშმაკოროცხალი მანქანა" & _
    " 電腦或計算機係一台揸得指令(程式)操作資料嗰機器。" & _
    " '太字'コンピュータ(英: computer)は、自動計算機、とくに計算開始後は人手を介さずに計算終了まで動作する電子式汎用計算機。" & _
    " محتویات این مقاله ممکن است غیر قابل اعتماد و نادرست یا جانبدارانه باشد یا قوانین حقوق پدیدآورندگان را نقض کرده باشد. "
Local $res = StringRegExp($sText, "(*UCP)\b[\pL\d]{2,}", 3)
_ArrayDisplay($res)

The \pL part means "any Unicode letter" (in any language). It is a Unicode Character Property. See the PCRE reference document (link in my signature) for more details about \p and friends.

I'm not knowledgeable about Asian languages and the spacing that has to be considered, so this naïve attempt is certainly far from the real thing.

Also, you need to ensure that the input text is Unicode and not one of the many multi-byte character encodings widely used in East Asia, like Big5 and countless others.

Lastly, I need to remind you that AutoIt currently uses the UCS-2 subset of Unicode, which limits it to plane 0 (the so-called BMP). If your input contains codepoints from higher Unicode planes, then converting the input to UTF-16LE first might work, but I'm unsure of that. You need to try that possibility.
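
As a rough way to act on the encoding point above, the sketch below lists the .txt files next to the script and prints what FileGetEncoding reports for each one. It is only a starting point: the detection is heuristic, a file without a BOM in a legacy multi-byte encoding may not be identified reliably, and the console messages are made up for this illustration.

```autoit
#include <File.au3>

; List the .txt files next to the script (full paths) and report each file's detected encoding.
Local $aFiles = _FileListToArray(@ScriptDir, "*.txt", $FLTA_FILES, True)
If @error Then Exit

For $i = 1 To $aFiles[0]
    ; FileGetEncoding returns a value comparable to the FileOpen encoding flags, or -1 on failure.
    Local $iEnc = FileGetEncoding($aFiles[$i])
    Switch $iEnc
        Case $FO_UTF16_LE, $FO_UTF16_BE, $FO_UTF8, $FO_UTF8_NOBOM
            ConsoleWrite("Unicode (" & $iEnc & "): " & $aFiles[$i] & @CRLF)
        Case Else
            ; ANSI, unknown or unreadable: inspect or convert this file before extracting words.
            ConsoleWrite("Check encoding (" & $iEnc & "): " & $aFiles[$i] & @CRLF)
    EndSwitch
Next
```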

Edited by jchd

This wonderful site allows debugging and testing regular expressions (many flavors available). An absolute must have in your bookmarks.
Another excellent RegExp tutorial. Don't forget downloading your copy of up-to-date pcretest.exe and pcregrep.exe here
RegExp tutorial: enough to get started
PCRE v8.33 regexp documentation latest available release and currently implemented in AutoIt beta.



[Screenshot: 2m5b6mo.png]

With the blacklist snippet added, I get this error message.

The code looks like this:

#include <File.au3>
#include <Array.au3>
#include <MsgBoxConstants.au3>

Local $oDictionary = ObjCreate("Scripting.Dictionary")
Local $mypath = @ScriptDir
Local $aFiles = _FileListToArray($mypath, "*.txt", 1, 1)
Local $bFiles = _FileListToArray($mypath, "blacklist.txt", 1, 1)
Local $aWords

For $i = 1 To $aFiles[0]
    For $j = 1 To $bFiles[0]
        If $bFiles[$j] <> $aFiles[$i] Then $aWords = StringRegExp(FileRead($aFiles[$i]), "[^\s]+", 3)
If @error Then
    MsgBox($MB_SYSTEMMODAL, "Error", "No files found")
    Exit
Else
    MsgBox($MB_SYSTEMMODAL, "Found", $aFiles[0] & " files")
EndIf

Local $aWords
For $i = 1 To $aFiles[0]
    $aWords = StringRegExp(FileRead($aFiles[$i]), "[^\s]+", 3)      ; change pattern to fit your definition of "word"
    Local $iError = @error
    If $iError = 0 Then
        For $Word In $aWords
            $oDictionary.ADD($Word, $Word)
        Next
    Else
        MsgBox($MB_SYSTEMMODAL, "Error", $aFiles[$i] & " - " & $i & @CRLF & "error: " & $iError)
    EndIf
Next

$aWords = $oDictionary.Items
FileWrite("saved/words.txt", _ArrayToString($aWords, @CRLF))

Now it works perfectly, but it doesn't search subfolders.

How can I make it find all text files in all subfolders?

#include <File.au3>
#include <Array.au3>
#include <MsgBoxConstants.au3>

Local $oDictionary = ObjCreate("Scripting.Dictionary")
Local $mypath = @ScriptDir
Local $aFiles = _FileListToArray($mypath, "*.txt", 1, 1)
Local $aWords

If @error Then
    MsgBox($MB_SYSTEMMODAL, "Error", "No files found")
    Exit
Else
    MsgBox($MB_SYSTEMMODAL, "Found", $aFiles[0] & " files")
EndIf

Local $aWords
For $i = 1 To $aFiles[0]
    $aWords = StringRegExp(FileRead($aFiles[$i]), "[^\s]+", 3)      ; change pattern to fit your definition of "word"
    Local $iError = @error
    If $iError = 0 Then
        For $Word In $aWords
            $oDictionary.ADD($Word, $Word)
        Next
    Else
        MsgBox($MB_SYSTEMMODAL, "Error", $aFiles[$i] & " - " & $i & @CRLF & "error: " & $iError)
    EndIf
Next

$aWords = $oDictionary.Items
FileWrite("saved/words.txt", _ArrayToString($aWords, @CRLF))

Use _FileListToArrayRec instead of _FileListToArray; it can search subfolders recursively.
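
A minimal sketch of a recursive search with that function, assuming the standard File.au3 UDF and its $FLTAR_* constants. Asking for full paths is a choice made for this example so that FileRead does not depend on the current working directory.

```autoit
#include <File.au3>
#include <MsgBoxConstants.au3>

; Files only, recurse into all subfolders, no sorting, return full paths.
Local $aFiles = _FileListToArrayRec(@ScriptDir, "*.txt", $FLTAR_FILES, $FLTAR_RECUR, $FLTAR_NOSORT, $FLTAR_FULLPATH)
If @error Then
    MsgBox($MB_SYSTEMMODAL, "Error", "No files found (@extended = " & @extended & ")")
    Exit
EndIf

; Element [0] holds the count; the paths start at index 1.
For $i = 1 To $aFiles[0]
    ConsoleWrite($aFiles[$i] & @CRLF)
Next
```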



  • 6 years later...

This script throws an error. How can I fix it?

#include <File.au3>
#include <Array.au3>
#include <MsgBoxConstants.au3>

Local $oDictionary = ObjCreate("Scripting.Dictionary")
Local $mypath = @ScriptDir
Local $aFiles = _FileListToArrayRec($mypath, "*.txt", 1, 1)
Local $aWords

If @error Then
    MsgBox($MB_SYSTEMMODAL, "Error", "No files found")
    Exit
Else
    MsgBox($MB_SYSTEMMODAL, "Found", $aFiles[0] & " files")
EndIf

Local $aWords
For $i = 1 To $aFiles[0]
    $aWords = StringRegExp(FileRead($aFiles[$i]), "[^\s]+", 3)      ; change pattern to fit your definition of "word"
    Local $iError = @error
    If $iError = 0 Then
        For $Word In $aWords
            $oDictionary.ADD($Word, $Word)
        Next
    Else
        MsgBox($MB_SYSTEMMODAL, "Error", $aFiles[$i] & " - " & $i & @CRLF & "error: " & $iError)
    EndIf
Next

$aWords = $oDictionary.Items
FileWrite("words.txt", _ArrayToString($aWords, @CRLF))

 

[Screenshot: Capture.PNG]

What is the regexp for Chinese characters?

Edited by vinnyMS

The error probably occurs because the key being added already exists. For example:

$sd = ObjCreate("Scripting.Dictionary")
$sd.add("test", "1")
$sd.add("test", "2")
msgbox(0,"", $sd.Item("test"))

So you might try

If not $oDictionary.Exists($Word) Then $oDictionary.ADD($Word, $Word)

The regex should work for Chinese characters, but you can add (*UCP) at the beginning of the pattern.
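
To target Chinese characters specifically, one option is the PCRE script property \p{Han}. This is a sketch and assumes the PCRE build in your AutoIt version supports script properties. The sample string is built with ChrW so the result does not depend on how the script file itself is encoded. Because written Chinese does not put spaces between words, the second pattern returns whole runs of characters (split only at punctuation or other non-Han text), while the first returns single characters. Real word segmentation would need a dictionary-based approach, which a regular expression alone does not provide.

```autoit
#include <Array.au3>

; Latin text mixed with Han characters (U+96FB U+8166, a full-width comma, U+8A08 U+7B97 U+6A5F).
Local $sText = "computer " & ChrW(0x96FB) & ChrW(0x8166) & ChrW(0xFF0C) & _
        ChrW(0x8A08) & ChrW(0x7B97) & ChrW(0x6A5F) & " machine"

; One array element per single Han character.
Local $aChars = StringRegExp($sText, "(*UCP)\p{Han}", 3)

; One array element per uninterrupted run of Han characters.
Local $aRuns = StringRegExp($sText, "(*UCP)\p{Han}+", 3)

If IsArray($aChars) Then _ArrayDisplay($aChars, "Single Han characters")
If IsArray($aRuns) Then _ArrayDisplay($aRuns, "Runs of Han characters")
```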


Alternatively, you can use the assignment operator. In this case an item is added if it does not exist, or overwritten if it already exists:

$oDictionary("TheKey") = "TheValue"
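
The same assignment syntax also gives a simple way to build the blacklist asked for at the start of the thread: keep the ignored words in a second dictionary and skip any extracted word found there. The following is a sketch only. The file names blacklist.txt and words_filtered.txt are hypothetical, the blacklist is assumed to hold one entry per line, and matching is case-sensitive (the dictionary's default compare mode).

```autoit
#include <Array.au3>
#include <File.au3>

; Hypothetical file names, chosen for this sketch only.
Local $sBlacklistFile = @ScriptDir & "\blacklist.txt"
Local $sOutputFile = @ScriptDir & "\words_filtered.txt"

Local $oBlacklist = ObjCreate("Scripting.Dictionary")
Local $oWords = ObjCreate("Scripting.Dictionary")

; Load one blacklist entry per line; a missing or empty file leaves the blacklist empty.
Local $aBlack = FileReadToArray($sBlacklistFile)
If Not @error Then
    For $sEntry In $aBlack
        $sEntry = StringStripWS($sEntry, 3) ; 3 = strip leading and trailing whitespace
        If $sEntry <> "" Then $oBlacklist($sEntry) = 1
    Next
EndIf

Local $aFiles = _FileListToArray(@ScriptDir, "*.txt", $FLTA_FILES, True)
If @error Then Exit

Local $aFound
For $i = 1 To $aFiles[0]
    ; Do not harvest words from the blacklist itself or from a previous output file.
    If $aFiles[$i] = $sBlacklistFile Or $aFiles[$i] = $sOutputFile Then ContinueLoop
    $aFound = StringRegExp(FileRead($aFiles[$i]), "(*UCP)[^\s]+", 3)
    If @error Then ContinueLoop
    For $sWord In $aFound
        ; Assignment never fails on duplicates, unlike the Add method.
        If Not $oBlacklist.Exists($sWord) Then $oWords($sWord) = 1
    Next
Next

If $oWords.Count > 0 Then
    Local $aKeys = $oWords.Keys
    ; Overwrite any earlier output and write it as UTF-8 so non-ANSI words survive.
    Local $hOut = FileOpen($sOutputFile, $FO_OVERWRITE + $FO_UTF8)
    FileWrite($hOut, _ArrayToString($aKeys, @CRLF))
    FileClose($hOut)
EndIf
```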

 


If you want faster results, you could use a Map (see the beta version):

#include <File.au3> ; needed for _FileListToArray and $FLTA_FILES

Const $mypath = @ScriptDir
Local $aFiles = _FileListToArray($mypath, "*.txt", $FLTA_FILES)
Local $mWord[] ; create an empty Map
Local $aWords
For $i = 1 To $aFiles[0]
    $aWords = StringRegExp(FileRead($aFiles[$i]), "[^\s]+", 3)   ; change pattern to fit your definition of "word"
    If Not IsArray($aWords) Then ContinueLoop
    For $Word In $aWords
        $mWord[$Word] = 1 ; assigning to an existing key just overwrites it, so duplicates are harmless
    Next
Next
$aWords = MapKeys($mWord)
ConsoleWrite(UBound($aWords) & @CRLF) ; number of distinct words

 

Edited by Nine