Fuzzy Trim, v.03a
16 August 2020, Samuel Murray

Version 3a: 18 August 2020

The scripts search for similar sentences and group them together.

- The input file is a plain text file, UTF-8 with BOM, with one sentence per line.
- The first output file contains all the original sentences, but similar sentences are grouped together, while non-similar sentences remain in their original position in the file.  Found instances of similar sentences are enclosed in double curly brackets.  The first column in the output file shows the number of seconds since the start of the process; this can help you determine which combination of settings works best for your type of text.  The third column contains either the fuzzy match percentage or a comment about the sentence.
- The second output file contains the first sentence in the first column, the match percentage in the second column, and the fuzzy matched sentence in the third column.
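
If you generate the input file from another program, the encoding is the part most likely to go wrong. The following Python sketch is not part of the AutoIt scripts; it only illustrates one way to produce and read back "UTF-8 with BOM, one sentence per line". The function names are made up for this example, and the Windows-style CRLF line endings are an assumption.

```python
def encode_input(sentences):
    """Encode sentences as UTF-8 with BOM, one sentence per line."""
    # CRLF line endings are an assumption (the scripts run on Windows).
    text = "".join(sentence.strip() + "\r\n" for sentence in sentences)
    # The "utf-8-sig" codec prepends the byte order mark when encoding.
    return text.encode("utf-8-sig")


def decode_input(data):
    """Decode such a file back into a list of sentences."""
    # "utf-8-sig" drops a leading BOM if there is one.
    return data.decode("utf-8-sig").splitlines()


if __name__ == "__main__":
    data = encode_input(["First sentence.", "Second sentence."])
    with open("input.txt", "wb") as handle:  # hypothetical file name
        handle.write(data)
```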

The four scripts are identical except that each one uses a different match searching method.  Type 1 appears to be the fastest for long texts; type 2 appears to be the fastest for short texts.

Relevant URLs and credits to authors of the functions:
https://www.autoitscript.com/forum/topic/144812-levenshtein-distance-function-problem/?tab=comments#comment-1022120
https://www.autoitscript.com/forum/topic/179886-comparing-strings/?tab=comments#comment-1291067
https://www.autoitscript.com/forum/topic/40843-calculate-string-similarity/?tab=comments#comment-642358

These are AutoIt scripts, so you must have AutoIt installed on your computer to use them.

==

Usage

1. Double-click a script and then select an input file.
2. The script will ask for various settings. 
3. Then, the script will create the output file and start searching for fuzzy matches.
4. The script will write processed sentences and their matches to the output file.
5. At the end, a short report will be written to the bottom of the output file.

Notes

1. You can stop the script at any time by right-clicking its icon in the system tray.
2. The script will show tooltips in the top-left corner of your screen, but you can get rid of the tooltips by editing the AU3 files and adding a semicolon and a space (; ) in front of the two lines that start with "ToolTip".

==

Settings

- Fuzzy threshold: 75 means '75% similarity or higher'. A lower number will result in more matches. Default is 75%.
- Length threshold: 1.0 means sentences must be identical in length; 0.75 means two sentences are not compared if the shorter one is less than 75% of the length of the longer one. Default is 0.75.
- Spaces limit: 5 means sentences with 4, 3, 2, 1 or 0 spaces are not checked. Type 'N' to disable the spaces limit. Default is 5.
- Global space limit (Y/N): If YES, then the spaces limit will also apply to the second of the two sentences being compared; if NO, then the spaces limit applies only to the first of the two sentences being compared.
- Length limit: Sentences SHORTER than this are not checked. Type 'N' to disable length limit. Default is 10.
- Upper length limit: Sentences LONGER than this are not checked. Type 'N' to disable the upper length limit. Default is 'N'.
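
To make the way these exclusion settings interact more concrete, here is a rough Python sketch of a pre-check that decides whether a pair of sentences is worth fuzzy matching at all. It is not the code from the AU3 files. The function and parameter names are invented for this example, None stands in for typing 'N', and two points are assumptions: that the length limits apply to both sentences, and that the length threshold compares the shorter length against the longer one.

```python
def should_compare(first, second, length_threshold=0.75, spaces_limit=5,
                   global_spaces_limit=False, length_limit=10,
                   upper_length_limit=None):
    """Return True if the pair passes all exclusion settings."""
    # Spaces limit: a sentence with fewer spaces than the limit is skipped.
    if spaces_limit is not None:
        if first.count(" ") < spaces_limit:
            return False
        # With the global option, the second sentence is held to the same limit.
        if global_spaces_limit and second.count(" ") < spaces_limit:
            return False
    # Length limits (ASSUMPTION: applied to both sentences).
    for sentence in (first, second):
        if length_limit is not None and len(sentence) < length_limit:
            return False
        if upper_length_limit is not None and len(sentence) > upper_length_limit:
            return False
    # Length threshold (ASSUMPTION: shorter length / longer length >= threshold).
    shorter, longer = sorted((len(first), len(second)))
    return shorter >= length_threshold * longer
```

Cheap checks like these run before the expensive fuzzy comparison, which is why stricter settings tend to shorten the run time.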

==

Tweaks

1. Generally, sorting your file alphabetically or by length does not speed up the process.
2. If you can reduce multiple exact duplicate sentences to just one, it can speed things up.
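
For tweak 2, any tool that removes exact duplicates will do. As one possible illustration (a Python sketch with a made-up function name, not something the scripts do for you), duplicates can be dropped while keeping the first occurrence of each sentence in its original position:

```python
def drop_exact_duplicates(sentences):
    """Keep only the first occurrence of each sentence, preserving order."""
    # A dict keeps its keys in insertion order (Python 3.7 and later),
    # so this removes repeats without sorting the file.
    return list(dict.fromkeys(sentences))
```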

==

Issues

1. I do not know whether fuzzy matching is case-sensitive or case-insensitive.
2. To pause or cancel a script, right-click its icon in the system tray.

==

Types

At this time, there are four "types" of fuzzy match searching methods that you can try:

- Type 1: WideBoyDixon's implementation of the Levenshtein method
- Type 2: WideBoyDixon's implementation of the Sift2 method
- Type 3: jguinch's implementation of the Levenshtein method
- Type 4: chemistRE's implementation of the Levenshtein method
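
For readers who have not met the Levenshtein method before, here is a minimal Python sketch of the general idea. It is a textbook version, not a port of any of the three forum implementations listed above. The conversion of the edit distance into a percentage (dividing by the length of the longer sentence) is an assumption made for this example, and the sketch compares characters exactly, so it is case-sensitive.

```python
def levenshtein(a, b):
    """Number of single-character insertions, deletions and substitutions
    needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity_percent(a, b):
    """ASSUMPTION: similarity = 100 * (1 - distance / longer length)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0  # two empty strings are treated as identical
    return 100.0 * (1 - levenshtein(a, b) / longest)
```

Under this assumed formula, "abcd" and "abcx" differ by one substitution out of four characters and score 75.0, which would just meet a fuzzy threshold of 75.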

==

Version 1: The original scripts
Version 2: I was testing an alternative, potentially faster checking method. The alternative method worked well in simple tests, but failed terribly on real-world files.
Version 3: Cleaned up the variable names, etc. Added several options to exclude sentences.

