Best practise for best 'search in filenames' hits?

Guy_ · October 24, 2020

I'm guessing there must be classic methods or scripts for this but I'm not sure how to search for them...

I've read some topics on 'search engines' so far.

Let's say I have a large array with book filenames and I want to be able to check if I already have a title — without missing too much hits because one or more characters are different...

My current thinking is to do a StringRegexp through each row and so I would first build a regex for my search term.

I'm guessing I'd have to escape potential regex characters, replace a bunch of non-word chars with .[any char] , and replace any letter that can have variations with a regex for all those variations?

In other words, if a book I have would be called "L'été Indien", and my search term was "L'ete Indien", my regex pattern could (roughly) be...

"L.[eéèëê]t[eéèëê]\h[iï]nd[iï][eéèëê]n"

Is that good thinking and how it's done, or am I missing the obvious and classic scripts...?

Thanks!

October 24, 2020

1 hour ago, Guy_ said:

Let's say I have a large array with book filenames

Could you please post such a list as .txt or .csv . It would be much easier if we had an already existing base of data.

jchd · October 24, 2020

I would remove diacritics, vowels, punctuation and spaces, leaving only uppercase consonants. Indeed it's very unlikely that a search based on such reduction would return a large number of hits: in 99.99% of real-world runs, you would get only one hit. You can also guard against typos by using a fuzzy search, allowing an increasing number of typing errors until something get caught. I can provide code for all of this.

Of course, handling only latin-based scripts (languages) would be much easier than non-latin scripts like Arabic, Hebrew, Thai, Indic, ... due to their peculiar/complex writing rules. See https://r12a.github.io/scripts/tutorial/part3 to get a grasp about what I'm talking about.

Your large array of books is typically loaded a runtime, hence it must reside in a disk file. Then why not use a database (SQLite would be perfect)? I also have SQLite extensions for fuzzy search and such. The idea would be to create a table with full title, reduced title and possibly many other attributes (ISBN, author(s), editor, edition, ...). The precise schema will have to take your exact requirements into consideration. Anyway an SQLite-based solution would be orders of magnitude faster than bare AutoIt code.

Nine · October 24, 2020

Here something you could start with :

#include <Constants.au3>

$sText = "L'été des indiens : (äëôìû)."

$sText = ReFormat ($sText)
MsgBox ($MB_SYSTEMMODAL,"",$sText)

Func ReFormat ($sString)
  Local $aRepl = [["[ÄÀÁÂ]","A"],["[ËÈÉÊ]","E"],["[ÏÌÍÎ]","I"],["[ÖÒÓÔ]","O"],["Ç","C"],["[ÜÙÚÛ]","U"],["[':;\.,()«»""°]"," "]]

  $sString = StringUpper($sString)
  For $i = 0 to UBound($aRepl)-1
    $sString = StringRegExpReplace($sString,$aRepl[$i][0],$aRepl[$i][1])
  Next
  Return StringStripWS($sString, 7)
EndFunc

jchd · October 24, 2020

Here's an exemple you can play with: a .CSV file and an SQLite database with the same data.

You can use SQLite Expert (free version) to navigate the database. It has 4 columns: title, upcase (uppercase unaccented), shorter (without whitespaces) and shortest (without vowels).

Removing diacritics with _Unaccent(): there is the option Full, to remove more aggressively (depends on language script).

; Unicode Normalization Forms
Global Enum $UNF_NormC = 1, $UNF_NormD, $UNF_NormKC = 5, $UNF_NormKD


Func _UNF_Change($sIn, $iForm)
    If $iForm = $UNF_NormC Or $iForm = $UNF_NormD Or $iForm = $UNF_NormKC Or $iForm = $UNF_NormKD Then
        Local $aRet = DllCall("Normaliz.dll", "int", "NormalizeString", "int", $iForm, "wstr", $sIn, "int", -1, "ptr", 0, "int", 0)
        Local $tOut = DllStructCreate("wchar[" & 2 * ($aRet[0] + 20) & "]")
        $aRet = DllCall("Normaliz.dll", "int", "NormalizeString", "int", $iForm, "wstr", $sIn, "int", -1, "ptr", DllStructGetPtr($tOut, 1), "int", 2 * ($aRet[0] + 20))
        Return DllStructGetData($tOut, 1)
    Else
        SetError(1, 0, $sIn)
    EndIf
EndFunc   ;==>_UNF_Change


Func _Unaccent($s, $Full = 0)
    Local Static $aPat = [ _
            "(*UCP)[\x{300}-\x{36F}`'¨^¸¯]", _    ; combining accents only
            "(*UCP)\p{Mn}|\p{Lm}|\p{Sk}" _        ;      "       "    + all modifying letters
            ]
    Return StringRegExpReplace(_UNF_Change($s, $UNF_NormD), $aPat[Mod($Full, 2)], "")
EndFunc   ;==>_Unaccent

A relatively slow version of a fuzzy search (much faster version in my SQLite extension):

; Computes the number of typos (Damerau-Levenshtein distance) between two short strings.
; Four types of differences are counted:
;       insertion of a character,     abcd     ab#cd
;       deletion of a character,      abcd     acd
;       exchange of a character       abcd     ab$d
;       inversion of adjacent chars   abcd     acbd
;
; This function does NOT satisfy the so-called "triangle inequality", which means
; more simply that it makes NO attempt to compute the MINIMUM edit distance in all
; cases.  If you need that, you should use more complex algorithms.
;
; This simple function allows a fuzzy compare for e.g. recovering from typical
; human typos in short strings like names, address, cities... while getting rid of
; minor scripting differences.
;
; Strings are lowercased.
; String $st2 can be used as a pattern similar to the SQL 'LIKE' operator:
; '_' and trailing '%' act as in LIKE. These wildcards can be passed as parameters
; but these should contain exactly one character for the function to work properly.
;
; Complexity is in O(n^2) so don't use with long strings!
;
Func _Typos(Const $st1, Const $st2, $anychar = '_', $anytail = '%')
    Local $s1, $s2, $pen, $del, $ins, $subst
    If Not IsString($st1) Then Return SetError(-1, -1, -1)
    If Not IsString($st2) Then Return SetError(-2, -2, -1)
    If $st2 = '' Then Return StringLen($st1)
    If $st2 == $anytail Then Return 0
    If $st1 = '' Then
        Return(StringInStr($st2 & $anytail, $anytail, 1) - 1)
    EndIf
;~  $s1 = StringSplit(_LowerUnaccent($st1)), "", 2)     ;; _LowerUnaccent() addon function not available here
;~  $s2 = StringSplit(_LowerUnaccent($st2)), "", 2)     ;; _LowerUnaccent() addon function not available here
    $s1 = StringSplit(StringLower($st1), "", 2)
    $s2 = StringSplit(StringLower($st2), "", 2)
    Local $l1 = UBound($s1), $l2 = UBound($s2)
    Local $r[$l1 + 1][$l2 + 1]
    For $x = 0 To $l2 - 1
        Switch $s2[$x]
            Case $anychar
                If $x < $l1 Then
                    $s2[$x] = $s1[$x]
                EndIf
            Case $anytail
                $l2 = $x
                If $l1 > $l2 Then
                    $l1 = $l2
                EndIf
                ExitLoop
        EndSwitch
        $r[0][$x] = $x
    Next
    $r[0][$l2] = $l2
    For $x = 0 To $l1
        $r[$x][0] = $x
    Next
    For $x = 1 To $l1
        For $y = 1 To $l2
            $pen = Not ($s1[$x - 1] == $s2[$y - 1])
            $del = $r[$x-1][$y] + 1
            $ins = $r[$x][$y-1] + 1
            $subst = $r[$x-1][$y-1] + $pen
            If $del > $ins Then $del = $ins
            If $del > $subst Then $del = $subst
            $r[$x][$y] = $del
            If ($pen And $x > 1 And $y > 1 And $s1[$x-1] == $s2[$y-2] And $s1[$x-2] == $s2[$y-1]) Then
                If $r[$x][$y] >= $r[$x-2][$y-2] Then $r[$x][$y] = $r[$x-2][$y-2] + 1
                $r[$x-1][$y-1] = $r[$x][$y]
            EndIf
        Next
    Next
    Return ($r[$l1][$l2])
EndFunc



;; ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
;;             example usage
;; ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


#include <Array.au3>

Local $reference = "lexicographically"
Local $Words[11][2] = [ _
    [$reference], _
    ["Lexicôgraphicaly"], _
    ["lexkographicaly"], _
    ["Lexico9raphically"], _
    ["lexioo9raphically"], _
    ["Lexicographical"], _
    ["lexicographlcally"], _
    ["Lex1cogr@phically"], _
    ["lexic0graphïca1yl"], _
    ["lexIcOgraphically"], _
    ["Lexlcographically"] _
]

For $i = 0 To UBound($Words) - 1
    $Words[$i][1] = _Typos($Words[$i][0], $reference)
Next
_ArrayDisplay($Words, "Number of typos")

_ConsoleWrite("Usage of '_' and '%' wildcards in pattern:" & @LF & @TAB & "_Typos('lex1c0gr@fhlâofznho', 'LEx_c_gr%') = " & _Typos('lex1c0gr@fhlofznho', 'lex_c_gr%') & @LF)

; Unicode-aware ConsoleWrite
Func _ConsoleWrite($s)
    ConsoleWrite(BinaryToString(StringToBinary($s, 4), 1))
EndFunc   ;==>__ConsoleWrite

Library.csv Library.sq3

Guy_ · October 24, 2020

Oops, posted before I had written anything.

Seems awesome, thanks!! I was still thinking over what I was gonna ask you and what would be ideal. A fuzzy search is surely part of it!

I will come back to this

Edited October 24, 2020 by Guy_

jchd · October 24, 2020

~~In the future, please only quote smallest relevant part(s) of what was posted you have a question about.~~

No problem, chime when you have questions.

Edited October 24, 2020 by jchd

Guy_ · October 24, 2020

8 hours ago, Musashi said:

Could you please post such a list as .txt or .csv . It would be much easier if we had an already existing base of data.

I was just giving a clear to understand example. It could be documents, audio, etc.

Some (obscure) sites may share what you want. I know, for example, IMDB shares a database for all movies, actors, ...

Edited October 24, 2020 by Guy_

Guy_ · October 24, 2020

56 minutes ago, jchd said:

No problem, chime when you have questions.

Maybe a few general questions and ponderings, keeping books and taking notes as an example. Nothing urgent here ...

(1) Since I prefer not to have Windows live search on, I had found this fast freebie with typeahead that you can set on some specified folders and that I could use instead — for example, to quickly check if I already have a book title.
https://antibody-software.com/web/software/software/wizfile-finds-your-files-fast/

Then I thought this was still too awkward and made a hotkey that copies a selected title text and checks against my array with book titles that I freshly load once per session. This loading of 8,000+ files on an 8-year-old PC is only taking between 1 to 3 seconds. After that, if I keep this array in RAM, the search results are immediate — even with an unsorted array. So it seems the speed is covered for most "regular folks" with say, under 50,000 filenames. My guess is that if not all titles are 260 chars, this array would just take a few MB of RAM?

So that mainly left a wish for a better search through the titles (and therefore this topic).

So I'm not targeting scientists who want this to work on a million research papers. But if they offered those files to me, I wouldn't mind if my program worked great for that too — so I guess that is where a database *has to* come in...?

BTW, another issue is that I don't trust my programming enough to cater to scientists that would want to depend on it. But for myself I'm ok to start learning and playing with this.

(2) I've loved programming for FileMaker some years of my life and once tried a MySQL/PHP website that got a bit too tedious for me. So I've never looked into it for AutoIt cos it scared me off a little. At the same time I saw it can be "very simple" and my needs are too (I think).
So I guess an advantage would also be is if I search on an author, the database can show me all kinds of files I have with him from different folders. Hm, I think my current system can do this too if I gather filenames from all involved folders...

So is an advantage more that it can be indexed faster or something? (OTOH, I could also save a sorted array to a file and reload that)

I seem to remember a day when every program started including a database and the seemingly many extra files it involved, and that I "hated" that...

I like that what my program may someday become is basically just one .exe and wondering if working with a database would require an install and dozens of extra files? Maybe it's just an extension as you say.

I will for sure already look into what you posted and glance over some of your sig links in the coming week In principle, I love databases and they can become too addicting. OTOH, most of my FileMaker databases I stopped using, and, for example, now favoring taking notes in WordPad instead (less RAM and less likely to get damaged).

WordPad does have its own issues after which I thought I might wanna start using LibreOffice, or anything slightly more advanced that can use styling as well. When I see that LibreOffice takes like 1 GB of RAM, I guess that's where it becomes interesting to write your own Evernote thing. The major advantage I see here is that I would find a hits report for a search throughout my notes, while in WordPad I would be searching under specific headings more, or regularly convert to pdf to be able to have an Acrobat search result list.

Also, for potentially sharing/selling, I think people feel safer with actual files they are familiar with (WordPad, Word, etc.), instead of feeling their data is "hidden in database files" somewhere and could get lost or damaged any second, or become inaccessible when the software goes down, etc. It's kind of my feeling at least, although I know these files are very simple text files, and you could probably turn them out in .rtf format at any time if you program that (I used to do that when I used FileMaker)

jchd · October 24, 2020

FYI, an SQLite DB is just one file, not a herd of files, which is portable verbatim on any hardware/software platform you can think of. SQLite is zero installation and zero maintainance.

Also SQLite is by far the most widespread DB engine ever and one of the top most used software piece: https://www.sqlite.org/mostdeployed.html So you already use SQLite several times daily, without even noticing.

A DB file is by far more reliable that anything else, thanks to the ACID properties of DBs. SQLite file format has not changed since introduction of version 3, 2004-07-22, something very few software can be proud of. Yet there have been an enormous number of powerful features added to the library over the years. Given the "around 10¹²" implementations in current use all over the world, it's guaranteed that this format is going to last very long (authors says at least till 2050). SQLite is free: public-domain open source, just needs a C compiler to build from source. Standard SQLite DBs can reach 281 TB, more than you'll ever need!

All in all, an SQLite DB is ideal for storage, maintainance and query of any-sized dataset, with the added benefit of possible concurrency and powerful querying language. No other file format offers the same power, speed and flexibility.

Of course, there is always a little price to pay: learn to use the features you'll need, but that's what help foras are for, aren't they?

Edited October 24, 2020 by jchd

Guy_ · October 24, 2020

25 minutes ago, jchd said:

FYI, an SQLite DB is just one file, not a herd of files [...]

Thanks for all that! Yeah, for myself ideas are already bubbling up. LIke seeing replies from just the author of my choice, etc. It could be my old FileMaker database but with more dependable files.

I'll probably build my "simple" version first and have peeks at this extension for a future version or burst of enthusiasm.

And first I'll dig into the code above a.s.a.p.

Block all input without UAC	Save/Retrieve Images to/from Text	Monitor Management (VCP commands)
Tool to search in text (au3) files	Date Range Picker	Virtual Desktop Manager
Sudoku Game 2020	Overlapped Named Pipe IPC	HotString 2.0 - Hot keys with string
x64 Bitwise Operations	Multi-keyboards HotKeySet	Recursive Array Display
Fast and simple WCD IPC	Multiple Folders Selector	Printer Manager
GIF Animation (cached) Debug Messages Monitor UDF	Screen Scraping Round Corner GUI UDF	Multi-Threading Made Easy Interface Object based on Tag

Sign In

Best practise for best 'search in filenames' hits?

Recommended Posts

Guy_

Guest

jchd

Nine

jchd

Guy_

jchd

Guy_

Guy_

jchd

Guy_

Create an account or sign in to comment

Create an account

Sign in

Browse

AutoIt Resources

Release

Beta