Jump to content

Best practise for best 'search in filenames' hits?


Guy_
 Share

Recommended Posts

I'm guessing there must be classic methods or scripts for this but I'm not sure how to search for them...

I've read some topics on 'search engines' so far.

Let's say I have a large array with book filenames and I want to be able to check if I already have a title — without missing too much hits because one or more characters are different...

My current thinking is to do a StringRegexp through each row and so I would first build a regex for my search term.

I'm guessing I'd have to escape potential regex characters, replace a bunch of non-word chars with .[any char] , and replace any letter that can have variations with a regex for all those variations?

In other words, if a book I have would be called "L'été Indien", and my search term was "L'ete Indien", my regex pattern could (roughly) be... 

"L.[eéèëê]t[eéèëê]\h[iï]nd[iï][eéèëê]n"

Is that good thinking and how it's done, or am I missing the obvious and classic scripts...?

Thanks! :)

Link to comment
Share on other sites

1 hour ago, Guy_ said:

Let's say I have a large array with book filenames

Could you please post such a list as .txt or .csv . It would be much easier if we had an already existing base of data.

Musashi-C64.png

"In the beginning the Universe was created. This has made a lot of people very angry and been widely regarded as a bad move."

Link to comment
Share on other sites

I would remove diacritics, vowels, punctuation and spaces, leaving only uppercase consonants. Indeed it's very unlikely that a search based on such reduction would return a large number of hits: in 99.99% of real-world runs, you would get only one hit. You can also guard against typos by using a fuzzy search, allowing an increasing number of typing errors until something get caught. I can provide code for all of this.

Of course, handling only latin-based scripts (languages) would be much easier than non-latin scripts like Arabic, Hebrew, Thai, Indic, ... due to their peculiar/complex writing rules. See https://r12a.github.io/scripts/tutorial/part3 to get a grasp about what I'm talking about.

Your large array of books is typically loaded a runtime, hence it must reside in a disk file. Then why not use a database (SQLite would be perfect)? I also have SQLite extensions for fuzzy search and such. The idea would be to create a table with full title, reduced title and possibly many other attributes (ISBN, author(s), editor, edition, ...). The precise schema will have to take your exact requirements into consideration. Anyway an SQLite-based solution would be orders of magnitude faster than bare AutoIt code.

This wonderful site allows debugging and testing regular expressions (many flavors available). An absolute must have in your bookmarks.
Another excellent RegExp tutorial. Don't forget downloading your copy of up-to-date pcretest.exe and pcregrep.exe here
RegExp tutorial: enough to get started
PCRE v8.33 regexp documentation latest available release and currently implemented in AutoIt beta.

SQLitespeed is another feature-rich premier SQLite manager (includes import/export). Well worth a try.
SQLite Expert (freeware Personal Edition or payware Pro version) is a very useful SQLite database manager.
An excellent eBook covering almost every aspect of SQLite3: a must-read for anyone doing serious work.
SQL tutorial (covers "generic" SQL, but most of it applies to SQLite as well)
A work-in-progress SQLite3 tutorial. Don't miss other LxyzTHW pages!
SQLite official website with full documentation (may be newer than the SQLite library that comes standard with AutoIt)

Link to comment
Share on other sites

Here something you could start with :

#include <Constants.au3>

$sText = "L'été des indiens : (äëôìû)."

$sText = ReFormat ($sText)
MsgBox ($MB_SYSTEMMODAL,"",$sText)

Func ReFormat ($sString)
  Local $aRepl = [["[ÄÀÁÂ]","A"],["[ËÈÉÊ]","E"],["[ÏÌÍÎ]","I"],["[ÖÒÓÔ]","O"],["Ç","C"],["[ÜÙÚÛ]","U"],["[':;\.,()«»""°]"," "]]

  $sString = StringUpper($sString)
  For $i = 0 to UBound($aRepl)-1
    $sString = StringRegExpReplace($sString,$aRepl[$i][0],$aRepl[$i][1])
  Next
  Return StringStripWS($sString, 7)
EndFunc

 

Link to comment
Share on other sites

Here's an exemple you can play with: a .CSV file and an SQLite database with the same data.

You can use SQLite Expert (free version) to navigate the database. It has 4 columns: title, upcase (uppercase unaccented), shorter (without whitespaces) and shortest (without vowels).

Removing diacritics with _Unaccent(): there is the option Full, to remove more aggressively (depends on language script).

; Unicode Normalization Forms
Global Enum $UNF_NormC = 1, $UNF_NormD, $UNF_NormKC = 5, $UNF_NormKD


Func _UNF_Change($sIn, $iForm)
    If $iForm = $UNF_NormC Or $iForm = $UNF_NormD Or $iForm = $UNF_NormKC Or $iForm = $UNF_NormKD Then
        Local $aRet = DllCall("Normaliz.dll", "int", "NormalizeString", "int", $iForm, "wstr", $sIn, "int", -1, "ptr", 0, "int", 0)
        Local $tOut = DllStructCreate("wchar[" & 2 * ($aRet[0] + 20) & "]")
        $aRet = DllCall("Normaliz.dll", "int", "NormalizeString", "int", $iForm, "wstr", $sIn, "int", -1, "ptr", DllStructGetPtr($tOut, 1), "int", 2 * ($aRet[0] + 20))
        Return DllStructGetData($tOut, 1)
    Else
        SetError(1, 0, $sIn)
    EndIf
EndFunc   ;==>_UNF_Change


Func _Unaccent($s, $Full = 0)
    Local Static $aPat = [ _
            "(*UCP)[\x{300}-\x{36F}`'¨^¸¯]", _    ; combining accents only
            "(*UCP)\p{Mn}|\p{Lm}|\p{Sk}" _        ;      "       "    + all modifying letters
            ]
    Return StringRegExpReplace(_UNF_Change($s, $UNF_NormD), $aPat[Mod($Full, 2)], "")
EndFunc   ;==>_Unaccent

A relatively slow version of a fuzzy search (much faster version in my SQLite extension):

; Computes the number of typos (Damerau-Levenshtein distance) between two short strings.
; Four types of differences are counted:
;       insertion of a character,     abcd     ab#cd
;       deletion of a character,      abcd     acd
;       exchange of a character       abcd     ab$d
;       inversion of adjacent chars   abcd     acbd
;
; This function does NOT satisfy the so-called "triangle inequality", which means
; more simply that it makes NO attempt to compute the MINIMUM edit distance in all
; cases.  If you need that, you should use more complex algorithms.
;
; This simple function allows a fuzzy compare for e.g. recovering from typical
; human typos in short strings like names, address, cities... while getting rid of
; minor scripting differences.
;
; Strings are lowercased.
; String $st2 can be used as a pattern similar to the SQL 'LIKE' operator:
; '_' and trailing '%' act as in LIKE. These wildcards can be passed as parameters
; but these should contain exactly one character for the function to work properly.
;
; Complexity is in O(n^2) so don't use with long strings!
;
Func _Typos(Const $st1, Const $st2, $anychar = '_', $anytail = '%')
    Local $s1, $s2, $pen, $del, $ins, $subst
    If Not IsString($st1) Then Return SetError(-1, -1, -1)
    If Not IsString($st2) Then Return SetError(-2, -2, -1)
    If $st2 = '' Then Return StringLen($st1)
    If $st2 == $anytail Then Return 0
    If $st1 = '' Then
        Return(StringInStr($st2 & $anytail, $anytail, 1) - 1)
    EndIf
;~  $s1 = StringSplit(_LowerUnaccent($st1)), "", 2)     ;; _LowerUnaccent() addon function not available here
;~  $s2 = StringSplit(_LowerUnaccent($st2)), "", 2)     ;; _LowerUnaccent() addon function not available here
    $s1 = StringSplit(StringLower($st1), "", 2)
    $s2 = StringSplit(StringLower($st2), "", 2)
    Local $l1 = UBound($s1), $l2 = UBound($s2)
    Local $r[$l1 + 1][$l2 + 1]
    For $x = 0 To $l2 - 1
        Switch $s2[$x]
            Case $anychar
                If $x < $l1 Then
                    $s2[$x] = $s1[$x]
                EndIf
            Case $anytail
                $l2 = $x
                If $l1 > $l2 Then
                    $l1 = $l2
                EndIf
                ExitLoop
        EndSwitch
        $r[0][$x] = $x
    Next
    $r[0][$l2] = $l2
    For $x = 0 To $l1
        $r[$x][0] = $x
    Next
    For $x = 1 To $l1
        For $y = 1 To $l2
            $pen = Not ($s1[$x - 1] == $s2[$y - 1])
            $del = $r[$x-1][$y] + 1
            $ins = $r[$x][$y-1] + 1
            $subst = $r[$x-1][$y-1] + $pen
            If $del > $ins Then $del = $ins
            If $del > $subst Then $del = $subst
            $r[$x][$y] = $del
            If ($pen And $x > 1 And $y > 1 And $s1[$x-1] == $s2[$y-2] And $s1[$x-2] == $s2[$y-1]) Then
                If $r[$x][$y] >= $r[$x-2][$y-2] Then $r[$x][$y] = $r[$x-2][$y-2] + 1
                $r[$x-1][$y-1] = $r[$x][$y]
            EndIf
        Next
    Next
    Return ($r[$l1][$l2])
EndFunc



;; ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
;;             example usage
;; ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


#include <Array.au3>

Local $reference = "lexicographically"
Local $Words[11][2] = [ _
    [$reference], _
    ["Lexicôgraphicaly"], _
    ["lexkographicaly"], _
    ["Lexico9raphically"], _
    ["lexioo9raphically"], _
    ["Lexicographical"], _
    ["lexicographlcally"], _
    ["Lex1cogr@phically"], _
    ["lexic0graphïca1yl"], _
    ["lexIcOgraphically"], _
    ["Lexlcographically"] _
]

For $i = 0 To UBound($Words) - 1
    $Words[$i][1] = _Typos($Words[$i][0], $reference)
Next
_ArrayDisplay($Words, "Number of typos")

_ConsoleWrite("Usage of '_' and '%' wildcards in pattern:" & @LF & @TAB & "_Typos('lex1c0gr@fhlâofznho', 'LEx_c_gr%') = " & _Typos('lex1c0gr@fhlofznho', 'lex_c_gr%') & @LF)

; Unicode-aware ConsoleWrite
Func _ConsoleWrite($s)
    ConsoleWrite(BinaryToString(StringToBinary($s, 4), 1))
EndFunc   ;==>__ConsoleWrite

 

Library.csv Library.sq3

This wonderful site allows debugging and testing regular expressions (many flavors available). An absolute must have in your bookmarks.
Another excellent RegExp tutorial. Don't forget downloading your copy of up-to-date pcretest.exe and pcregrep.exe here
RegExp tutorial: enough to get started
PCRE v8.33 regexp documentation latest available release and currently implemented in AutoIt beta.

SQLitespeed is another feature-rich premier SQLite manager (includes import/export). Well worth a try.
SQLite Expert (freeware Personal Edition or payware Pro version) is a very useful SQLite database manager.
An excellent eBook covering almost every aspect of SQLite3: a must-read for anyone doing serious work.
SQL tutorial (covers "generic" SQL, but most of it applies to SQLite as well)
A work-in-progress SQLite3 tutorial. Don't miss other LxyzTHW pages!
SQLite official website with full documentation (may be newer than the SQLite library that comes standard with AutoIt)

Link to comment
Share on other sites

Oops, posted before I had written anything.

Seems awesome, thanks!!  I was still thinking over what I was gonna ask you and what would be ideal. A fuzzy search is surely part of it!

I will come back to this :)

Edited by Guy_
Link to comment
Share on other sites

In the future, please only quote smallest relevant part(s) of what was posted you have a question about.

No problem, chime when you have questions.

Edited by jchd

This wonderful site allows debugging and testing regular expressions (many flavors available). An absolute must have in your bookmarks.
Another excellent RegExp tutorial. Don't forget downloading your copy of up-to-date pcretest.exe and pcregrep.exe here
RegExp tutorial: enough to get started
PCRE v8.33 regexp documentation latest available release and currently implemented in AutoIt beta.

SQLitespeed is another feature-rich premier SQLite manager (includes import/export). Well worth a try.
SQLite Expert (freeware Personal Edition or payware Pro version) is a very useful SQLite database manager.
An excellent eBook covering almost every aspect of SQLite3: a must-read for anyone doing serious work.
SQL tutorial (covers "generic" SQL, but most of it applies to SQLite as well)
A work-in-progress SQLite3 tutorial. Don't miss other LxyzTHW pages!
SQLite official website with full documentation (may be newer than the SQLite library that comes standard with AutoIt)

Link to comment
Share on other sites

8 hours ago, Musashi said:

Could you please post such a list as .txt or .csv . It would be much easier if we had an already existing base of data.

I was just giving a clear to understand example. It could be documents, audio, etc.

Some (obscure) sites may share what you want. I know, for example, IMDB shares a database for all movies, actors, ...

Edited by Guy_
Link to comment
Share on other sites

56 minutes ago, jchd said:

No problem, chime when you have questions.

Maybe a few general questions and ponderings, keeping books and taking notes as an example. Nothing urgent here ...

(1) Since I prefer not to have Windows live search on, I had found this fast freebie with typeahead that you can set on some specified folders and that I could use instead — for example, to quickly check if I already have a book title.
https://antibody-software.com/web/software/software/wizfile-finds-your-files-fast/

Then I thought this was still too awkward and made a hotkey that copies a selected title text and checks against my array with book titles that I freshly load once per session. This loading of 8,000+ files on an 8-year-old PC is only taking between 1 to 3 seconds. After that, if I keep this array in RAM, the search results are immediate — even with an unsorted array. So it seems the speed is covered for most "regular folks" with say, under 50,000 filenames. My guess is that if not all titles are 260 chars, this array would just take a few MB of RAM?

So that mainly left a wish for a better search through the titles (and therefore this topic).

So I'm not targeting scientists who want this to work on a million research papers. But if they offered those files to me, I wouldn't mind if my program worked great for that too — so I guess that is where a database *has to* come in...?

BTW, another issue is that I don't trust my programming enough to cater to scientists that would want to depend on it. But for myself I'm ok to start learning and playing with this.

(2) I've loved programming for FileMaker some years of my life and once tried a MySQL/PHP website that got a bit too tedious for me. So I've never looked into it for AutoIt cos it scared me off a little. At the same time I saw it can be "very simple" and my needs are too (I think).
So I guess an advantage would also be is if I search on an author, the database can show me all kinds of files I have with him from different folders. Hm, I think my current system can do this too if I gather filenames from all involved folders...

So is an advantage more that it can be indexed faster or something? (OTOH, I could also save a sorted array to a file and reload that)

I seem to remember a day when every program started including a database and the seemingly many extra files it involved, and that I "hated" that...

I like that what my program may someday become is basically just one .exe and wondering if working with a database would require an install and dozens of extra files? Maybe it's just an extension as you say.

I will for sure already look into what you posted and glance over some of your sig links in the coming week :) In principle, I love databases and they can become too addicting. OTOH, most of my FileMaker databases I stopped using, and, for example, now favoring taking notes in WordPad instead (less RAM and less likely to get damaged).

WordPad does have its own issues after which I thought I might wanna start using LibreOffice, or anything slightly more advanced that can use styling as well. When I see that LibreOffice takes like 1 GB of RAM, I guess that's where it becomes interesting to write your own Evernote thing. The major advantage I see here is that I would find a hits report for a search throughout my notes, while in WordPad I would be searching under specific headings more, or regularly convert to pdf to be able to have an Acrobat search result list.

Also, for potentially sharing/selling, I think people feel safer with actual files they are familiar with (WordPad, Word, etc.), instead of feeling their data is "hidden in database files" somewhere and could get lost or damaged any second, or become inaccessible when the software goes down, etc. It's kind of my feeling at least, although I know these files are very simple text files, and you could probably turn them out in .rtf format at any time if you program that :) (I used to do that when I used FileMaker)

Link to comment
Share on other sites

FYI, an SQLite DB is just one file, not a herd of files, which is portable verbatim on any hardware/software platform you can think of. SQLite is zero installation and zero maintainance.

Also SQLite is by far the most widespread DB engine ever and one of the top most used software piece: https://www.sqlite.org/mostdeployed.html So you already use SQLite several times daily, without even noticing.

A DB file is by far more reliable that anything else, thanks to the ACID properties of DBs. SQLite file format has not changed since introduction of version 3, 2004-07-22, something very few software can be proud of. Yet there have been an enormous number of powerful features added to the library over the years. Given the "around 10¹²" implementations in current use all over the world, it's guaranteed that this format is going to last very long (authors says at least till 2050). SQLite is free: public-domain open source, just needs a C compiler to build from source. Standard SQLite DBs can reach 281 TB, more than you'll ever need!

All in all, an SQLite DB is ideal for storage, maintainance and query of any-sized dataset, with the added benefit of possible concurrency and powerful querying language. No other file format offers the same power, speed and flexibility.

Of course, there is always a little price to pay: learn to use the features you'll need, but that's what help foras are for, aren't they?

Edited by jchd

This wonderful site allows debugging and testing regular expressions (many flavors available). An absolute must have in your bookmarks.
Another excellent RegExp tutorial. Don't forget downloading your copy of up-to-date pcretest.exe and pcregrep.exe here
RegExp tutorial: enough to get started
PCRE v8.33 regexp documentation latest available release and currently implemented in AutoIt beta.

SQLitespeed is another feature-rich premier SQLite manager (includes import/export). Well worth a try.
SQLite Expert (freeware Personal Edition or payware Pro version) is a very useful SQLite database manager.
An excellent eBook covering almost every aspect of SQLite3: a must-read for anyone doing serious work.
SQL tutorial (covers "generic" SQL, but most of it applies to SQLite as well)
A work-in-progress SQLite3 tutorial. Don't miss other LxyzTHW pages!
SQLite official website with full documentation (may be newer than the SQLite library that comes standard with AutoIt)

Link to comment
Share on other sites

25 minutes ago, jchd said:

FYI, an SQLite DB is just one file, not a herd of files [...]

Thanks for all that!  Yeah, for myself ideas are already bubbling up. LIke seeing replies from just the author of my choice, etc. It could be my old FileMaker database but with more dependable files.

I'll probably build my "simple" version first and have peeks at this extension for a future version or burst of enthusiasm.

And first I'll dig into the code above a.s.a.p. :)

Link to comment
Share on other sites

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now
 Share

  • Recently Browsing   0 members

    • No registered users viewing this page.
×
×
  • Create New...