Biatu

Longest Common Substring help

12 posts in this topic

Hello I am trying to find the LCS in an array of strings.

Goal: To traverse the array, comparing string against string, and extracting similar groups of words, then single words given that they are longer than the "~!SubStrID", being be a number starting with "~!" to point to the substitute.

Problem: I keep getting too much junk, or parts of words, or numbers, and not full sub strings/words, etc.

Here is the code:

#Include <Array.au3>
Local $aCommons[0], $vTest
$vTest=StringSplit(FileRead("Strings.txt"),@CRLF,1);@LF Only from now on!
For $i=1 To $vTest[0]
    For $j=1 To $vTest[0]
        If $i=$j Then ContinueLoop
        $sCommon=_LCS($vTest[$i],$vTest[$j])
        If StringInStr($sCommon,"~!") Then ContinueLoop
        $iMax=UBound($aCommons,1)
        For $k=0 To $iMax-1
            If $sCommon=$aCommons[$k] Then ContinueLoop 2
        Next
        ReDim $aCommons[$iMax+1]
        $aCommons[$iMax]=$sCommon
        $vTest[$j]=StringReplace($vTest[$j],$sCommon,"~!SubStrID")
    Next
Next
_ArrayDisplay($aCommons)

Func _LCS($sStringA,$sStringB)
    Local $sTmp,$sRet
    For $i=1 To StringLen($sStringA)
        $sTmp&=StringMid($sStringA,$i,1)
        If StringInStr($sStringB,$sTmp) Then
            If StringLen($sTmp)>StringLen($sRet) Then
                $sRet=$sTmp
            EndIf
        Else
            $sTmp=""
        EndIf
    Next
    Return $sRet
EndFunc

 


What is what? What is what.

Share this post


Link to post
Share on other sites



Can you post a significant sample of the strings you have to process?


This wonderful site allows debugging and testing regular expressions (many flavors available). An absolute must have in your bookmarks.
Another excellent RegExp tutorial. Don't forget downloading your copy of up-to-date pcretest.exe and pcregrep.exe here
RegExp tutorial: enough to get started
PCRE v8.33 regexp documentation latest available release and currently implemented in AutoIt beta.

SQLitespeed is another feature-rich premier SQLite manager (includes import/export). Well worth a try.
SQLite Expert (freeware Personal Edition or payware Pro version) is a very useful SQLite database manager.
An excellent eBook covering almost every aspect of SQLite3: a must-read for anyone doing serious work.
SQL tutorial (covers "generic" SQL, but most of it applies to SQLite as well)
A work-in-progress SQLite3 tutorial. Don't miss other LxyzTHW pages!
SQLite official website with full documentation (may be newer than the SQLite library that comes standard with AutoIt)

Share this post


Link to post
Share on other sites

Here are 2 samples, and there will be like hundreds of lists that will need processing. 
StringsA.txt
StringsB.txt


What is what? What is what.

Share this post


Link to post
Share on other sites

@Biatu, are you trying to detect which string in list A also exists in string B (i.e. find the intersection)? this is by far easier than LCS - and in general, LCS is not the most efficient way when working with sorted lists.

Share this post


Link to post
Share on other sites

The string pool will probably be fairly large, given the content of StringsB.txt for instance. Doing a comparison character by character, string after string in such a huge list of all notebook models (just that) will take forever.

Instead and if what you're going to do is match one input string with one or more entries in the huge list of "reference strings", if I were you I'd use a 3-step approach.

0- Before anything else you need to build an SQLite database of all reference strings, possibly categorized to avoid mixing peripheral names and notebook/desktop models. I bet you aren't searching completely blindly. Be sure to add an FTS (Full Text Search) table of the text part. You might have to remove special characters to keep only alphanumeric, spaces and possibly hyphens and such. All this can be done automagically.

1- For every input string, put it in a separate table along its own FTS table. Using the suitable FTS tokenizer you can end up with input strings broken down into only significant words, just like the reference strings are.

2- Querying the input FTS table you get a list of words in the input string, which you can then use to build up a query in the reference FTS table, finally retrieving the candidate reference string(s) matching at least the same list of words as the input string.

All this may seem a little bit complex, but in practice and once the relevant queries are set up and debugged, things will be extremely fast.

3- Once you have a list of reference strings matching the words in the input string, you can go the pedestrian way to decide which one is a/the best match of the input. You can certainly display a listview and let the user choose, or implement some best match algorithm suitable to your actual context.

All in all I believe this will reveal the most efficient approach, at least with the kind of examples you've shown and the volume implied by the nature of the second example file (hundreds of such is no problem).

The standard SQLite DLLs include FTS4 which offers a number of features (http://www.sqlite.org/fts3.html).

The new FTS5 extension is very fast and even more powerful (http://www.sqlite.org/fts5.html#full_text_query_syntax).

1 person likes this

This wonderful site allows debugging and testing regular expressions (many flavors available). An absolute must have in your bookmarks.
Another excellent RegExp tutorial. Don't forget downloading your copy of up-to-date pcretest.exe and pcregrep.exe here
RegExp tutorial: enough to get started
PCRE v8.33 regexp documentation latest available release and currently implemented in AutoIt beta.

SQLitespeed is another feature-rich premier SQLite manager (includes import/export). Well worth a try.
SQLite Expert (freeware Personal Edition or payware Pro version) is a very useful SQLite database manager.
An excellent eBook covering almost every aspect of SQLite3: a must-read for anyone doing serious work.
SQL tutorial (covers "generic" SQL, but most of it applies to SQLite as well)
A work-in-progress SQLite3 tutorial. Don't miss other LxyzTHW pages!
SQLite official website with full documentation (may be newer than the SQLite library that comes standard with AutoIt)

Share this post


Link to post
Share on other sites

@orbs No, those are separate lists and I am looking to get the LCS between all the entries in the list. Basically I'm trying to reduce the file size of each list by replacing commons strings first, then common words with pointers to another list index. 
@jchd That's a good approach, however I cannot use SQLite, this script needs to have no dependencies, and I'm not too worried about it being slow.

 

 


What is what? What is what.

Share this post


Link to post
Share on other sites

You can always fileinstall the DLL, or inline it in binary like it was in old AutoIt versions and even run it direct from memory.

Anyway I whish you best of luck.


This wonderful site allows debugging and testing regular expressions (many flavors available). An absolute must have in your bookmarks.
Another excellent RegExp tutorial. Don't forget downloading your copy of up-to-date pcretest.exe and pcregrep.exe here
RegExp tutorial: enough to get started
PCRE v8.33 regexp documentation latest available release and currently implemented in AutoIt beta.

SQLitespeed is another feature-rich premier SQLite manager (includes import/export). Well worth a try.
SQLite Expert (freeware Personal Edition or payware Pro version) is a very useful SQLite database manager.
An excellent eBook covering almost every aspect of SQLite3: a must-read for anyone doing serious work.
SQL tutorial (covers "generic" SQL, but most of it applies to SQLite as well)
A work-in-progress SQLite3 tutorial. Don't miss other LxyzTHW pages!
SQLite official website with full documentation (may be newer than the SQLite library that comes standard with AutoIt)

Share this post


Link to post
Share on other sites
5 hours ago, Biatu said:

Basically I'm trying to reduce the file size of each list by replacing commons strings first, then common words with pointers to another list index. 

This is a very interesting problem. I have pondered it myself: not with words but with binary. The problem is that it takes forever.

1 person likes this

Share this post


Link to post
Share on other sites

Guys, you're rediscovering compression algorithms.

2 people like this

This wonderful site allows debugging and testing regular expressions (many flavors available). An absolute must have in your bookmarks.
Another excellent RegExp tutorial. Don't forget downloading your copy of up-to-date pcretest.exe and pcregrep.exe here
RegExp tutorial: enough to get started
PCRE v8.33 regexp documentation latest available release and currently implemented in AutoIt beta.

SQLitespeed is another feature-rich premier SQLite manager (includes import/export). Well worth a try.
SQLite Expert (freeware Personal Edition or payware Pro version) is a very useful SQLite database manager.
An excellent eBook covering almost every aspect of SQLite3: a must-read for anyone doing serious work.
SQL tutorial (covers "generic" SQL, but most of it applies to SQLite as well)
A work-in-progress SQLite3 tutorial. Don't miss other LxyzTHW pages!
SQLite official website with full documentation (may be newer than the SQLite library that comes standard with AutoIt)

Share this post


Link to post
Share on other sites

#10 ·  Posted

Everything I come up with turns out that way. ;)

1 person likes this

Share this post


Link to post
Share on other sites

#11 ·  Posted

here is a quick attempt in a brute force way.
It will find all possible substrings of the passed words and will store results in an SQLite table. Then will extract with an sql query the most recurring substrings.
This is just a quick draft. Could it be a way to go?
note: to use this script you have to save the Sqlite3.dll in the same path of the script.

#include <array.au3>
#include <MsgBoxConstants.au3>
#include <SQLite.au3>
; #include <SQLite.dll.au3>
Local $aResult, $iRows, $iColumns, $iRval


Local $sString[19] = ["January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
; Local $sString = FileReadToArray(@ScriptDir & "\StringsA.txt") ; <--- use this to load a list of words from a text file.
; _ArrayDisplay($sString, "Readed file")

_SQLite_Startup(@ScriptDir & "\sqlite3.dll")
If @error Then
    MsgBox($MB_SYSTEMMODAL, "SQLite Error", "SQLite3.dll Can't be Loaded!")
    Exit -1
EndIf
ConsoleWrite("_SQLite_LibVersion=" & _SQLite_LibVersion() & @CRLF)
Local $sDbName = @ScriptDir & "\SubStrings.sqlite"
Local $hDskDb = _SQLite_Open() ; Open a temp disk database
If @error Then
    MsgBox($MB_SYSTEMMODAL, "SQLite Error", "Can't open or create a permanent Database!")
    Exit -1
EndIf
_SQLite_Exec(-1, "Create table tblSubstrings (substring, stringID int);")

; The following nested loops will generate any possible substring out of the passed string
; and results are stored into the sqlite archive for later analysis
For $n = 0 To UBound($sString) - 1
    ; save the whole string
    _SQLite_Exec(-1, "Insert into tblSubstrings values ('" & $sString[$n] & "'," & $n & ");")
    If StringLen($sString[$n]) > 1 Then
        For $x = 2 To StringLen($sString[$n]) - 1
            For $i = 1 To StringLen($sString[$n]) - $x + 1
                ; ConsoleWrite(StringMid($sString[$n], $i, $x) & @CRLF)
                _SQLite_Exec(-1, "Insert into tblSubstrings values ('" & StringMid($sString[$n], $i, $x) & "'," & $n & ");")
            Next
        Next
    EndIf
Next

; Query
; this query extract all the substrings ordered from the most common to the least recurring
$iRval = _SQLite_GetTable2d(-1, "SELECT substring, COUNT(substring) AS subs_sum FROM tblSubstrings GROUP BY substring HAVING subs_sum>1 ORDER BY subs_sum DESC;", $aResult, $iRows, $iColumns)

If $iRval = $SQLITE_OK Then

_ArrayDisplay($aResult)

Else
    MsgBox($MB_SYSTEMMODAL, "SQLite Error: " & $iRval, _SQLite_ErrMsg())
EndIf

; close sqlite
_SQLite_Close($hDskDb)
_SQLite_Shutdown()

 

1 person likes this

small minds discuss people average minds discuss events great minds discuss ideas.... and use AutoIt....

Share this post


Link to post
Share on other sites

#12 ·  Posted

1 hour ago, Chimp said:

here is a quick attempt in a brute force way.
It will find all possible substrings of the passed words and will store results in an SQLite table. Then will extract with an sql query the most recurring substrings.
This is just a quick draft. Could it be a way to go?
note: to use this script you have to save the Sqlite3.dll in the same path of the script.

#include <array.au3>
#include <MsgBoxConstants.au3>
#include <SQLite.au3>
; #include <SQLite.dll.au3>
Local $aResult, $iRows, $iColumns, $iRval


Local $sString[19] = ["January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
; Local $sString = FileReadToArray(@ScriptDir & "\StringsA.txt") ; <--- use this to load a list of words from a text file.
; _ArrayDisplay($sString, "Readed file")

_SQLite_Startup(@ScriptDir & "\sqlite3.dll")
If @error Then
    MsgBox($MB_SYSTEMMODAL, "SQLite Error", "SQLite3.dll Can't be Loaded!")
    Exit -1
EndIf
ConsoleWrite("_SQLite_LibVersion=" & _SQLite_LibVersion() & @CRLF)
Local $sDbName = @ScriptDir & "\SubStrings.sqlite"
Local $hDskDb = _SQLite_Open() ; Open a temp disk database
If @error Then
    MsgBox($MB_SYSTEMMODAL, "SQLite Error", "Can't open or create a permanent Database!")
    Exit -1
EndIf
_SQLite_Exec(-1, "Create table tblSubstrings (substring, stringID int);")

; The following nested loops will generate any possible substring out of the passed string
; and results are stored into the sqlite archive for later analysis
For $n = 0 To UBound($sString) - 1
    ; save the whole string
    _SQLite_Exec(-1, "Insert into tblSubstrings values ('" & $sString[$n] & "'," & $n & ");")
    If StringLen($sString[$n]) > 1 Then
        For $x = 2 To StringLen($sString[$n]) - 1
            For $i = 1 To StringLen($sString[$n]) - $x + 1
                ; ConsoleWrite(StringMid($sString[$n], $i, $x) & @CRLF)
                _SQLite_Exec(-1, "Insert into tblSubstrings values ('" & StringMid($sString[$n], $i, $x) & "'," & $n & ");")
            Next
        Next
    EndIf
Next

; Query
; this query extract all the substrings ordered from the most common to the least recurring
$iRval = _SQLite_GetTable2d(-1, "SELECT substring, COUNT(substring) AS subs_sum FROM tblSubstrings GROUP BY substring HAVING subs_sum>1 ORDER BY subs_sum DESC;", $aResult, $iRows, $iColumns)

If $iRval = $SQLITE_OK Then

_ArrayDisplay($aResult)

Else
    MsgBox($MB_SYSTEMMODAL, "SQLite Error: " & $iRval, _SQLite_ErrMsg())
EndIf

; close sqlite
_SQLite_Close($hDskDb)
_SQLite_Shutdown()

I'll check it out, thanks guys.

 


What is what? What is what.

Share this post


Link to post
Share on other sites

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!


Register a new account

Sign in

Already have an account? Sign in here.


Sign In Now