Directory Enquiries Challenge

Just now, czardas said:

Can they? I really doubt that. South American countries have some odd rules, but the last 7 or 8 digits remain unchanged.

That doesn't mean that the number you're searching for has been entered correctly to match what you're hoping to find ;-) and a rule for every country becomes cumbersome. Deviation can exist anywhere on the input side because of human error.


51 minutes ago, czardas said:

Typos.au3 is unsuitable for this. The method returns matches which are obviously wrong. It's an interesting idea, but fails terribly.

Do you mean that there are no errors in the list of queries? If so I agree that this kind of fuzzy search isn't the tool to use and that I misunderstood your goal. But then a regex would do.

But if you expect to match a query "local" or short number (no country code, no area code) against a real-world long list of actual numbers, then you'll get erroneous/misleading results as well.

I look at the problem as ill-posed: "I failed to know my data at the right time and I'm now facing an intractable mess". Phone numbers aren't given by Martians or a random lottery, they come from some source along with a meaning: they aren't data, but information. E.g. this is the phone number of one customer in Denmark, this other one is for a friend in India, and so on. Failing to turn the raw data (the series of digits and signs) into valid, usable information in the first place is the actual issue. The same applies to the query list: you're supposed to know where you live and distinguish between 123456 being the local UK number of your neighbour, 123456 being the middle part of the number of a company in Singapore and the same 123456 being the real local number of your aunt in Colorado.

With no information, just a pile of raw data, exact matches could possibly be "reliably" obtained by regexp, but they would be meaningless for actual processing.

Another big catch is that phone numbers continuously change over time all around the world. You must be some telco entity to track those changes reliably and adjust your list accordingly.

This wonderful site allows debugging and testing regular expressions (many flavors available). An absolute must have in your bookmarks.
Another excellent RegExp tutorial. Don't forget downloading your copy of up-to-date pcretest.exe and pcregrep.exe here
RegExp tutorial: enough to get started
PCRE v8.33 regexp documentation latest available release and currently implemented in AutoIt beta.

SQLitespeed is another feature-rich premier SQLite manager (includes import/export). Well worth a try.
SQLite Expert (freeware Personal Edition or payware Pro version) is a very useful SQLite database manager.
An excellent eBook covering almost every aspect of SQLite3: a must-read for anyone doing serious work.
SQL tutorial (covers "generic" SQL, but most of it applies to SQLite as well)
A work-in-progress SQLite3 tutorial. Don't miss other LxyzTHW pages!
SQLite official website with full documentation (may be newer than the SQLite library that comes standard with AutoIt)


20 minutes ago, jchd said:

You must be some telco entity to track those changes reliably and adjust your list accordingly.

That is a problem.

For the rest, you have no idea where the number originated. I thought I had made that clear in the first post. If someone travels around the globe, they may have contacts with duplicate phone number entries. On the other hand, if someone doesn't travel abroad, they won't include country codes (sometimes there is no area code). If these people use an application to search for a phone number, you could simply use StringInStr(). However, many people find phone numbers on the Internet and add them to databases etc... So there is a potential need to automate recognition. Naturally there will always be some false positives.
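
As a rough illustration of the simple StringInStr() search mentioned above, here is a minimal sketch. The function names _DigitsOnly and _FindNumber are made up for this example, and the second stored number is invented to show where a plain substring search stops working.

```autoit
; Hypothetical helpers: the names _DigitsOnly and _FindNumber are illustrative only.
Func _DigitsOnly($sNumber)
    ; keep digits only, dropping spaces, plus signs, brackets and dashes
    Return StringRegExpReplace($sNumber, "[^0-9]", "")
EndFunc   ;==>_DigitsOnly

Func _FindNumber($sQuery, $sStored)
    Local $sNeedle = _DigitsOnly($sQuery)
    If StringLen($sNeedle) = 0 Then Return False ; nothing to search for
    ; True when the query digits occur anywhere inside the stored digits
    Return StringInStr(_DigitsOnly($sStored), $sNeedle) > 0
EndFunc   ;==>_FindNumber

; local number against a stored international number from the challenge list
ConsoleWrite(_FindNumber('882 8565', '+44 207 882 8565') & @CRLF)
; national format with a leading 0 against an invented international entry
ConsoleWrite(_FindNumber('07843 543287', '+44 7843 543287') & @CRLF)
```

The first call should find the local number inside the longer one. The second should not, because the leading 0 of the national format is absent from the international form, which is the kind of mismatch that calls for something beyond StringInStr().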

Edited by czardas

But that's exactly my point!

Gathering partial (unspecified) numbers from around the globe in a database is just piling up meaningless data. Would you sensibly store 123456 in three distinct phonebook entries without associating them with some clues about location and number owner? If yes, how are you going to call your aunt if you're on holiday in Germany, or selling insurance contracts while in London?

It's just guesswork to me.


I have made some modifications to fix the false negative results in my previous code. @jchd is right in pointing out the impossibilities of being able to produce a 100% reliable match. It's impossible to differentiate between short country codes and long local area codes.

My code below does not differentiate between different countries: instead it only checks that country code formatting is the likely cause of a mismatch before returning the result as a potential match. In no way is it perfect, but it's also not so likely that many people will store international telephone numbers that will cause collision. Unless you are a telephone company, having the same subscriber number (in different countries) appearing in your contacts is likely to be a fluke coincidence that happens once in a lifetime. You can also see that they have different prefixes as soon as you check the search results.

MsgBox(0, "Malta +356", TelCompare('21 12345678', '0011 356 12345678'))

MsgBox(0, "Argentina +54", TelCompare('0 22 15 12345678', '010 54 9 22 12345678'))

MsgBox(0, "Hungary +36", TelCompare('06 12345678', '+36 12345678'))

Func TelCompare($sTelNum1, $sTelNum2, $iMinMatch = 3) ; Maximum Length = 25 probably
    ; get rid of typical delimiters
    $sTelNum1 = StringRegExpReplace($sTelNum1, '[ \+\(\)\-]', '')
    $sTelNum2 = StringRegExpReplace($sTelNum2, '[ \+\(\)\-]', '')
    If $sTelNum1 = $sTelNum2 Then Return True ; no need to go any further

    Local $iLen1 = StringLen($sTelNum1), $iLen2 = StringLen($sTelNum2)

    If $iLen2 < $iLen1 Then ; make $sTelNum1 the shorter number
        Local $vTemp = $iLen1
        $iLen1 = $iLen2
        $iLen2 = $vTemp

        $vTemp = $sTelNum1
        $sTelNum1 = $sTelNum2
        $sTelNum2 = $vTemp
    EndIf

    If $iLen1 <= $iMinMatch Then Return False ; insufficient information

    If StringRight($sTelNum1, $iMinMatch) <> StringRight($sTelNum2, $iMinMatch) Then Return False ; minimum match failed

    $sTelNum1 = StringReverse($sTelNum1) ; to simplify parsing later
    $sTelNum2 = StringReverse($sTelNum2) ; ditto

    ; the algorithm [international dialing codes all begin with zero]
    Local $sDigit1, $sDigit2
    For $i = $iMinMatch +1 To $iLen1
        $sDigit1 = StringMid($sTelNum1, $i, 1)
        $sDigit2 = StringMid($sTelNum2, $i, 1)

        If $sDigit1 <> $sDigit2 Then ; let's find out why
            Local $iOffSet = $iLen2 - $iLen1
            If $i = $iLen1 Then ; we have reached the first digit - test the first single digit omission theory with country codes (reversed)
                ; maybe omitted in $sTelNum2 or different international dialing code
                Return (StringRegExp($sDigit1, '[078]') And StringRegExp(StringRight($sTelNum2, $iOffSet +1), '(\d){1,3}(00|1100|010|110)?'))

            Else ; odd exceptions or differences in international dialing codes (reversed)
                If $i = $iLen1 -1 Then ; we have reached the penultimate digit
                    Local $sSub = StringRight($sTelNum1, 2)
                    If $sSub = '12' And StringRegExp(StringRight($sTelNum2, $iOffSet +2), '(653)(00|1100|010|110)?') Then Return True ; Malta +356

                    If $sSub = '22' And StringRegExp(StringRight($sTelNum2, $iOffSet +2), '(132)(00|1100|010|110)?') Then Return True ; Liberia +231
                EndIf

                ; check Latin American exceptions
                If StringRegExp($sTelNum2, '(75)(00|1100|010|110)?\z') And $i > 7 Then Return True ; Colombia +57
                If StringRegExp($sTelNum2, '(45|55)(00|1100|010|110)?\z') And $i > 8 Then Return True ; Argentina +54, Brazil +55
                If StringRegExp($sTelNum2, '(25)(00|1100|010|110)?\z') And $i > 10 Then Return True ; Mexico +52

                ; check for international dialing code discrepancies (reversed)
                If StringRegExp($sDigit1, '[01]') Or StringRegExp($sDigit2, '[01]') Then
                    $sTelNum1 = StringRight($sTelNum1, $iLen1 - $i)
                    $sTelNum2 = StringRight($sTelNum2, $iLen2 - $i)
                    Return (StringRegExp($sTelNum1, '\A(0|00|10|100)\z') And StringRegExp($sTelNum2, '\A(0|00|10|100)\z'))
                EndIf
            EndIf
        EndIf
    Next

    Return True
EndFunc ;==> TelCompare

The code is based on the information I posted earlier (which may be subject to change).
http://www.onesimcard.com/how-to-dial/
I haven't thoroughly tested it yet. It shouldn't miss any possible matches now. 

Edit: Although working for the examples given, this code is based on some false assumptions.

Edited by czardas
Removed 1 line of code + small modification.

; Input check for valid phone numbers
; Documentation about phone number conventions: https://en.wikipedia.org/wiki/National_conventions_for_writing_telephone_numbers
#include <array.au3>

Opt ('MustDeclareVars', 1)

Local $findnumb = _
    ['882 8565','123 8762','7543010','07843 543287','00441619346534','+44208','0015417543012']

Local $strPhoneNumber, $intPhoneNumberLength
For $i = 0 to UBound ( $findnumb ) - 1
    $strPhoneNumber = StringRegExpReplace($findnumb[$i],"[^0-9]","") ; keep digits only

    ; * don't remove any starting 0 digit. They are used in some City Codes *

    $intPhoneNumberLength = StringLen ( $strPhoneNumber )
    Switch ( $intPhoneNumberLength )
        Case 12  ; could be extended to 15: a maximum of 15 digits is reserved for use. Not sure if any country uses numbers longer than 12 digits at the moment.
            ConsoleWrite ( $strPhoneNumber & " [Valid Telephone Number with Country and City code]" & @CRLF )
        Case 10
            ; Twenty-four countries and territories share the North American Numbering Plan (NANP), with a single country code. It is a closed
            ; telephone numbering plan in which all telephone numbers consist of 10 digits, with the first three digits representing the area code
            ConsoleWrite ( $strPhoneNumber & " [Valid Telephone Number with City Code]" & @CRLF )
        ;Case 9
        ;   ; Belgian telephone numbers: Land lines are always 9 digits long
        ;   ConsoleWrite ( $strPhoneNumber & " [Valid Belgian Telephone Number With City Code]" & @CRLF )
        ;Case 8
        ;   ; Danish telephone numbers are eight digits long
        ;   ConsoleWrite ( $strPhoneNumber & " [Valid Danish Telephone Number]" & @CRLF )
        Case 7
            ; 7-digit numbers: Most codes retain these rules today; in these areas, phone numbers continue to be written as 7-digit numbers
            ConsoleWrite ( $strPhoneNumber & " [Valid Local Telephone Number]" & @CRLF )
        ;Case 6
        ;   ; Hungary the standard lengths for area codes is two / Subscribers' numbers are six digits long
        ;   ConsoleWrite ( $strPhoneNumber & " [Valid Hungary Local Telephone Number]" & @CRLF )
        Case 3
            ConsoleWrite ( $strPhoneNumber & " [Valid Service Number]" & @CRLF )
        Case Else
            ConsoleWrite ( $strPhoneNumber & " [Invalid Phone Number]" & @CRLF )
    EndSwitch
Next

Nice to see all the good ideas you people come up with. Although I don't have much programming time, the whole idea of phone number checking sounds great to me. This morning I thought about it a little more and concluded that, since we don't have a UDF with all country and city codes and conventions, it's perhaps best to stick with some simple string length checking.


2 hours ago, pluto41 said:

Nice to see all the good ideas you people come up with.

I absolutely agree. This is one of five search algorithms I am implementing in a program of mine. The others are ebay (word sequence) type searches and string type (exact or part string) searches. For my own purposes, these combined search algorithms will find anything I might ever type and this constitutes the final piece of the puzzle.

I'm not partaking in the challenge, but I'll leave this a while for anyone else who wants to have a try. Your comments and ideas are invaluable to me and often quite entertaining. Those who wanted, or expected, a fool proof solution are naturally going to be disappointed. The challenge is to find the best approach - nothing more and nothing less. I'll look at every entry and ask someone to pick what they think is the most inventive solution. :) Failing that I'll pick one myself.

The number of times I write phone numbers on scraps of paper is so annoying. People also give me numbers all the time. Now I can clear my drawer full of scrap paper without duplicating anything I might already have logged. No need to worry about number format (however rough and ready the solution might be). :thumbsup:

It should be borne in mind that 99.9% of the world's population do not have a systematic way to type phone numbers, nor do they even know what a regular expression is.

Edited by czardas

@pluto41 I didn't check every result from your last post; however, these are valid number formats:

07843543287 [Invalid Phone Number] ==> Actually this looks like a UK mobile number
00441619346534 [Invalid Phone Number] ==> Actually this looks like someone calling the UK from within Europe

 

Edited by czardas

That is correct @czardas, the switch statement I made is incomplete. It was merely showing another approach to number checking, using the KIS principle (Keep It Simple). That's also why I commented out some lines. If I were to use the code in production, I think I would include all country rules in the switch statement. When production requirements are really high, I would (personally) create an array for every country and every city that exists.

As I live in the Netherlands, I would start with some arrays, something like this:

Netherlands = +31
CityName1 = 051
CityName2 = 038
...
..
LandLineLength = 8
MobileNumberPrefixLength1 =  06
MobileNumberSuffix = 8 chars

This then has to be done in a consistent way for every country / city, including all exceptions [a hell of a job] :) So that's exactly the reason why I thought KIS and wrote some example code in that direction. Again, it's merely an approach and I think it depends on the requirements which way to go.
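
A minimal sketch of what such a per-country rule set could look like in AutoIt, using only the placeholder values listed above (they are not verified numbering-plan data). The variable and function names are hypothetical. The land line length rule is left out, because the list above does not say whether the 8 includes the area code.

```autoit
; Placeholder rules for one country, copied from the list above (not verified data).
Global $g_aAreaCodes[2] = ['051', '038']
Global $g_sMobilePrefix = '06'
Global $g_iMobileSuffixLen = 8

; Hypothetical helper: classifies a number typed in national format.
Func _ClassifyDutchNumber($sNumber)
    Local $sDigits = StringRegExpReplace($sNumber, "[^0-9]", "") ; keep digits only
    Local $iPrefixLen = StringLen($g_sMobilePrefix)

    ; mobile: the mobile prefix followed by exactly $g_iMobileSuffixLen digits
    If StringLeft($sDigits, $iPrefixLen) == $g_sMobilePrefix And _
            StringLen($sDigits) = $iPrefixLen + $g_iMobileSuffixLen Then Return "mobile"

    ; land line: the number starts with one of the known area codes
    For $i = 0 To UBound($g_aAreaCodes) - 1
        If StringLeft($sDigits, StringLen($g_aAreaCodes[$i])) == $g_aAreaCodes[$i] Then Return "land line"
    Next

    Return "unknown"
EndFunc   ;==>_ClassifyDutchNumber

; made-up example numbers
ConsoleWrite(_ClassifyDutchNumber('06 12345678') & @CRLF)
ConsoleWrite(_ClassifyDutchNumber('038 1234567') & @CRLF)
ConsoleWrite(_ClassifyDutchNumber('020 1234567') & @CRLF)
```

The third number falls through to "unknown" only because its area code is not in the two-entry placeholder array, which shows how much data a complete table would need.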


Uitstekend! I had to use Google to check my spelling. LOL

Checking validity is interesting. :)

I believe you will find a window of ambiguity: a certain range (number of digits) where uncertainty can't be eliminated. The question is - how large is the gap? Mexican mobiles and land lines contain at least 10 digits (so I believe). I think validity checking is a lot tougher challenge.

Edited by czardas

It seems there are two lines of reasoning for finding telephone numbers. The first is to find only exact matches of the reference numbers, or to match the numbers within a larger number (even if reference numbers are mistyped or incorrect). The second is to define every type of phone number via rules and a table of known area, national, regional, etc. calling codes, without looking for the reference numbers.

I've adjusted my code, based on the original request of the challenge, to find all the reference numbers and, from what I gathered, to find the most similar matches up to a certain similarity limit (the whole "can have false positives"). One issue I noticed was that the search for +44208... doesn't match ANYTHING unless you suppose that the numbers 08000225649 and 08457128276 are supposed to have +442 added to the beginning. I also added a piece that matches the original reference number against the leftmost characters of the database numbers. It removes one character at a time from the left of the reference number and tries to match what remains against the beginning of the database number. The reference number must be shorter than a certain string length for this process to be used. I may do something similar in reverse, but haven't found it necessary yet.

#include <Array.au3>
#include "typos.au3"

#cs looking for
882 8565
123 8762
7543010
07843 543287
00441619346534
+44208.....missing numbers [optional task]
44208
   0800275002 ; too short, japan local?
   08000225649 ; 11 chars
   08457128276 ; 11 chars
0015417543012
#ce

GLOBAL $refNumT
Local $aArray = _
    ['+262 692 12 03 00', '1800 251 996',    '+1 994 951 0197', _
    '091 535 98 91 61',   '2397865',         '08457 128276', _
    '348476300192',       '05842 361774',    '0-800-022-5649', _
    '15499514891',        '0096 363 0949',   '04813137349', _
    '06620 220168',       '07766 554433',    '047 845 44 22 94', _
    '0435 773 4859',      '(01) 882 8565',   '00441619346434', _
    '09314 367090',       '0 164 268 0887',  '0590995603', _
    '991',                '0267 746 3393',   '064157526153', _
    '0 719 829 7756',     '+1-541-754-3012', '+441347543010', _
    '03890 978398',       '(31) 10 7765420', '020 8568 6646', _
    '0161 934 6534',      '0 637 915 1283',  '+44 207 882 8565', _
    '0800 275002',        '0750 646 9746',   '982-714-3119', _
    '000 300 74 52 40',   '023077529227',    '1 758 441 0611', _
    '0183 233 0151',      '02047092863',     '+44 20 7946 0321', _
    '04935 410618',       '048 257 67 60 79']


Local $findnumb = _
    ['882 8565','123 8762','7543010','07843 543287','00441619346534','+44208.....missing numbers [optional task]','0015417543012']
Consolewrite('---------------------------------------------------------------------------------------------------------------------'& @CRLF)
Consolewrite('---------------------------------------------------------------------------------------------------------------------'& @CRLF)

For $i = 0 to Ubound($findnumb)-1 ; find these numbers!
    $reference = StringRegExpReplace($findnumb[$i],"[^0-9]","") ; Sanitize Numbers

    For $a = 0 to ubound($aArray)-1
        GLOBAL $m = 0
        $dbnumbers = StringRegExpReplace($aArray[$a],"[^0-9]","") ; Sanitize Numbers
        $refNumT = $reference
        if $reference = $dbnumbers Then
            Consolewrite('> Reference Phone Number --] '& $findnumb[$i] & ' [--'& @CRLF)
            Consolewrite('+> ^ Exact Match to --] '& $aArray[$a] & ' [-- row '& $a & @CRLF)
        EndIf
        IF StringLen($reference) < 7 then ; Find Partial Match at beginning of the database number
            Do
                IF StringLeft($dbnumbers,StringLen($refNumT)) = $refNumT then
                    Consolewrite('> Reference Phone Number --] '& $findnumb[$i] & ' [-- using last '& StringLen($refNumT) & ' digits'& @CRLF)
                    consolewrite('+> ^ Partial Match, matching first '& StringLen($refNumT)&' numbers of --] ' & $aArray[$a] & ' [-- row '& $a & @CRLF)
                EndIf
                $refNumT = StringTrimLeft($refNumT,1)
            Until StringLen($refNumT) = 1 OR StringLen($dbnumbers) = 10
        endif
        if StringInStr($dbnumbers,$reference) then ; Find Partial Match within the numbers database
            Consolewrite('> Reference Phone Number --] '& $findnumb[$i] & ' [--'& @CRLF)
            Consolewrite('+> ^ Partial Match, within larger number --] '& $aArray[$a] & ' [-- row '& $a & @CRLF)
            ;ContinueLoop
        EndIf
        $typos = _Typos($dbnumbers, $reference) ; Find Similar numbers based on limits
        $stringlen = Stringlen($dbnumbers) / StringLen($reference)
        $similarity = Stringleft(100-($stringlen*$typos),6)
        IF $similarity > 97.5 then
            Consolewrite('> Reference Phone Number --] '& $findnumb[$i] & ' [--'& @CRLF)
            consolewrite('+> ^ Similarity Match, '& $similarity &'% similar to number --] '& $aArray[$a] &' [-- row '& $a & @CRLF)
        EndIf
    Next
Next

Consolewrite('---------------------------------------------------------------------------------------------------------------------'& @CRLF)
Consolewrite('---------------------------------------------------------------------------------------------------------------------'& @CRLF)

My output looks like this:

---------------------------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------------------------
> Reference Phone Number --] 882 8565 [--
+> ^ Partial Match, within larger number --] (01) 882 8565 [-- row 16
> Reference Phone Number --] 882 8565 [--
+> ^ Partial Match, within larger number --] +44 207 882 8565 [-- row 32
> Reference Phone Number --] 7543010 [--
+> ^ Partial Match, within larger number --] +441347543010 [-- row 26
> Reference Phone Number --] 00441619346534 [--
+> ^ Similarity Match, 99% similar to number --] 00441619346434 [-- row 17
> Reference Phone Number --] 00441619346534 [--
+> ^ Similarity Match, 97.642% similar to number --] 0161 934 6534 [-- row 30
> Reference Phone Number --] +44208.....missing numbers [optional task] [-- using last 2 digits
+> ^ Partial Match, matching first 2 numbers of --] 08457 128276 [-- row 5
> Reference Phone Number --] +44208.....missing numbers [optional task] [-- using last 2 digits
+> ^ Partial Match, matching first 2 numbers of --] 0-800-022-5649 [-- row 8
> Reference Phone Number --] 0015417543012 [--
+> ^ Similarity Match, 98.307% similar to number --] +1-541-754-3012 [-- row 25
---------------------------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------------------------

 

Edited by stamandster

I'll leave this open until Thursday UK time. That gives three more days for any late entries. Perhaps you can come up with a new approach or improve on the ideas put forward already. One thing is for certain: there are some talented individuals around here and so far the discussion has been of value in many ways. I am constantly learning new things from you people! ;)
