_SortByDateFuzzy

czardas · August 9, 2014

Fed up of so many confusing formats?

This is something for my text editor which can be used for ordering disorganized lines, with dates in them, into a timeline. The function _SortByDateFuzzy() attempts to put lines of text into sequential order using fuzzy date recognition. Default endianness is Little Endian (day/month/year sequence). The other two are of course very common: see link ==> http://en.wikipedia.org/wiki/Date_format_by_country

;

#include <Array.au3>

; #FUNCTION# ====================================================================================================================
; Name...........: _SortByDateFuzzy
; Description ...: Attempts to sort lines by date in multiline formated text such as CSV or similar.
; Syntax.........: _SortByDateFuzzy($sMultiline, [$Endianness = 1 [, $sLineBreak = @CRLF ]])
; Parameters ....: $sMultiline - multiline formated text
;                ; $Endianness - Default is day/month/year, set to 2 for month/day/year, set to 3 for year/month/day
;                ; $sLineBreak - Default = CRLF. Only vertical space characters are allowed.
; Return values .: Success - Returns the text lines in a sequential timeline.
;                  Failure - Sets @error to the following:
;                  @error = 1 No input exists.
;                  @error = 2 Endianness not recognized.
;                  @error = 3 Incorrect linebreak characters detected.
; Author ........: czardas
; Modified.......:
; Remarks .......: This function looks for the first pattern it finds that looks like a date in each line.
;                  Days and months can be 1 or 2 digits, and years can be 2 or 4 digits.
;                  Two digit years are interpreted as being within the current century.
;                  The following 18 date formats (used in various countries worldwide) are supported:
;                  Little Endian 'd/m/y', 'd-m-y', 'd.m.y', 'd m y', 'd.m-y',   'd. m y'
;                  Middle Endian 'm/d/y', 'm-d-y', 'm.d.y', 'm d y', 'm d, y',  'm. d, y'
;                  Big Endian    'y/m/d', 'y-m-d', 'y.m.d', 'y m d', 'y. m. d', 'y. m d.'
;                  Lines not containing a recognisable date format are dumped at the end in the same order they are encountered.
;                  Limitation - If multiple formats occur in the same line, dates with two identical delimiters get priority.
; ===============================================================================================================================

Func _SortByDateFuzzy($sMultiline, $Endianness = 1, $sLineBreak = @CRLF)
    If Not $sMultiline Then Return SetError(1) ; Invalid input.

    If $Endianness = Default Or $Endianness = "little endian" Then ; day/month/year
        $Endianness = 1
    ElseIf $Endianness = "middle endian" Then ; month/day/year
        $Endianness = 2
    ElseIf $Endianness = "big endian" Then ; year/month/day
        $Endianness = 3
    EndIf

    If $sLineBreak = Default Then $sLineBreak = @CRLF
    If $sLineBreak = "" Or StringRegExp($sLineBreak, "\V") Then Return SetError(3) ; $sLineBreak must only contain vertical space characters.

    Local $iBound = 3 + ($Endianness = 1)
    Local $sReplace, $aPattern[$iBound]
    $aPattern[0] = $iBound -1

    Switch $Endianness
        Case 1 ; Little Endian
            $aPattern[1] = '(?U)(.*)(\b\d{1,2})([\./ \-])(\d{1,2})(\g3)(\d{4}|\d{2}\b)(.*\z)' ; Delimiter repeats
            $aPattern[2] = '(?U)(.*)(\b\d{1,2})(\-)(\d{1,2})(\.)(\d{4}|\d{2}\b)(.*\z)' ; d-m.y
            $aPattern[3] = '(?U)(.*)(\b\d{1,2})(\. )(\d{1,2})( )(\d{4}|\d{2}\b)(.*\z)' ; d. m y

            $sReplace = "$6-$4-$2"
        Case 2 ; Middle Endian
            $aPattern[1] = '(?U)(.*)(\b\d{1,2})([\./ \-])(\d{1,2})(\g3)(\d{4}|\d{2}\b)(.*\z)' ; Delimiter repeats (same as above)
            $aPattern[2] = '(?U)(.*)(\b\d{1,2})(\. | )(\d{1,2})(, )(\d{4}|\d{2}\b)(.*\z)' ; d. m, y | d m, y

            $sReplace = "$6-$2-$4"
        Case 3 ; Big Endian
            $aPattern[1] = '(?U)(.*)(\b\d{4}|\b\d{2})([\./ \-]|\. )(\d{1,2})(\g3)(\d{1,2}\b)(.*\z)' ; Delimiter repeats
            $aPattern[2] = '(?U)(.*)(\b\d{4}|\b\d{2})(\. )(\d{1,2})( )(\d{1,2})(\..*\z)' ; y. m d.

            $sReplace = "$2-$4-$6"
        Case Else
            Return SetError(2) ; $Endianness must be a number between 1 and 3.
    EndSwitch

    Local $aEntries = StringSplit($sMultiline, $sLineBreak, 1) ; Whole delimiter string is used.

    Local $sDate, $aDateSplit, $aDateColumn[$aEntries[0] +1][2] ; A date sorting column is needed.
    For $i = 1 To $aEntries[0]
        $aDateColumn[$i][1] = $aEntries[$i]

        For $j = 1 To $aPattern[0]
            $sDate = StringRegExpReplace($aEntries[$i], $aPattern[$j], $sReplace, 1) ; Convert all to Big Endian

            If @extended Then ; Check replacements were made.
                $aDateSplit = StringSplit($sDate, "-", 2)
                $aDateSplit[0] = StringRight(StringLeft(@YEAR, 2) & $aDateSplit[0], 4) ; Accomodate 2 digit year

                If __ValidateDate($aDateSplit[2], $aDateSplit[1], $aDateSplit[0]) Then
                    $aDateColumn[$i][0] = $aDateSplit[0] & _ ; Four digit year
                    StringRight("0" & $aDateSplit[1], 2) & _ ; Accomodate single digit month
                    StringRight("0" & $aDateSplit[2], 2) ; Accomodate single digit day

                    ; Assuming either a single date or consistancy of format within a single line.
                    ContinueLoop 2 ; See limitations
                EndIf
            EndIf
        Next

        $aDateColumn[$i][0] = "z" & StringRight("000000" & $i, 7) ; Prevent sorting invalid entries.
    Next

    _ArraySort($aDateColumn, 0, 1)

    Local $sSorted = ""
    For $i = 1 To $aEntries[0] -1
        $sSorted &= $aDateColumn[$i][1] & $sLineBreak ; Append each entry in the new order.
    Next

    Return $sSorted & $aDateColumn[$aEntries[0]][1] ; Append the last entry without a line break
EndFunc ;==> _SortByDateFuzzy


Func __ValidateDate($sDay, $sMonth, $sYear) ; Year format must be ==> yyyy
    If $sMonth > 12 Or $sMonth = 0 Or $sDay = 0 Then Return False

    Switch Number($sMonth)
        Case 4, 6, 9, 11
            If $sDay > 30 Then Return False
        Case 2
            If (Mod($sYear, 4) And $sDay > 28) Or $sDay > 29 Then Return False
        Case Else
            If $sDay > 31 Then Return False
    EndSwitch

    Return True
EndFunc ;==> __ValidateDate

:

Eighteen different delimiter sequences (used internationally) are supported with single digits for months or days along with year formats of two or four digits. The function actually supports 48 possible variants in each of the three types of endianness. See notes in the function header.

Note: Some of the delimiter sequences may only occur when month names are used instead of numbers. They have low priority and are only included for reasons of compatibility with other date functions I have written. Month names are not supported at this present time.

Example;

Test_SortByDateFuzzy()

Func Test_SortByDateFuzzy()
    Local $aTest[3][2]
    $aTest[0][0] = "Little Endian"
    $aTest[1][0] = "Middle Endian"
    $aTest[2][0] = "Big Endian"

    $aTest[0][1] = _
    '14-8-2014,CSV DATA (UK)' & @CRLF & _
    'Chinese New Year: 31/1/14' & @CRLF & _
    'Fri 26-12.14 Boxing Day' & @CRLF & _
    'Chinese New Year 19 02 2015' & @CRLF & _
    '13 8 2014,CSV DATA (Little Endian)' & @CRLF & _
    '14-10.1066, Battle of Hastings' & @CRLF & _
    'Thu 25.12.14 Christmas Day' & @CRLF & _
    '#ID,"14. 10 14",DETAILS' & @CRLF & _
    'Bach was born on 31/3/1685' & @CRLF & _
    'Christmas Eve 24. 12 2014'

    $aTest[1][1] = _
    '8/14/2014,CSV DATA (Middle Endian)' & @CRLF & _
    '#ID,"10 14, 14",DETAILS' & @CRLF & _
    'Boxing Day - Fri 12. 26, 2014' & @CRLF & _
    'JSB was born on 03. 31, 1685' & @CRLF & _
    '12 25 14 Christmas Day' & @CRLF & _
    '12 24 2014 Christmas Eve' & @CRLF & _
    'Thursday, 02. 19, 2015 Chinese New Year' & @CRLF & _
    '8.13.2014,CSV DATA (US)' & @CRLF & _
    '1 31, 2014 Chinese New Year' & @CRLF & _
    'Battle of Hastings 10.14.1066'

    $aTest[2][1] = _
    '1066 10 14 - Battle of Hastings' & @CRLF & _
    'Chinese New Year 15 2 19' & @CRLF & _
    '2014-08-13,CSV DATA 2' & @CRLF & _
    '1685. 3. 31 - Bach was born' & @CRLF & _
    'Thu 2014. 12 25., Christmas Day' & @CRLF & _ ; ERROR
    'Boxing Day - Fri 14-12-26' & @CRLF & _
    '2014.8.14,CSV DATA 1' & @CRLF & _
    '#ID,"14. 10 14.",DETAILS' & @CRLF & _
    '14/12/24 Christmas Eve' & @CRLF & _
    'Friday, 2014-1-31 Chinese New Year'

    For $i = 0 To 2
        ConsoleWrite($aTest[$i][0] & " ===>" & @CRLF)
        ConsoleWrite(_SortByDateFuzzy($aTest[$i][1], $i +1) & @CRLF)
        ConsoleWrite(@CRLF)
    Next
EndFunc ;==> Test_SortByDateFuzzy()

:

Output:

Little Endian ===>
14-10.1066, Battle of Hastings
Bach was born on 31/3/1685
Chinese New Year: 31/1/14
13 8 2014,CSV DATA (Little Endian)
14-8-2014,CSV DATA (UK)
#ID,"14. 10 14",DETAILS
Christmas Eve 24. 12 2014
Thu 25.12.14 Christmas Day
Fri 26-12.14 Boxing Day
Chinese New Year 19 02 2015

Middle Endian ===>
Battle of Hastings 10.14.1066
JSB was born on 03. 31, 1685
1 31, 2014 Chinese New Year
8.13.2014,CSV DATA (US)
8/14/2014,CSV DATA (Middle Endian)
#ID,"10 14, 14",DETAILS
12 24 2014 Christmas Eve
12 25 14 Christmas Day
Boxing Day - Fri 12. 26, 2014
Thursday, 02. 19, 2015 Chinese New Year

Big Endian ===>
1066 10 14 - Battle of Hastings
1685. 3. 31 - Bach was born
Friday, 2014-1-31 Chinese New Year
2014-08-13,CSV DATA 2
2014.8.14,CSV DATA 1
#ID,"14. 10 14.",DETAILS
14/12/24 Christmas Eve
Thu 2014. 12 25., Christmas Day
Boxing Day - Fri 14-12-26
Chinese New Year 15 2 19

;

It is not assumed that the data is of any particular format - only the endianness of dates is assumed to be consistent. This can never be an exact science.

Edited August 14, 2014 by czardas

JohnOne · August 10, 2014

Code looks quite complex, so without reversing through it I'll attempt a suggestion to make less fuzzy via a little more logical assumption, assuming you do not already do it, purely pseudo of course.

Items/dates that cannot be determined such as 01/01/2014 and 12/12/2014 etcetera.

When encountered some sort of string algorithm could be employed to look for another item/entry with same similar pattern.

For example

01/01/2014 day 1 susan

12/12/2014 day 2 susan

31/09/2014 day 7

03/21/2014 day 7

14/12/2014 day 9 susan

First two dates cannot really be determined, so code searches rest of entries for similarities such as susan if you know what I mean, and perhaps other key words, if found and date can be determined then it is assumed the first can be considered same format.

Obviously not a simple task, and that's if you can even make sense of my ramblings.

Edited August 10, 2014 by JohnOne

czardas · August 10, 2014

Hmm, I'm not quite sure what you mean. If you leave the endianness as default (UK format) then the only failed enty is 03/21/2014. I suppose it's possible to have mixed endianness in the same text, but that's really impossible to determine.

01/01/2014 day 1 susan
31/09/2014 day 7
12/12/2014 day 2 susan
14/12/2014 day 9 susan
03/21/2014 day 7 <== Failed

Setting endianness to 'middle endian' or '2' (US Format) produces a different interpretation:

01/01/2014 day 1 susan
03/21/2014 day 7
12/12/2014 day 2 susan
31/09/2014 day 7 <== Failed
14/12/2014 day 9 susan <== Failed

I still need to add a check so that the number of days in february, april, june, september and november do not exceed 29 or 30. The selected delimiters are commonly used (in various combinations) in different regions throughout the world. Some data could throw a FP.

I'm personally going to use it for UK dates only.

Edited August 10, 2014 by czardas

JohnOne · August 10, 2014

I see, I thought your script was to sort mixed date formats or endianness if that's what that is.

czardas · August 10, 2014

Nope! Determining endianness would be mostly impossible. The idea is that most multiline data containing dates will be possible to parse providing endianness is specified. The delimiters are the fuzzy part. Suppose you are in an office and everyone uses UK endianness, but several people prefer different styles. Some write dd-mm-yyyy and lazy people write d/m/yy. That does not constitute a problem for this script.

Edited August 14, 2014 by czardas

czardas · August 13, 2014

Update to first post: Now this function would probably be better renamed to _SortByDateFussy because it's a lot more fussy (less false positives). It's still slightly fuzzy, but only a little bit. New description in 1st post.

Edited August 14, 2014 by czardas

jchd · August 13, 2014

My take about dates & timestamps: always store and process dates as YYYY-MM-DD and datetime as YYYY-MM-DDThh:mm:ss[.ssssss] (T may be a space) or as julian day and juliandate respectively and only convert to/from required user display format for human interface.

czardas · August 13, 2014

I became fed up with the strictness that the sort function imposes, and ended up using multiple date formats to accomodate limitations of my program. My program can already sort the yyyy/dd/mm format if every line starts with a date, but I don't wish to continue imposing such restrictions on myself. I prefer to have a function which is more intelligent when looking for dates, which could appear in different places within lines, possibly originating from different documents. The function is one of several sorting algorithms for multiline text in an edit control.

Edited August 13, 2014 by czardas

czardas · August 14, 2014

I have added a test example to the first post. Running the test revealed a bug with one of the (big endian) regular expressions which has now been fixed. My earlier comment about comma delimiters possibly conflicting with csv format was highly exagerated because csv fields containing commas must be wrapped in double quotes - silly me.

@jchd I don't have any plans to support the format 'YYYY-MM-DDThh:mm:ss[.ssssss]' because the delimiter 'T' is an odd exception not found in any language outside the realm of computing: besides this format is ready to sort anyway and does not require preprocessing. However if you change the delimiter 'T' to space, or any non-word character, then it's also good to go!

The thing I like most about this function is that it will sort dates with prefixed day names: Sun, Mon Tue etc...

Edited August 14, 2014 by czardas

jchd · August 14, 2014

I wasn't asking you to support all of ISO 8601 format variants, even any of them.

I felt the urge to dump a personal opinion dictated by countless of times that I hit walls processing ad-hoc dates or timestamps. It has nothing against your work or anything else.

I try hard to adopt the internal data representation having the most computability or compactness or, say, the most natural from a computer point of view. For integers it's 2's complement, for strings it's some choice of Unicode UTFx, etc. When I have to display values to human interface it's time to convert to a more natural representation from the human point of view in his context (e.g. language, preferences).

I would find utterly unnatural to input/display to my users (or myself) integers as they are represented in RAM like 87654321 (hex in PC) for 305419896 (human). Likewise it isn't particularly smart in general to represent numeric values as strings when you know you'll have to perform arithmetic with them, contrary to EAN barcodes for instance which are references which someday might contain non-digits when a new referencing convention appears. Fortunately languages like AutoIt provide automagic conversions between internal (computer) and human representations for numbers and strings.

Not all "computer languages" (or merely human languages that computers can be persuaded to understand) offer date, time or datetime built-in types and AutoIt is one of them. So the question is: how to represent such values internally?

Juliandate (astronomical) is the most computational-friendly and efficient way to represent precise and unambiguous timestamps, yet it's definitely human-unfriendly thus it needs conversion to what you, Hamamoto or me would like to see. It just happens that you, Hamamoto and me are all expecting a different formatting for the same data.

Edited August 14, 2014 by jchd

czardas · August 14, 2014

I did consider the idea: which I also understand is perfectly logical. I'm always appreciative of your comments and have great respect for your experience in this field. Indeed it is not easy, nor necessarily reliable, to cater for all plausible formats - hopefully just a few of the more commonly accepted ones. Internal 'back end' stuff clearly has to be consistent.

Personally, I would never recommend anyone using a two digit year under any circumstance.

Edited August 14, 2014 by czardas

_SortByDateFuzzy

Recommended Posts

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Create an account or sign in to comment

Create an account

Sign in

Recently Browsing 0 members

Similar Content