czardas Posted August 9, 2014 Posted August 9, 2014 (edited) Fed up of so many confusing formats? This is something for my text editor which can be used for ordering disorganized lines, with dates in them, into a timeline. The function _SortByDateFuzzy() attempts to put lines of text into sequential order using fuzzy date recognition. Default endianness is Little Endian (day/month/year sequence). The other two are of course very common: see link ==> http://en.wikipedia.org/wiki/Date_format_by_country ; expandcollapse popup#include <Array.au3> ; #FUNCTION# ==================================================================================================================== ; Name...........: _SortByDateFuzzy ; Description ...: Attempts to sort lines by date in multiline formated text such as CSV or similar. ; Syntax.........: _SortByDateFuzzy($sMultiline, [$Endianness = 1 [, $sLineBreak = @CRLF ]]) ; Parameters ....: $sMultiline - multiline formated text ; ; $Endianness - Default is day/month/year, set to 2 for month/day/year, set to 3 for year/month/day ; ; $sLineBreak - Default = CRLF. Only vertical space characters are allowed. ; Return values .: Success - Returns the text lines in a sequential timeline. ; Failure - Sets @error to the following: ; @error = 1 No input exists. ; @error = 2 Endianness not recognized. ; @error = 3 Incorrect linebreak characters detected. ; Author ........: czardas ; Modified.......: ; Remarks .......: This function looks for the first pattern it finds that looks like a date in each line. ; Days and months can be 1 or 2 digits, and years can be 2 or 4 digits. ; Two digit years are interpreted as being within the current century. ; The following 18 date formats (used in various countries worldwide) are supported: ; Little Endian 'd/m/y', 'd-m-y', 'd.m.y', 'd m y', 'd.m-y', 'd. m y' ; Middle Endian 'm/d/y', 'm-d-y', 'm.d.y', 'm d y', 'm d, y', 'm. d, y' ; Big Endian 'y/m/d', 'y-m-d', 'y.m.d', 'y m d', 'y. m. d', 'y. m d.' ; Lines not containing a recognisable date format are dumped at the end in the same order they are encountered. ; Limitation - If multiple formats occur in the same line, dates with two identical delimiters get priority. ; =============================================================================================================================== Func _SortByDateFuzzy($sMultiline, $Endianness = 1, $sLineBreak = @CRLF) If Not $sMultiline Then Return SetError(1) ; Invalid input. If $Endianness = Default Or $Endianness = "little endian" Then ; day/month/year $Endianness = 1 ElseIf $Endianness = "middle endian" Then ; month/day/year $Endianness = 2 ElseIf $Endianness = "big endian" Then ; year/month/day $Endianness = 3 EndIf If $sLineBreak = Default Then $sLineBreak = @CRLF If $sLineBreak = "" Or StringRegExp($sLineBreak, "\V") Then Return SetError(3) ; $sLineBreak must only contain vertical space characters. Local $iBound = 3 + ($Endianness = 1) Local $sReplace, $aPattern[$iBound] $aPattern[0] = $iBound -1 Switch $Endianness Case 1 ; Little Endian $aPattern[1] = '(?U)(.*)(\b\d{1,2})([\./ \-])(\d{1,2})(\g3)(\d{4}|\d{2}\b)(.*\z)' ; Delimiter repeats $aPattern[2] = '(?U)(.*)(\b\d{1,2})(\-)(\d{1,2})(\.)(\d{4}|\d{2}\b)(.*\z)' ; d-m.y $aPattern[3] = '(?U)(.*)(\b\d{1,2})(\. )(\d{1,2})( )(\d{4}|\d{2}\b)(.*\z)' ; d. m y $sReplace = "$6-$4-$2" Case 2 ; Middle Endian $aPattern[1] = '(?U)(.*)(\b\d{1,2})([\./ \-])(\d{1,2})(\g3)(\d{4}|\d{2}\b)(.*\z)' ; Delimiter repeats (same as above) $aPattern[2] = '(?U)(.*)(\b\d{1,2})(\. | )(\d{1,2})(, )(\d{4}|\d{2}\b)(.*\z)' ; d. m, y | d m, y $sReplace = "$6-$2-$4" Case 3 ; Big Endian $aPattern[1] = '(?U)(.*)(\b\d{4}|\b\d{2})([\./ \-]|\. )(\d{1,2})(\g3)(\d{1,2}\b)(.*\z)' ; Delimiter repeats $aPattern[2] = '(?U)(.*)(\b\d{4}|\b\d{2})(\. )(\d{1,2})( )(\d{1,2})(\..*\z)' ; y. m d. $sReplace = "$2-$4-$6" Case Else Return SetError(2) ; $Endianness must be a number between 1 and 3. EndSwitch Local $aEntries = StringSplit($sMultiline, $sLineBreak, 1) ; Whole delimiter string is used. Local $sDate, $aDateSplit, $aDateColumn[$aEntries[0] +1][2] ; A date sorting column is needed. For $i = 1 To $aEntries[0] $aDateColumn[$i][1] = $aEntries[$i] For $j = 1 To $aPattern[0] $sDate = StringRegExpReplace($aEntries[$i], $aPattern[$j], $sReplace, 1) ; Convert all to Big Endian If @extended Then ; Check replacements were made. $aDateSplit = StringSplit($sDate, "-", 2) $aDateSplit[0] = StringRight(StringLeft(@YEAR, 2) & $aDateSplit[0], 4) ; Accomodate 2 digit year If __ValidateDate($aDateSplit[2], $aDateSplit[1], $aDateSplit[0]) Then $aDateColumn[$i][0] = $aDateSplit[0] & _ ; Four digit year StringRight("0" & $aDateSplit[1], 2) & _ ; Accomodate single digit month StringRight("0" & $aDateSplit[2], 2) ; Accomodate single digit day ; Assuming either a single date or consistancy of format within a single line. ContinueLoop 2 ; See limitations EndIf EndIf Next $aDateColumn[$i][0] = "z" & StringRight("000000" & $i, 7) ; Prevent sorting invalid entries. Next _ArraySort($aDateColumn, 0, 1) Local $sSorted = "" For $i = 1 To $aEntries[0] -1 $sSorted &= $aDateColumn[$i][1] & $sLineBreak ; Append each entry in the new order. Next Return $sSorted & $aDateColumn[$aEntries[0]][1] ; Append the last entry without a line break EndFunc ;==> _SortByDateFuzzy Func __ValidateDate($sDay, $sMonth, $sYear) ; Year format must be ==> yyyy If $sMonth > 12 Or $sMonth = 0 Or $sDay = 0 Then Return False Switch Number($sMonth) Case 4, 6, 9, 11 If $sDay > 30 Then Return False Case 2 If (Mod($sYear, 4) And $sDay > 28) Or $sDay > 29 Then Return False Case Else If $sDay > 31 Then Return False EndSwitch Return True EndFunc ;==> __ValidateDate : Eighteen different delimiter sequences (used internationally) are supported with single digits for months or days along with year formats of two or four digits. The function actually supports 48 possible variants in each of the three types of endianness. See notes in the function header. Note: Some of the delimiter sequences may only occur when month names are used instead of numbers. They have low priority and are only included for reasons of compatibility with other date functions I have written. Month names are not supported at this present time. Example; expandcollapse popupTest_SortByDateFuzzy() Func Test_SortByDateFuzzy() Local $aTest[3][2] $aTest[0][0] = "Little Endian" $aTest[1][0] = "Middle Endian" $aTest[2][0] = "Big Endian" $aTest[0][1] = _ '14-8-2014,CSV DATA (UK)' & @CRLF & _ 'Chinese New Year: 31/1/14' & @CRLF & _ 'Fri 26-12.14 Boxing Day' & @CRLF & _ 'Chinese New Year 19 02 2015' & @CRLF & _ '13 8 2014,CSV DATA (Little Endian)' & @CRLF & _ '14-10.1066, Battle of Hastings' & @CRLF & _ 'Thu 25.12.14 Christmas Day' & @CRLF & _ '#ID,"14. 10 14",DETAILS' & @CRLF & _ 'Bach was born on 31/3/1685' & @CRLF & _ 'Christmas Eve 24. 12 2014' $aTest[1][1] = _ '8/14/2014,CSV DATA (Middle Endian)' & @CRLF & _ '#ID,"10 14, 14",DETAILS' & @CRLF & _ 'Boxing Day - Fri 12. 26, 2014' & @CRLF & _ 'JSB was born on 03. 31, 1685' & @CRLF & _ '12 25 14 Christmas Day' & @CRLF & _ '12 24 2014 Christmas Eve' & @CRLF & _ 'Thursday, 02. 19, 2015 Chinese New Year' & @CRLF & _ '8.13.2014,CSV DATA (US)' & @CRLF & _ '1 31, 2014 Chinese New Year' & @CRLF & _ 'Battle of Hastings 10.14.1066' $aTest[2][1] = _ '1066 10 14 - Battle of Hastings' & @CRLF & _ 'Chinese New Year 15 2 19' & @CRLF & _ '2014-08-13,CSV DATA 2' & @CRLF & _ '1685. 3. 31 - Bach was born' & @CRLF & _ 'Thu 2014. 12 25., Christmas Day' & @CRLF & _ ; ERROR 'Boxing Day - Fri 14-12-26' & @CRLF & _ '2014.8.14,CSV DATA 1' & @CRLF & _ '#ID,"14. 10 14.",DETAILS' & @CRLF & _ '14/12/24 Christmas Eve' & @CRLF & _ 'Friday, 2014-1-31 Chinese New Year' For $i = 0 To 2 ConsoleWrite($aTest[$i][0] & " ===>" & @CRLF) ConsoleWrite(_SortByDateFuzzy($aTest[$i][1], $i +1) & @CRLF) ConsoleWrite(@CRLF) Next EndFunc ;==> Test_SortByDateFuzzy() : Output: Little Endian ===> 14-10.1066, Battle of Hastings Bach was born on 31/3/1685 Chinese New Year: 31/1/14 13 8 2014,CSV DATA (Little Endian) 14-8-2014,CSV DATA (UK) #ID,"14. 10 14",DETAILS Christmas Eve 24. 12 2014 Thu 25.12.14 Christmas Day Fri 26-12.14 Boxing Day Chinese New Year 19 02 2015 Middle Endian ===> Battle of Hastings 10.14.1066 JSB was born on 03. 31, 1685 1 31, 2014 Chinese New Year 8.13.2014,CSV DATA (US) 8/14/2014,CSV DATA (Middle Endian) #ID,"10 14, 14",DETAILS 12 24 2014 Christmas Eve 12 25 14 Christmas Day Boxing Day - Fri 12. 26, 2014 Thursday, 02. 19, 2015 Chinese New Year Big Endian ===> 1066 10 14 - Battle of Hastings 1685. 3. 31 - Bach was born Friday, 2014-1-31 Chinese New Year 2014-08-13,CSV DATA 2 2014.8.14,CSV DATA 1 #ID,"14. 10 14.",DETAILS 14/12/24 Christmas Eve Thu 2014. 12 25., Christmas Day Boxing Day - Fri 14-12-26 Chinese New Year 15 2 19 ; It is not assumed that the data is of any particular format - only the endianness of dates is assumed to be consistent. This can never be an exact science. Edited August 14, 2014 by czardas operator64Â Â ArrayWorkshop
JohnOne Posted August 10, 2014 Posted August 10, 2014 (edited) Code looks quite complex, so without reversing through it I'll attempt a suggestion to make less fuzzy via a little more logical assumption, assuming you do not already do it, purely pseudo of course.Items/dates that cannot be determined such as 01/01/2014 and 12/12/2014 etcetera.When encountered some sort of string algorithm could be employed to look for another item/entry with same similar pattern.For example01/01/2014 day 1 susan12/12/2014 day 2 susan31/09/2014 day 703/21/2014 day 714/12/2014 day 9 susanFirst two dates cannot really be determined, so code searches rest of entries for similarities such as susan if you know what I mean, and perhaps other key words, if found and date can be determined then it is assumed the first can be considered same format.Obviously not a simple task, and that's if you can even make sense of my ramblings. Edited August 10, 2014 by JohnOne AutoIt Absolute Beginners  Require a serial  Pause Script  Video Tutorials by Morthawt  ipify Monkey's are, like, natures humans.
czardas Posted August 10, 2014 Author Posted August 10, 2014 (edited) Hmm, I'm not quite sure what you mean. If you leave the endianness as default (UK format) then the only failed enty is 03/21/2014. I suppose it's possible to have mixed endianness in the same text, but that's really impossible to determine. 01/01/2014 day 1 susan 31/09/2014 day 7 12/12/2014 day 2 susan 14/12/2014 day 9 susan 03/21/2014 day 7 <== Failed Setting endianness to 'middle endian' or '2' (US Format) produces a different interpretation: 01/01/2014 day 1 susan 03/21/2014 day 7 12/12/2014 day 2 susan 31/09/2014 day 7 <== Failed 14/12/2014 day 9 susan <== Failed I still need to add a check so that the number of days in february, april, june, september and november do not exceed 29 or 30. The selected delimiters are commonly used (in various combinations) in different regions throughout the world. Some data could throw a FP. I'm personally going to use it for UK dates only. Edited August 10, 2014 by czardas operator64Â Â ArrayWorkshop
JohnOne Posted August 10, 2014 Posted August 10, 2014 I see, I thought your script was to sort mixed date formats or endianness if that's what that is. AutoIt Absolute Beginners  Require a serial  Pause Script  Video Tutorials by Morthawt  ipify Monkey's are, like, natures humans.
czardas Posted August 10, 2014 Author Posted August 10, 2014 (edited) Nope! Determining endianness would be mostly impossible. Â The idea is that most multiline data containing dates will be possible to parse providing endianness is specified. The delimiters are the fuzzy part. Suppose you are in an office and everyone uses UK endianness, but several people prefer different styles. Some write dd-mm-yyyy and lazy people write d/m/yy. That does not constitute a problem for this script. Edited August 14, 2014 by czardas operator64Â Â ArrayWorkshop
czardas Posted August 13, 2014 Author Posted August 13, 2014 (edited) Update to first post: Now this function would probably be better renamed to _SortByDateFussy because it's a lot more fussy (less false positives). It's still slightly fuzzy, but only a little bit. New description in 1st post. Edited August 14, 2014 by czardas operator64Â Â ArrayWorkshop
jchd Posted August 13, 2014 Posted August 13, 2014 My take about dates & timestamps: always store and process dates as YYYY-MM-DD and datetime as YYYY-MM-DDThh:mm:ss[.ssssss] (T may be a space) or as julian day and juliandate respectively and only convert to/from required user display format for human interface. This wonderful site allows debugging and testing regular expressions (many flavors available). An absolute must have in your bookmarks.Another excellent RegExp tutorial. Don't forget downloading your copy of up-to-date pcretest.exe and pcregrep.exe hereRegExp tutorial: enough to get startedPCRE v8.33 regexp documentation latest available release and currently implemented in AutoIt beta. SQLitespeed is another feature-rich premier SQLite manager (includes import/export). Well worth a try.SQLite Expert (freeware Personal Edition or payware Pro version) is a very useful SQLite database manager.An excellent eBook covering almost every aspect of SQLite3: a must-read for anyone doing serious work.SQL tutorial (covers "generic" SQL, but most of it applies to SQLite as well)A work-in-progress SQLite3 tutorial. Don't miss other LxyzTHW pages!SQLite official website with full documentation (may be newer than the SQLite library that comes standard with AutoIt)
czardas Posted August 13, 2014 Author Posted August 13, 2014 (edited) I became fed up with the strictness that the sort function imposes, and ended up using multiple date formats to accomodate limitations of my program. My program can already sort the yyyy/dd/mm format if every line starts with a date, but I don't wish to continue imposing such restrictions on myself. I prefer to have a function which is more intelligent when looking for dates, which could appear in different places within lines, possibly originating from different documents. The function is one of several sorting algorithms for multiline text in an edit control. Edited August 13, 2014 by czardas operator64Â Â ArrayWorkshop
czardas Posted August 14, 2014 Author Posted August 14, 2014 (edited) I have added a test example to the first post. Running the test revealed a bug with one of the (big endian) regular expressions which has now been fixed. My earlier comment about comma delimiters possibly conflicting with csv format was highly exagerated because csv fields containing commas must be wrapped in double quotes - silly me. @jchd I don't have any plans to support the format 'YYYY-MM-DDThh:mm:ss[.ssssss]' because the delimiter 'T' is an odd exception not found in any language outside the realm of computing: besides this format is ready to sort anyway and does not require preprocessing. However if you change the delimiter 'T' to space, or any non-word character, then it's also good to go! The thing I like most about this function is that it will sort dates with prefixed day names: Sun, Mon Tue etc... Edited August 14, 2014 by czardas operator64Â Â ArrayWorkshop
jchd Posted August 14, 2014 Posted August 14, 2014 (edited) I wasn't asking you to support all of ISO 8601 format variants, even any of them. I felt the urge to dump a personal opinion dictated by countless of times that I hit walls processing ad-hoc dates or timestamps. It has nothing against your work or anything else. I try hard to adopt the internal data representation having the most computability or compactness or, say, the most natural from a computer point of view. For integers it's 2's complement, for strings it's some choice of Unicode UTFx, etc. When I have to display values to human interface it's time to convert to a more natural representation from the human point of view in his context (e.g. language, preferences). I would find utterly unnatural to input/display to my users (or myself) integers as they are represented in RAM like 87654321 (hex in PC) for 305419896 (human). Likewise it isn't particularly smart in general to represent numeric values as strings when you know you'll have to perform arithmetic with them, contrary to EAN barcodes for instance which are references which someday might contain non-digits when a new referencing convention appears. Fortunately languages like AutoIt provide automagic conversions between internal (computer) and human representations for numbers and strings. Not all "computer languages" (or merely human languages that computers can be persuaded to understand) offer date, time or datetime built-in types and AutoIt is one of them. So the question is: how to represent such values internally? Juliandate (astronomical) is the most computational-friendly and efficient way to represent precise and unambiguous timestamps, yet it's definitely human-unfriendly thus it needs conversion to what you, Hamamoto or me would like to see. It just happens that you, Hamamoto and me are all expecting a different formatting for the same data. Edited August 14, 2014 by jchd This wonderful site allows debugging and testing regular expressions (many flavors available). An absolute must have in your bookmarks.Another excellent RegExp tutorial. Don't forget downloading your copy of up-to-date pcretest.exe and pcregrep.exe hereRegExp tutorial: enough to get startedPCRE v8.33 regexp documentation latest available release and currently implemented in AutoIt beta. SQLitespeed is another feature-rich premier SQLite manager (includes import/export). Well worth a try.SQLite Expert (freeware Personal Edition or payware Pro version) is a very useful SQLite database manager.An excellent eBook covering almost every aspect of SQLite3: a must-read for anyone doing serious work.SQL tutorial (covers "generic" SQL, but most of it applies to SQLite as well)A work-in-progress SQLite3 tutorial. Don't miss other LxyzTHW pages!SQLite official website with full documentation (may be newer than the SQLite library that comes standard with AutoIt)
czardas Posted August 14, 2014 Author Posted August 14, 2014 (edited) I did consider the idea: which I also understand is perfectly logical. I'm always appreciative of your comments and have great respect for your experience in this field. Indeed it is not easy, nor necessarily reliable, to cater for all plausible formats - hopefully just a few of the more commonly accepted ones. Internal 'back end' stuff clearly has to be consistent. Personally, I would never recommend anyone using a two digit year under any circumstance. Edited August 14, 2014 by czardas operator64Â Â ArrayWorkshop
Recommended Posts
Create an account or sign in to comment
You need to be a member in order to leave a comment
Create an account
Sign up for a new account in our community. It's easy!
Register a new accountSign in
Already have an account? Sign in here.
Sign In Now