Impossible to do with RegEx?

avery · December 15, 2009

Does anyone know if this is impossible to do with RegEx or not? (Warning: headache material)

To populate an Array with:

Example #1
04/19/2005  09:16 AM            16,384 BUILTIN\Administrators filename.doc
1111111111  22222222            333333 44444444444444444444444444444444444

Example #2
04/19/2005  09:16 AM            16,384 BUILTIN\Administrators filename.doc
1111111111  22222222            333333 4444444444444444444444 555555555555

The blank areas are not tabs; unforgivably, they are spaces.

I also understand that a login name could include spaces, so I figured "Example #1" is either impossible or too likely to produce bad results.

I figured "Example #2" might be doable if I understood regex better.

I tried to use StringSplit, but the delimiters are not consistent enough for me to get good results.

Please, if there are any regex gurus out there, help me. These things hurt my head worse than anything else.

I understand I am asking for a lot of help. I'll donate $10 to jon@autoitscript.com to help with his hosting bills if anyone is willing to try and help me out with my regex.

Thanks for reading my post.

Respectfully,

Avery Howell

Merry Christmas or Happy Holidays!

PsaltyDS · December 15, 2009

Oh, c'mon it wasn't that har... uhmm... I mean...

That was tough! Here you go:

#include <Array.au3>

Global $aInput[3] = ["03/18/2007  08:16 AM               987 BUILTIN\Users SmallFile.doc", _
        "05/20/2008  12:01 PM            16,384 BUILTIN\Administrators filename.doc", _
        "04/19/2005  09:16 AM         2,316,384 BUILTIN\Guests BigFile.doc"]

Global $sRegExp = "(\d{2}/\d{2}/\d{4})(?:\s+)(\d{2}:\d{2}\s[[:alpha:]]{2})(?:\s+)([0-9,]+)(?:\s+)(.+)"

For $n = 0 To UBound($aInput) - 1
    $aRET = StringRegExp($aInput[$n], $sRegExp, 3)
    If IsArray($aRET) Then
        _ArrayDisplay($aRET, $n & ":  $aRET")
    Else
        ConsoleWrite($n & ":  Error" & @LF)
    EndIf
Next

Make that donation to AutoIt commensurate with the extreme effort this required!

SmOke_N · December 16, 2009

I was just going to offer another pattern example to achieve the same thing... however, I had a thought that maybe this is one larger fileread or string.

So...

#include <Array.au3>

Global $s_string = "04/19/2005  09:16 AM            16,384 BUILTIN\Administrators filename.doc" & @CRLF
$s_string &= "1111111111  22222222            333333 44444444444444444444444444444444444" & @CRLF
$s_string &= "08/24/2006  11:23 PM            6 BUILTIN\Administrators filename.doc" & @CRLF
$s_string &= "1111111111  22222222            333333 4444444444444444444444 555555555555"

; If we have a large string, we can do this in two parts ( or one if you want to step 4)
; Get just the lines that are valid
Global $a_just_lines = _myString_GetValidLinesArray($s_string)
If IsArray($a_just_lines) = 0 Then Exit
_ArrayDisplay($a_just_lines)

; If we are not skipping the above ( not using Step 4 )
; Then we can send each individual line and get the 4 parts of the values returned
Global $a_sep_data
For $i = 0 To UBound($a_just_lines) - 1
    $a_sep_data = _myString_GetValidDataArray($a_just_lines[$i])
    _ArrayDisplay($a_sep_data)
Next

Func _myString_GetValidLinesArray($s_string)
    Local $s_pattern = "(\d{2}/\d{2}/\d{4}\s+\d+:\d+\s+(?:AM|PM)\s+[\d,]+\s+.+?)(?:\v|\z)"
    Return StringRegExp($s_string, $s_pattern, 3)
EndFunc

Func _myString_GetValidDataArray($s_string)
    Local $s_pattern = "(\d{2}/\d{2}/\d{4})\s+(\d+:\d+\s+(?:AM|PM))\s+([\d,]+)\s+(.+?)(?:\v|\z)"
    Return StringRegExp($s_string, $s_pattern, 3)
EndFunc

PsaltyDS · December 16, 2009

SmOke_N wrote: "I was just going to offer another pattern example to achieve the same thing... however, I had a thought that maybe this is one larger fileread or string. So..."

Don't forget to emphasize what a huge level of effort this requires. We'd hate to see avery feel like a Scrooge at Christmas, now wouldn't we?

enaiman · December 16, 2009

Not the worst case to work with; only 1 group out of 4 is "not known" to you (it may have white spaces or not).

You know for sure that the first group and the third one do not have any white space. You also know that the second group has exactly one white space.

It can easily be done without StringRegExp (easier for me, because StringRegExp is still a matter of trial and error for me) this way:

- StringStripWS with flag 4 (strip double or more spaces between words)

- StringSplit for " " (white space)

- [1] is the first group (date)

- [2] & [3] is "time"

- [4] is "size"

- what's left is the last group
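
A minimal sketch of the steps above, assuming a single line in the same layout as the samples earlier in the thread. The variable names are made up for illustration, and note that rejoining the leftover tokens turns any run of spaces inside the owner or file name into a single space.

```autoit
#include <Array.au3>

; Sample line in the layout used earlier in the thread
Local $sLine = "04/19/2005  09:16 AM            16,384 BUILTIN\Administrators filename.doc"

; Flag 4 collapses runs of spaces between words into a single space
Local $sClean = StringStripWS($sLine, 4)

; Default flag: element [0] holds the number of tokens
Local $aTok = StringSplit($sClean, " ")

If $aTok[0] >= 5 Then
    Local $aFields[4]
    $aFields[0] = $aTok[1]                  ; date
    $aFields[1] = $aTok[2] & " " & $aTok[3] ; time plus AM/PM
    $aFields[2] = $aTok[4]                  ; size
    ; Everything left over is the owner and file name, rejoined with spaces
    For $i = 5 To $aTok[0]
        $aFields[3] &= $aTok[$i] & " "
    Next
    $aFields[3] = StringStripWS($aFields[3], 2) ; drop the trailing space
    _ArrayDisplay($aFields)
Else
    ConsoleWrite("Line has fewer tokens than expected" & @CRLF)
EndIf
```

This gives the four groups of "Example #1"; it does not try to separate the owner from the file name.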

It could have been worse: other groups might or might not have white space, or they might be present or not ... and you were speaking about headaches.

Malkey · December 16, 2009

Here is another attempt at using the string of numbers in each example as a template for the entries of an array.

#include <Array.au3>

Global $s_string = "04/19/2005 09:16 AM     16,384 BUILTIN\Administrators filename.doc" & @CRLF
$s_string &= "1111111111 22222222   333333 44444444444444444444444444444444444" & @CRLF
$s_string &= "08/24/2006 11:23 PM   16,384 BUILTIN\Administrators filename.doc" & @CRLF
$s_string &= "1111111111 22222222   333333 4444444444444444444444 555555555555"

Local $temp
$aInput = StringSplit(StringRegExpReplace($s_string, "([ ]+)", " "), @CRLF, 3)

For $Ex = 0 To UBound($aInput) - 1 Step 2
    Local $Pat = StringRegExp($aInput[$Ex + 1], "([^ ]+)", 3)

    Local $aArray[UBound($Pat)]
    ConsoleWrite($aInput[$Ex] & @CRLF)
    $Num = 1
    For $i = 0 To StringLen($aInput[$Ex + 1] & " ")
        If StringMid($aInput[$Ex + 1] & " ", $i, 1) = $Num Then
            $temp &= StringMid($aInput[$Ex], $i, 1)
        EndIf
        If StringMid($aInput[$Ex + 1] & " ", $i, 1) = " " Then
            $aArray[$Num - 1] = $temp
            $Num += 1
            ConsoleWrite($Num & " " & $temp & @CRLF)
            $temp = ""
        EndIf
    Next
    _ArrayDisplay($aArray)
Next

Skruge · December 16, 2009

PsaltyDS wrote: "Don't forget to emphasize what a huge level of effort this requires. We'd hate to see avery feel like a Scrooge at Christmas, now wouldn't we?"

You rang?

Seriously though, my contribution to this matter is thus:

The given output looks exactly like the output of the "dir /q" command.

If this is correct, then the owner field is fixed at 23 characters (longer names are concatenated with no space between them and the filename; shorter names are padded with spaces).
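
If that holds, the owner can be cut out by width instead of guessed by pattern. A rough sketch, where the 23-character width is taken from this post and not verified here, and the sample line and variable names are invented for illustration:

```autoit
#include <Array.au3>

; Assumption (from the post above, not verified here): the owner column of
; "dir /q" output is exactly 23 characters wide.
Local Const $OWNER_WIDTH = 23

; Hypothetical sample line, built so the owner is padded out to 23 characters
Local $sOwner = "BUILTIN\Users"
Local $sPad = "                       " ; 23 spaces
Local $sLine = "03/18/2007  08:16 AM               987 " & StringLeft($sOwner & $sPad, $OWNER_WIDTH) & "My File.doc"

; Date, time and size are matched by shape; the owner is taken by width,
; so spaces inside the owner or the file name no longer matter
Local $sPattern = "^(\d{2}/\d{2}/\d{4})\s+(\d{2}:\d{2}\s[AP]M)\s+([\d,]+)\s(.{" & $OWNER_WIDTH & "})(.*)$"
Local $aParts = StringRegExp($sLine, $sPattern, 3)

If IsArray($aParts) Then
    $aParts[3] = StringStripWS($aParts[3], 2) ; trim the padding after the owner
    _ArrayDisplay($aParts, "date | time | size | owner | file")
Else
    ConsoleWrite("Line did not match the assumed layout" & @CRLF)
EndIf
```

Lines whose size column is not a number (directory entries, for example) would fall through to the Else branch in this sketch.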

Mison · December 16, 2009

Regex Pattern..

Single Line

[\d/:,]+(?:\s[AP]M)?|[A-Z]+\\.*(?=\s)|[a-z.]+

~~Doesn't work if login name has spaces.~~ Fixed

Multiline mode:

(*ANYCRLF)(?m)[\d/:,]+(?:\s[AP]M)?|[A-Z]+\\.*(?=\s\S+$)|[a-z.]+

Edited December 16, 2009 by Mison

Malkey · December 16, 2009

Another attempt.

#include <Array.au3>

Global $s_string = "04/19/2005 09:16 AM     16,384 BUILTIN\Administrators filename.doc" & @CRLF
$s_string &= "1111111111 22222222 333333 44444444444444444444444444444444444" & @CRLF
$s_string &= "08/24/2006 11:23 PM 16,384 BUILTIN\Administrators filename.doc" & @CRLF
$s_string &= "1111111111 22222222 333333 4444444444444444444444 555555555555"


$aInput = StringSplit(StringRegExpReplace($s_string, "([ ]+)", " "), @CRLF, 3)

For $Ex = 0 To 1
    Local $aResult = StringRegExp($aInput[$Ex], "(.{10}) *(.{8}) *(.{6}) *(.*)", 3)
    _ArrayDisplay($aResult)
Next
For $Ex = 2 To 3
    Local $aResult2 = StringRegExp($aInput[$Ex], "(.{10}) *(.{8}) *(.{6}) *(.*?) (.*)", 3)
    _ArrayDisplay($aResult2)
Next

Anteaus · December 17, 2009

Just a point, but unless strings containing spaces are enclosed in quotes (which it looks like they aren't) then I don't think #2 can be done by any method. #1 should be feasible if times and dates are assumed to be in a regular format.

For example, if you have "domain\user name file.txt" there is no way of telling which section 'name' belongs to, so you cannot separate #4 from #5.
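
A small sketch of that ambiguity, using the string from the example above. Both patterns are illustrative only: each one matches, and they disagree about where the owner ends.

```autoit
Local $sTail = "domain\user name file.txt"

; Lazy first group: the owner stops at the first space
Local $aLazy = StringRegExp($sTail, "^(.+?) (.+)$", 3)
; Greedy first group: the owner stops at the last space
Local $aGreedy = StringRegExp($sTail, "^(.+) (.+)$", 3)

ConsoleWrite("lazy:   owner=[" & $aLazy[0] & "] file=[" & $aLazy[1] & "]" & @CRLF)
ConsoleWrite("greedy: owner=[" & $aGreedy[0] & "] file=[" & $aGreedy[1] & "]" & @CRLF)
; lazy:   owner=[domain\user] file=[name file.txt]
; greedy: owner=[domain\user name] file=[file.txt]
```

Nothing in the text itself says which split is the right one, so a pattern alone cannot decide; some outside rule (quoting, a fixed column width, a known list of accounts) is needed.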

avery · December 17, 2009

Thanks guys. I still think it was a hard regex.

In my example, the numbers under the data showed the array index I wanted each part to end up in. The 111, 222, 333, etc. are not in the original data source I'm looking to parse, but I'm pretty sure these awesome regexes would work either way, correct?

I will do the donation just as I promised and it was totally worth it even though some of you think it was easy. I've always struggled with regex for some reason. Maybe someone will buy me a regex book for Christmas, it was on my list to Santa.

enaiman · December 17, 2009

Told you it can be done without using RegEx. I agree, RegEx results in shorter and faster code, and for those RegEx gurus nothing is easier, but there are always workarounds. There is always at least one other way to do it.

danielkza · December 18, 2009

@avery:

Have you checked http://www.regular-expressions.info ? RegEx looked like voodoo to me as well until I took the time to read a good deal of its material. They have pretty clear examples of all the features (including more advanced topics, like lookarounds, greediness, etc.), including explanations of how each match is performed.

PS: Some motivation for you, if you need it:

Posted Image

"Wait, forgot to escape a space. Wheeeeee[taptaptap]eeeeee."

PsaltyDS · December 18, 2009

@danielkza: Your linky seemed to be a mashup of Cameron Laird's personal notes on "Regular Expressions" and Regular-Expressions.info
