sapsorrow

Regular Expressions

14 posts in this topic

Hi,

I'd like some help with a recursive regexp, this one: /(?x) (\w) ( (?R) * (?R) ) */

My intention was to grab each word, /\w+/, in a string then further subdivide each word into single characters, /\w/.

eg the string "Lorem ipsum dolor sit amet" using StringRegExp and flag 4 would yield an array containing:

Array[0]

[0] = "Lorem"

[1] = "L"

[2] = "o"

[3] = "r"

[4] = "e"

[5] = "m"

Array[1]

[0] = "ipsum"

[1] = "i"

[2] = "p"

[3] = "s"

[4] = "u"

[5] = "m"

and so on.

The regexp I've included above isn't so much the current best hope more the least embarrassing failure.

Thanks


Just looked at your post, but I'm off to bed right now. Download a regexp tool called Expresso. It'll allow you to build and test your expressions before writing the au3 code.


I'm not sure you can get that to work with StringRegExp. I've tried, but I can't do it, because you group the same characters twice, once with the smallest greediness and once with the biggest. When I try that with StringRegExp, I only get the last letter.

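#include <Array.au3>
; Attempt with a repeated capture group: the group only keeps its last
; iteration, so each match holds the whole word plus its final letter.
$string  = "Lorem ipsum dolor sit amet"
$results = StringRegExp($string , '([[:alpha:]]{1})+' ,4)
$match = $results[0]
_ArrayDisplay($match)

#include <Array.au3>
; Alternative without regular expressions: split into words on spaces,
; then split the first word into characters with an empty delimiter.
$string  = "Lorem ipsum dolor sit amet"
$words = StringSplit($string , " ")
$letters = StringSplit($words[1] , "")
_ArrayDisplay($words)
_ArrayDisplay($letters)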


The mere fact that the StringRegExp function returns a one-dimensional array tells me the best output you will get in one call is:

[0] = "Word"

[1] = "W"

[2] = "o"

[3] = "r"

[4] = "d"

[5] = "Another"

[6] = "A"

....

This will require one call to break the sentence into words and then another call on every element to break it into characters.

$Sentence = "Lorem ipsum"
$aWords = StringSplit($Sentence," ")

;Populate
For $X = 1 to $aWords[0]
    $aCharacters = StringSplit($aWords[$X], "")
    $aCharacters[0] = $aWords[$X]
    $aWords[$X] = $aCharacters
Next

;Display
For $X = 1 to $aWords[0]
    ConsoleWrite("[" & $X & "]:" & @CRLF)
    
    If IsArray($aWords[$X]) Then
        $aTemp = $aWords[$X]
        For $Y = 0 to Ubound($aTemp)-1
            ConsoleWrite(@TAB & "[" & $Y & "]: " & $aTemp[$Y] & @CRLF)
        Next
    Else
        ConsoleWrite($aWords[$X] & @CRLF)
    EndIf
Next


The mere fact that the StringRegExp function returns a one-dimensional array tells me the best output you will get in one call is:

Actually, StringRegExp mode 4 does return two-dimensional arrays, but I don't think this particular task can be done that way. I've tried and failed :)

So StringSplit and your code is the way to go :)



Captures themselves are not recursive. To do what you want you could pick some maximum number of letters and have that many captures for each word, or else have a regular expression that captures each word, and then another expression for individual characters of that word (or use a different mechanism that splits it into individual characters). Depending on what you plan to do with the result, maybe it would be sufficient to capture the first letter of a word, optionally followed by more word characters, optionally followed by a last character.

e.g.,

[0] = "Lorem"

[1] = "L"

[2] = "ore"

[3] = "m"

\b(\w)(\w*\B)?(\w?)\b
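
The two-expression approach described above could look roughly like the following. This is only a sketch: the variable names are made up for illustration, and it assumes flag 3 (return all global matches as a flat array) is acceptable in place of flag 4.

```autoit
#include <Array.au3>

; Illustrative input; any text will do.
Local $sText = "Lorem ipsum dolor sit amet"

; Pass 1: flag 3 returns every match of \w+ as a flat array of words.
Local $aWords = StringRegExp($sText, "\w+", 3)

For $i = 0 To UBound($aWords) - 1
    ; Pass 2: a second expression on each word returns its characters.
    Local $aChars = StringRegExp($aWords[$i], "\w", 3)
    ConsoleWrite($aWords[$i] & ": " & _ArrayToString($aChars, ",") & @CRLF)
Next
```

Unlike splitting on a space, the first pass also drops punctuation, since only runs of word characters are matched.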


Hi, again.

$rx = StringRegExp("Lorem ipsum dolor sit amet", "(?x) (\w) ( (?R) * (?R) ) *", 4)

The above code gives a five (there are five words) item main array. The first subarray gives:

[0] = "Lorem"

[1] = "L"

[2] = "orem"

"L" and "orem" feature twice - in the [0] element and the other two elements - suggesting that what I want is possible if tricky to rationalise. If it weren't possible then why offer the option of arrays within arrays as a return?

Just for contrast this code, which is an earlier attempt:

$rx = StringRegExp("Lorem ipsum dolor sit amet", "(?x) (\w+) (?: (\w) | (?2) ) (\w)", 4)

gives:

[0] = "Lorem"

[1] = "Lor"

[2] = "e"

[3] = "m"

This is a three part split: "Lor", "e", "m", (ignoring [0]) because of the three /\w+/, /\w/, /\w/ atoms in the regex. What I want is full recursion.

Thanks, so far. I'll likely go the easy way with this for now. On the other hand....

PS

I'm posting blind in part; so apologies if I'm rehashing things already discussed.


"L" and "orem" feature twice - in the [0] element and the other two elements - suggesting that what I want is possible if tricky to rationalise. If it weren't possible then why offer the option of arrays within arrays as a return?

AutoIt uses PCRE as a regular expression engine. PCRE executes only one match at a time and returns a vector from which the whole match and captures can be constructed. AutoIt executes PCRE in a loop to get multiple matches, and each time constructs the strings for the whole match and the captures from the content of PCRE's vector. (PCRE's vector actually contains the offsets, not the actual strings.)
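
To see what that loop hands back, here is a small sketch that walks a flag 4 result. It uses the three-group pattern suggested earlier in the thread; the variable names are only for illustration. Each inner array holds the whole match followed by the capture groups written in the pattern, so the number of slots per match is bounded by the pattern, not by the length of the word.

```autoit
; Flag 4 returns an array of arrays, one inner array per match.
Local $aMatches = StringRegExp("Lorem ipsum", "\b(\w)(\w*\B)?(\w?)\b", 4)

For $i = 0 To UBound($aMatches) - 1
    ; Copy the inner array into a variable before indexing it.
    Local $aMatch = $aMatches[$i]
    For $j = 0 To UBound($aMatch) - 1
        ; Element 0 is the whole match, the rest are the capture groups.
        ConsoleWrite("[" & $i & "][" & $j & "] = " & $aMatch[$j] & @CRLF)
    Next
Next
```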

Trust me, what you are trying to do is not possible. PCRE's author recently told me "Nobody has invented a scheme where the contents of capturing subpatterns are expressed as vectors." That was in response to a question I asked him about using recursive Oniguruma subroutines which are new in PCRE version 7.7. (I see that the current release of AutoIt is using PCRE 7.6. PCRE 7.7 has been available since May 7, so I have no idea why it wasn't used in AutoIt.)

There is a PCRE mechanism for doing callouts in the middle of a regular expression, but callouts are not implemented in AutoIt. Also from the author of PCRE, still talking about recursive captures: "If you are prepared to program for callouts, you can catch as many as you like, by using a repeat with a callout at the end, but of course this is not feasible if you are in an environment that doesn't allow you to use callouts."

Edited by Sheri


Trust me, what you are trying to do is not possible.

Well, despite the dreaded words "Trust me" I shall trust you. Thanks for your time and the reply, Sheri.


... using recursive Oniguruma subroutines which are new in PCRE version 7.7. (I see that the current release of AutoIt is using PCRE 7.6. PCRE 7.7 has been available since May 7, so I have no idea why it wasn't used in AutoIt.)

I don't speak for the developers, but I would hope they are not in the habit of throwing things into a new production version (currently 3.2.12.0 dated 16 May, 2008) that just became available 9 days earlier. That wouldn't allow for proper Beta testing first. I would expect to see PCRE 7.7 in a 3.2.13.x Beta before it is included in the 3.2.14.0 Prod some day.

:)

P.S. PCRE 7.6 came out 28 January, 2008. It was added to AutoIt Beta 3.2.11.3 on 18 March, 2008. And now it's in the production version deployed 16 May. Pretty reasonable time line, I think. More props for Jon... :)

Edited by PsaltyDS


Well, despite the dreaded words "Trust me" I shall trust you. Thanks for your time and the reply, Sheri.

I have never found usable results with flag 4 of StringRegExp(), which would have been your only option to get a nested array. The StringSplit() solution I posted above works just fine and the output looks exactly how you want it.


I have never found usable results with flag 4 of StringRegExp()

Reaching for the stars I ended up with a hot lightbulb covered in flyshit. Many thanks to everyone for their trouble.


I don't speak for the developers, but I would hope they are not in the habit of throwing things into a new production version (currently 3.2.12.0 dated 16 May, 2008) that just became available 9 days earlier. That wouldn't allow for proper Beta testing first. I would expect to see PCRE 7.7 in a 3.2.13.x Beta before it is included in the 3.2.14.0 Prod some day.

:)

P.S. PCRE 7.6 came out 28 January, 2008. It was added to AutoIt Beta 3.2.11.3 on 18 March, 2008. And now it's in the production version deployed 16 May. Pretty reasonable time line, I think. More props for Jon... :)

I disagree, because bug fixes are implemented in PCRE releases and PCRE itself undergoes beta cycles. Possibly Jon was not aware there had already been a new release, but such information is available. IMHO, if a new PCRE release is pending, it would even be worthwhile to hold off on a new Production version of AutoIt until the new PCRE release is available. It is one of the few detriments to static linking of PCRE that the application itself needs to be recompiled to use it.
