Robinson1

[solved] Removing comments with StringRegExpReplace()

12 posts in this topic

#1 ·  Posted (edited)

I would like to strip all comments from an au3 source code file, using regular expressions to do so.

Introduction

AutoIt offers StringRegExpReplace() / StringRegExp() for this.

Thinking this would be easy, I used the search pattern ".*;" and ended up with these matches:

;Comment1

"test" ;Comment1

"test1;test2"

"test1;test2";Comment2

So far everything looks good, except for the last two lines, where there is a string that contains a ';' and which shouldn't be stripped.

Using ".*(?:[;])" will leave out the ';' at the end, but it won't help with the string-problem.
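To make the string problem concrete, here is a minimal sketch in Python. Python's re module is only a stand-in for AutoIt's PCRE engine here, and the naive pattern ;.* is an assumption chosen for illustration (the post itself used ".*;"). The test lines are the four examples from above.

```python
import re

# The four example lines from this post.
lines = [
    ';Comment1',
    '"test" ;Comment1',
    '"test1;test2"',
    '"test1;test2";Comment2',
]

def strip_naive(line):
    # Naive approach: cut everything from the first ';' to the end of the line.
    return re.sub(r';.*', '', line)

naive_results = [strip_naive(line) for line in lines]
for before, after in zip(lines, naive_results):
    print(repr(before), '->', repr(after))
```

The last two lines get cut inside the string, which is exactly the problem described above.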

Okay, here is a quick excerpt from the AutoIt help about 'StringRegExp' (for lazy people like me; however, I really hope you have a look at the manual ... later :o )

. Match any single character (except newline).

* Matches the preceding character or subexpression zero or more times.

For example, zo* matches "z" and "zoo". * is equivalent to {0,}.

(?: ... ) Non-capturing group. Behaves just like a normal group, but does not record the matching characters in the array nor can the matched text be used for back-referencing.

The question

The question is (when I go up one level of abstraction from the actual AutoIt RegExp pattern syntax) how to create a nice syntax that will separate/sort out the comments.

An AutoIt string that starts with ' or " can be matched by this:

$RegExp_STRING_DOUBLE_QUOTED = '("[^"]*")'
$RegExp_STRING_SINGLE_QUOTED = "'([^']*)'"
$RegExp_AU_STRING = _
        "(?:" & $RegExp_STRING_DOUBLE_QUOTED & ")|" & _
        "(?:" & $RegExp_STRING_SINGLE_QUOTED & ")"

or shorter, but much harder to read, understand, or reuse:

$RegExp_AU_STRING = '(?:"([^"]*)")|(?:''([^'']*)'')'

^ The forum rendered the :'( of the second group as a smiley, which fits nicely in here even though I didn't intend it. It's no problem: if you copy & paste the pattern (maybe into http://www.regexbuddy.com) it will change back.

Anyway, that's only one more 'puzzle piece'.

I tried stuff like this

[^;]*<AU_STRING>

which obviously doesn't work, but hopefully it illustrates what I'm trying to do.

Does anyone have an idea, strategy, or technique for wisely separating / uniting these parts to get a working regular expression pattern?

Is it possible at all, and does it make sense, to do it with only one regexp pattern?

I mean, looking at tidy.exe and obfuscator.exe that come with AutoIt, or the AutoIt interpreter itself, which all perform this task, and also other programs such as the MS API that parses *.ini files, this is a common problem that is already solved.

In a quick search I didn't find any specific results on the topic. (But maybe my keywords were not good enough, or I didn't take enough time to check enough results.)

Edited by Robinson1




#2 ·  Posted (edited)

Dim $sSource = 'HotKeySet("{ESC}", "_EXIT") ; Use ESC to terminate script.' & @CRLF & _
               'ConsoleWrite("Use ; to add comments to the code up to the end of the line." & @LF) ; This is a comment' & @CRLF & _
               'MsgBox(64, ";;;", ";;"";;"";;") ; A comment ;];]' & @CRLF & _
               'ConsoleWrite('';;;'''';; ;; ;; ;'' & @LF) ; Comment ;p.'
               
Dim $sPatt = '(?:("(?:[^"]*(?:""[^"]*)*"))|(''(?:[^'']*(?:''''[^'']*)*''))|;[^\r\n]*[\r\n]*)'
$sSource = StringRegExpReplace($sSource, $sPatt, '\1')
ConsoleWrite($sSource & @LF)

$sSource = FileRead(@ProgramFilesDir & '\autoit3\include\array.au3')
$sSource = StringRegExpReplace($sSource, $sPatt, '\1')
ConsoleWrite($sSource & @LF)

Edited by Authenticity


#3 ·  Posted (edited)

Thanks for the quick reply.

Well, in the meantime I took some more time to understand how the RegExp engine works and found the solution to my problem myself. It's simply this:

$RegExp_COMMENTS = '(?:[;].*)'
$RegExp_STRING_DOUBLE_QUOTED = '("[^"]*")'
$RegExp_STRING_SINGLE_QUOTED = "'([^']*)'"
$RegExp_AU_STRING = _
        "(?:" & $RegExp_COMMENTS & ")|" & _
        "(?:" & $RegExp_STRING_DOUBLE_QUOTED & ")|" & _
        "(?:" & $RegExp_STRING_SINGLE_QUOTED & ")"

I just needed to add the comments puzzle piece with '|' (the alternation operator).

Take this line as an example:

0123456789012345678901234567
$A="test1" ;Comment1 "test2"

The RegExp engine will work like this:

^ The read position is at the beginning of the line

(Lineoffset=0)
$ equal to ; ? No.
$ equal to " ? No.
$ equal to ' ? No.

(Lineoffset=1)
A equal to ; ? No.
A equal to " ? No.
A equal to ' ? No.

(Lineoffset=2)
= equal to ; ? No.
= equal to " ? No.
= equal to ' ? No.

(Lineoffset=3)
" equal to ; ? No.
" equal to " ? Yes. Match/forward over 'test1' using this pattern: '[^"]*', then over the closing ".
(" equal to ' ? is never executed)

(Lineoffset=10)
' ' equal to ; ? No.
' ' equal to " ? No.
' ' equal to ' ? No.

(Lineoffset=11)
; equal to ; ? Yes. Match/forward over 'Comment1 "test2"' till the end of the line using this pattern: '.*'
(; equal to " ? is never executed)
(; equal to ' ? is never executed)

(Lineoffset=28)
The read position has reached the end of the line
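The walk-through above can be reproduced with a short Python sketch. Python's re module stands in for AutoIt's PCRE engine (an assumption), and the names TOKEN and strip_comments are made up for this illustration. The alternatives are tried in the same order as in the pattern above: comment first, then double-quoted string, then single-quoted string.

```python
import re

# Comment first, then the two string types, in the same order as above.
TOKEN = re.compile(r'''(;.*)|("[^"]*")|('[^']*')''')

def strip_comments(line):
    # Keep strings (groups 2 and 3) untouched, drop comments (group 1).
    return TOKEN.sub(lambda m: '' if m.group(1) is not None else m.group(0), line)

line = '$A="test1" ;Comment1 "test2"'
# Where each match starts and what it covers.
tokens = [(m.start(), m.group(0)) for m in TOKEN.finditer(line)]
stripped = strip_comments(line)
print(tokens)
print(repr(stripped))
```

Because the string alternative consumes the whole "test1" in one go, a ';' inside a string is never seen by the comment alternative, and the "test2" after the ';' is swallowed as part of the comment.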

This is the syntax graph for this grammar:

           |-> ; --> COMMENT ---------------->|
           |                                  |
   ->--|---O-> " --> STRING_DOUBLE_QUOTED --->|--|-->
       |   |                                  |  |
       |   |-> ' --> STRING_SINGLE_QUOTED --->|  |
       |                                         |
       ^-----------------------------------------<

Hmm I wonder what's inside that pattern

Dim $sPatt = '(?:("(?:[^"]*(?:""[^"]*)*"))|(''(?:[^'']*(?:''''[^'']*)*''))|;[^\r\n]*[\r\n]*)'

Oh dear, in this 'pure' form a regexp pattern is really cryptic...

Okay unquoted it's like this:

(?:("(?:[^"]*(?:""[^"]*)*"))|('(?:[^']*(?:''[^']*)*'))|;[^\r\n]*[\r\n]*)


(?:
  ("(?:[^"]*(?:""[^"]*)*"))| -->STRING_DOUBLE_QUOTED
  ('(?:[^']*(?:''[^']*)*'))|  -->STRING_SINGLE_QUOTED
;[^\r\n]*[\r\n]* -->COMMENT
)
^- Now it's getting a little clearer. But there are still some minor questions.

COMMENT:

; <- Starts a comment

[^\r\n]* <- Any char that's no line break

[\r\n]* <- A comment ends with a line break

\n Match a linefeed (@LF, chr(10)).

\r Match a carriage return (@CR, chr(13)).

So I wonder if ;[^\r\n]*[\n]{0,1} will work too. At the end of a line there is only one line break (unless the line is at the end of the file, in which case there is none). In Windows a line ends with @CR@LF, that is \r\n as a regexp pattern, while Linux/Unix just uses @LF (\n).

STRING_DOUBLE_QUOTED

"(?:[^"]*(?:""[^"]*)*")

"
(?:
    [^"]*
    (?:""[^"]*)* <-- Takes care of 'masked' (doubled) quotes inside a string
    "
)

^ Let's do the following replacement:

<StringBody>='[^"]*'

"(?:<StringBody>

(?:""<StringBody>)*

")

Now I see how it's working - and did the following optimization:

"<StringBody>

(?:""<StringBody>)*

"

STRING_SINGLE_QUOTED

-> similar to STRING_DOUBLE_QUOTED

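The <StringBody> decomposition of the double-quoted string can be tried out with a small Python sketch. Python's re module is used as a stand-in for AutoIt's PCRE engine (an assumption), and STRING_BODY and DQ_STRING are illustrative names only. The sample line is taken from the test data in post #2.

```python
import re

STRING_BODY = r'[^"]*'
# "<StringBody>(?:""<StringBody>)*"
DQ_STRING = re.compile('"' + STRING_BODY + '(?:""' + STRING_BODY + ')*"')

sample = 'MsgBox(64, ";;;", ";;"";;"";;") ; A comment'
# There are no capture groups, so findall() returns the full matches.
found = DQ_STRING.findall(sample)
print(found)
```

The second match spans the doubled quotes, so a string containing "" is treated as one string rather than as several short ones.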
Edited by Robinson1


#4 ·  Posted (edited)

Don't forget that there are #CS...#CE sections too, although they are not common.

Edit: Here is a more correct one:

Dim $sPatt = '(?m)(?:("(?:[^"]*(?:""[^"]*)*"))|(''(?:[^'']*(?:''''[^'']*)*''))|^;[^\r\n]*[\r\n]*|;[^\r\n]*)'
Edited by Authenticity


#5 ·  Posted (edited)

Edit: Here is a more correct one:

Dim $sPatt = '(?m)(?:("(?:[^"]*(?:""[^"]*)*"))|(''(?:[^'']*(?:''''[^'']*)*''))|^;[^\r\n]*[\r\n]*|;[^\r\n]*)'
So what has changed?
(?m)(?:("(?:[^"]*(?:""[^"]*)*"))|('(?:[^']*(?:''[^']*)*'))|^;[^\r\n]*[\r\n]*|;[^\r\n]*)
    (?:("(?:[^"]*(?:""[^"]*)*"))|('(?:[^']*(?:''[^']*)*'))|_;[^\r\n]*[\r\n]*          )

Okay I see:

1. The (?m) option was added, which makes ^ and $ match at newlines within the data.

2. Modified the comment part

_;[^\r\n]*[\r\n]*          became
 ^;[^\r\n]*[\r\n]*|;[^\r\n]*
The new pattern with a cosmetic line break:
^;[^\r\n]*[\r\n]*
|;[^\r\n]*
...and again a replacement for better readability

<commentbody>=[^\r\n]*

 ^;<commentbody>[\r\n]*  <- commentbody at the very beginning of the line
|;<commentbody>      <- commentbody somewhere else
But what is the reason for this change?

I checked the comment part (?m)^;[^\r\n]*[\r\n]*|;[^\r\n]* with RegexBuddy (PCRE) using this test data:

;
;etretetet
ttt;etetet
ttt;ete;tet
;
and didn't see why this is better now.

Btw, ;.* seems to cover exactly the same.

Edited by Robinson1


#6 ·  Posted (edited)

Regarding the new part "^;[^\r\n]*[\r\n]*", it's more correct to use "^\s*;[^\r\n]*[\r\n]" to remove lines composed only of comments. Knowing the main reason for removing comments in the first place may help to understand why this is necessary. If the goal is to save the source without comments, then you'll need to search for #cs...#ce as well. If you intend to analyze the file and want to pre-clean it so you can operate on it like the compilers do, it'll make more sense to just dig out the necessary parts such as keywords, directives, variable declarations, etc...

Edited by Authenticity


#7 ·  Posted (edited)

Hehe, thanks for the inspiration. So this is your source with some minor changes/extensions...

Dim $sSource = _
'HotKeySet("{ESC}", "_EXIT") ; Use ESC to terminate script.' & @CRLF & _
'ConsoleWrite("Use ; to add comments to the code up to the end of the line." & @LF) ; This is a comment' & @CRLF & _
'MsgBox(64, ";;;", ";;"";;"";;"&"""") ; A comment ;];]' & @CRLF & _
'ConsoleWrite('';;;'''';; ;; ;; ;'' & @LF) ; Comment ;p.'

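; The string bodies must be defined before they are used below
Const $StringBody_SingleQuoted = "[^']*"
Const $StringBody_DoubleQuoted = '[^"]*'
Const $String_SingleQuoted= "(?:'" & $StringBody_SingleQuoted & "')+"
Const $String_DoubleQuoted= '(?:"' & $StringBody_DoubleQuoted & '")+'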

Const $LineCommentBody='[^\r\n]*'
Dim $sPatt = _
        '("' & $StringBody_DoubleQuoted & '(?:""' & $StringBody_DoubleQuoted & ')*")|'& _
        "('" & $StringBody_SingleQuoted & "(?:''" & $StringBody_SingleQuoted & ")*')|"& _
        ';' & $LineCommentBody

ConsoleWrite( "Raw RegExp pattern:" & @CRLF & $sPatt & @CRLF & @CRLF)   
    
Dim $Result = StringRegExpReplace($sSource, $sPatt, '\1')

ConsoleWrite($Result & @LF)

;--- Validate Result---
Dim $ValidResult = StringRegExpReplace($sSource, '(?m)(?:("(?:[^"]*(?:""[^"]*)*"))|(''(?:[^'']*(?:''''[^'']*)*''))|^;[^\r\n]*[\r\n]*|;[^\r\n]*)', '\1')
If ($ValidResult == $Result) then
    ConsoleWrite( @CRLF &'OK. ' & @CRLF  & @CRLF )
Else
    ConsoleWrite( @CRLF &'Err- Test fail result is not as expected!' & @CRLF   & @CRLF )
    ConsoleWrite($ValidResult) 
EndIf

;$sSource = FileRead(@ProgramFilesDir & '\autoit3\include\array.au3')
;$sSource = StringRegExpReplace($sSource, $sPatt, '\1')
;ConsoleWrite($sSource & @LF)

That's the raw RegExp pattern when $String_DoubleQuoted and $String_SingleQuoted are used as the two capture groups:

((?:"[^"]*")+)|((?:'[^']*')+)|;[^\r\n]*

And I'll take care of #cs...#ce soon.

Edited by Robinson1


#8 ·  Posted (edited)

I liked the last resulting RE. ;]

It may involve more backtracking, because the saved backtracking positions are never thrown away until the next match attempt succeeds or fails, so I thought it needed a last touch. ;]

((?>"[^"]*")+|(?>'[^']*')+)|;.*

and the replacement string is just \1 heh. I think it'll be even better to use possessive quantifiers where reasonable:

((?>"[^"]*")++|(?>'[^']*')++)|;.*

It'll be wise to add the #cs...#ce alternation at the end of the pattern, as it'll then require no retracing of the entire string and zipping through the ".." '..' parts searching for comment sections.

Edited by Authenticity


I liked the last resulting RE. ;]

It may involve more backtracking, because the saved backtracking positions are never thrown away until the next match attempt succeeds or fails, so I thought it needed a last touch. ;]

((?>"[^"]*")+|(?>'[^']*')+)|;.*
Thanks for the hint about 'Atomic Grouping'. (I had missed 'advanced RE stuff' like this, lookahead and so on, because it wasn't mentioned in the AutoIt help. Yes, I know: 'Complete description can be found here'.) I tried it out in RegexBuddy:

(?:"[^"]*")+ vs (?>"[^"]*")+

on a few strings and looked at the debug log, but there were no changes in how many steps or backtracking steps were required to match the pattern. Anyway, I'll read the help file/tutorial about 'Atomic Grouping' to learn more about it, and also about 'possessive' quantifiers (until now I only cared about whether a quantifier is greedy or lazy).

I also already found out that I have 2 capture groups in there

((?:"[^"]*")+)|((?:'[^']*')+)|;[^\r\n]*

which resulted in single-quoted strings getting deleted when removing the comments, because the replacement string only uses \1.
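The effect of the two capture groups can be seen in a small Python sketch. Python's re module stands in for AutoIt's PCRE engine here (an assumption), and the sample line is made up for illustration. It relies on Python 3.5 or later, where an unmatched group in the replacement string is replaced by an empty string.

```python
import re

# Hypothetical sample line with one single-quoted and one double-quoted string.
source = "ConsoleWrite(';;' & \"a;b\") ; comment"

two_groups = r'''((?:"[^"]*")+)|((?:'[^']*')+)|;[^\r\n]*'''
one_group = r'''((?:"[^"]*")+|(?:'[^']*')+)|;[^\r\n]*'''

# With two groups, \1 is empty whenever the single-quoted branch matched,
# so single-quoted strings vanish together with the comments.
broken = re.sub(two_groups, r'\1', source)
fixed = re.sub(one_group, r'\1', source)
print(repr(broken))
print(repr(fixed))
```

Putting both string alternatives into one capture group means \1 always holds the string, whichever quote type matched.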

Today I also integrated the #cs...#ce comments:

Raw RegExp pattern:
(?x)    # MODIFIER
(?:(?:[\r\n]{0,2}\s*;.*|;.*))|  # LineComment
(?s)\#c(?:s|omments-start).*?\#c(?:e|omments-end)(?-s)| # BlockComment
((?:"[^"]*")+|  # String_DoubleQuoted
(?:'[^']*')+)   # String_SingleQuoted

Dim $sSource = _
    'HotKeySet("{ESC}", "_EXIT") ; Use ESC to terminate script.' & @CRLF & _
    'ConsoleWrite("Use ; to add comments to the code up to the end of the line." & @LF) ; This is a comment' & @CRLF & _
    '1sgBox(64, ";;;", ";;"";;"";;"&"""") ; A comment ;];]' & @CRLF & _
    ';#cs 2sgBox(64, "#cs", ";#cs"";;""#ce"&"""") ; A comment ;];]' & @CRLF & _
    '3sgBox(64, "", "") ; A comment ;];]' & @CRLF & _
    '#ce 4sgBox(64, "", "") ; A comment ;];]' & @CRLF & _
    '#cs A valid block Comment ; A comment ;];]' & @CRLF & _
    '6sgBox(64, "", "") ; A comment ;];]' & @CRLF & _
    '#ce 7sgBox(64, "", "") ; A comment ;];]' & @CRLF & _
    'ConsoleWrite('';;;'''';; ;; ;; ;'' & @LF) ; Comment ;p.'


    Const $StringBody_SingleQuoted  = "[^']*"
    Const $String_SingleQuoted      = "(?:'" & $StringBody_SingleQuoted & "')+"

    Const $StringBody_DoubleQuoted  = '[^"]*'
    Const $String_DoubleQuoted      = '(?:"' & $StringBody_DoubleQuoted & '")+'

  ; \r => carriage return @CR, chr(13)   -   \n => linefeed        @LF, chr(10)
  ; 2 - in Windows it's @CR@LF        -   1 - in Linux/Unix it's just @LF
  ; 0 - at the end of the file there is none of these
    Const $LineBreak        = "[\r\n]{0,2}"

  ; $LineComment_EntiredLine should include the LineBreak in the match -> so the whole line
  ; can be deleted - while for 'NotEntiredLine' comments the line break is kept as it is.
    Const $LineComment_NotEntiredLine   = ';.*'
    Const $LineComment_EntiredLine      = $LineBreak & '\s*' & $LineComment_NotEntiredLine 
    Const $LineComment      = "(?:" & $LineComment_EntiredLine & '|' & $LineComment_NotEntiredLine & ")"

    const $BlockCommentStart= "\#c(?:s|omments-start)"
    const $BlockCommentEnd  = "\#c(?:e|omments-end)"
    const $BlockComment     = "(?s)" & $BlockCommentStart & ".*?" & $BlockCommentEnd & "(?-s)"

    Dim $sPatt = "(?x)" & "    # MODIFIER" & @CRLF & _
           '(?:'&  $LineComment & ")|"  & "    # LineComment" & @CRLF & _
             '' & $BlockComment & "|" & "    # BlockComment"  & @CRLF & _
            '(' & $String_DoubleQuoted & '|' & "    # String_DoubleQuoted" & @CRLF & _
             "" & $String_SingleQuoted & ")" & "    # String_SingleQuoted"

    ConsoleWrite( "Raw RegExp pattern:" & @CRLF & $sPatt & @CRLF & @CRLF)   

    ConsoleWrite($sSource & @CRLF & @CRLF)


    Dim $Result = StringRegExpReplace($sSource, $sPatt, '\1')

    ConsoleWrite($Result & @LF)

Commenting out the comment like this works

;#cs

Test "#ce"

#ce

However, there are still some unsolved problems,

like this (a #ce inside a string within a block comment):

#cs

Test "#ce"

#ce

and nested comments like this

#cs

#cs

#ce

#ce

^- However, REs are not made for recursive stuff like this. Still, I think I can use an RE to detect whether there are nested comments (which is probably very seldom the case)...


#10 ·  Posted (edited)

So done ;)

RegExp pattern (with comments):
(?x)
 (?:\r?\n?\s*;.*)|  # LineComment

(?s)
 \r?\n?\s*\#c(?>s|omments-start)(?>(?>"[^"]*")+|    # String_DoubleQuoted
                                        (?>'[^']*')+|.)*? # String_SingleQuoted
              \#c(?>e|omments-end)| # BlockComment
(?-s)

((?>"[^"]*")+|  # String_DoubleQuoted
 (?>'[^']*')+   # String_SingleQuoted

)

Raw RegExp pattern:
(?:\r?\n?\s*;.*)|(?s)\r?\n?\s*\#c(?>s|omments-start)(?>(?>"[^"]*")+|(?>'[^']*')+|.)*?\#c(?>e|omments-end)(?-s)|((?>"[^"]*")+|(?>'[^']*')+)

That's the final Version:

Func RE_Comment($Comment)
; To hide comments in RE_Pattern -  uncomment the next line 
;   Return ""
    Return "    # " & $Comment & @CRLF
EndFunc

    Const $StringBody_SingleQuoted  = "[^']*"
    Const $String_SingleQuoted      = "(?>'" & $StringBody_SingleQuoted & "')+"

    Const $StringBody_DoubleQuoted  = '[^"]*'
    Const $String_DoubleQuoted      = '(?>"' & $StringBody_DoubleQuoted & '")+'
    
    Const $String                   =   $String_DoubleQuoted & '|' & RE_Comment("String_DoubleQuoted") & _
                                        $String_SingleQuoted & RE_Comment("String_SingleQuoted")
  ; \r => carriage return @CR, chr(13)   -   \n => linefeed        @LF, chr(10)
  ; 2 - in Windows it's @CR@LF        -   1 - in Linux/Unix it's just @LF
  ; 0 - at the end of the file there is none of these
    Const $LineBreak        = "\r?\n?"
    Const $WhiteSpaces      = "\s*"

  ; $LineComment_EntiredLine should include the LineBreak in the match -> so the whole line
  ; can be deleted - while for 'NotEntiredLine' comments the line break is kept as it is.
  ; $BlockCommentEnd is in there for the case there's a line like this: " #ce ;some comment"
    Const $LineComment      = $LineBreak & $WhiteSpaces &';.*'
    

    const $BlockCommentStart=   $LineBreak & $WhiteSpaces & "\#c(?>s|omments-start)"
    const $BlockCommentEnd  =   "\#c(?>e|omments-end)"
    const $BlockComment     =   "(?s)" & $BlockCommentStart & "(?>" & _
                                $String & "|" & "." & _ 
                                ")*?" & $BlockCommentEnd & "(?-s)"

    Dim $sPatt = '(?x)' & @CRLF & _
           '(?:'&  $LineComment & ')|' & RE_Comment('LineComment') & _
                   $BlockComment & '|' & RE_Comment('BlockComment') & _
             '(' & $String & ')' 

    ConsoleWrite( "Raw RegExp pattern:" & @CRLF & $sPatt & @CRLF & @CRLF)   

    Dim $sSource = _
    'HotKeySet("{ESC}", "_EXIT") ; Use ESC to terminate script.' & @CRLF & _
    'ConsoleWrite("Use ; to add comments to the code up to the end of the line." & @LF) ; This is a comment' & @CRLF & _
    '#cs A valid block Comment ; A Linecomment ' & @CRLF & _
    'MsgBox(64, "#", "#cs""#ce") ; A comment' & @CRLF & _
    '#ce ;space after #comments-end' & @CRLF & _
    'MsgBox(64, ";;;", ";;"";;"";;"&"""") ; A comment ;];]' & @CRLF & _
    ';#cs A uncomment blockcomment' & @CRLF & _
    '#ce 4sgBox(64, "", "") ; A comment' & @CRLF & _
    ' ; A line comment (''EntiredLine'' type) ' & @CRLF & _
    'ConsoleWrite('';;;'''';; ;; ;; ;'' & @LF) ; comment (''NotEntiredLine'' type)'

    ConsoleWrite($sSource & @CRLF & @CRLF)


    Dim $Result = StringRegExpReplace($sSource, $sPatt, '\1')

    ConsoleWrite($Result & @LF)

This version also handles block comments that have a string with "#ce" inside, like this:

#cs

$Test="#ce"

#ce

It took me some time to find a way to make the RE skip over strings:

[blockCommentStart](?>[string] | . )*? [blockCommentEnd]

The [string] part will 'eat' all strings, while the '.' in '(?>[string] | . )*?' will consume the chars outside the strings. The '?' after the '*' makes the '*' lazy, which means that each char is first offered to the [blockCommentEnd] terminator (or terminal symbol). If [blockCommentEnd] can't match the chars with '#ce', the RE engine will jump back into the '(?>[string] | . )' loop (or in RE words, 'backtrack'); otherwise [blockCommentEnd] will terminate/confirm the match and the loop is exited (and the backtracking stack gets cleaned).
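The lazy "string or any char" loop can be sketched in Python as follows. Python's re module stands in for AutoIt's PCRE engine (an assumption); the atomic groups (?>...) from the thread are swapped for plain non-capturing groups (?:...) so that the sketch also runs on Python versions older than 3.11, and only the short #cs / #ce forms are handled. BLOCK and NAIVE are illustrative names.

```python
import re

# Lazy loop over "a whole string, or any single character", ended by #ce.
BLOCK = re.compile(r'''#cs(?:"[^"]*"|'[^']*'|.)*?#ce''', re.DOTALL)
# For comparison: a loop that knows nothing about strings.
NAIVE = re.compile(r'#cs.*?#ce', re.DOTALL)

source = '#cs\n$Test="#ce"\n#ce\nMsgBox(0, "", "kept")'

naive_result = NAIVE.sub('', source)
string_aware_result = BLOCK.sub('', source)
print(repr(naive_result))
print(repr(string_aware_result))
```

The naive version stops at the "#ce" inside the string and leaves debris behind, while the string-aware loop consumes "#ce" as part of a string and only ends at the real terminator.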

There is no need to separate 'LineComment_EntiredLine' and 'NotEntiredLineComments'; this will handle them both:

Const $LineComment = $LineBreak & $WhiteSpaces &';.*'

In a line like

Quit() ;This quits the program

$LineBreak & $WhiteSpaces are defined as optional (both may be <Empty>), so the line break is ignored here, while

; This is a line comment that takes an entire line.

will match the LineBreak and the leading spaces, so that the whole line gets dismissed/deleted on a RE replace.
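A minimal Python sketch of just this line-comment part, ignoring strings and block comments. Python's re module stands in for AutoIt's PCRE engine (an assumption), LINE_COMMENT is an illustrative name, and [^\r\n]* is used in place of .* because Python's '.' would also match '\r'.

```python
import re

# Optional line break and whitespace in front of the ';', as in $LineComment.
LINE_COMMENT = re.compile(r'\r?\n?\s*;[^\r\n]*')

source = 'Quit() ;This quits the program\r\n  ; A comment-only line\r\nExit'
result = LINE_COMMENT.sub('', source)
print(repr(result))
```

The trailing comment is removed together with the space in front of it, and the comment-only line disappears together with the line break that precedes it.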

Optimization considerations.

You can merge

((?>"[^"]*")+| # <- $String_DoubleQuoted &

(?>'[^']*')+) # <- $String_SingleQuoted

together using a backreference:

((?>(["'])[^"']*\2)+) with the small flaw that this pattern will also match a string like this as one string: "qwe"'asd'

Too bad that a backreference inside a character class is not allowed:

((?>(["'])[^\2]*\2)+)

A negative lookahead allows you to use '\2' as a backreference:

((?>(["'])(?>(?!\2).)*\2)+)

but for this pattern the RE engine needs twice as many steps to match the string.
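The negative-lookahead variant can be tried in Python like this. Python's re module stands in for AutoIt's PCRE engine (an assumption), the atomic groups are replaced by plain (?:...) groups so the sketch does not depend on Python 3.11, and the sample line is made up for illustration.

```python
import re

# Group 2 captures the opening quote; the body is "any char that is not that quote".
MERGED = re.compile(r'''((?:(["'])(?:(?!\2).)*\2)+)|;[^\r\n]*''')

source = '''MsgBox(0, "it's", 'say "hi"') ; "not a string"'''
# Keep strings (group 1), drop comments (group 1 is None for them).
result = MERGED.sub(lambda m: m.group(1) or '', source)
print(repr(result))
```

Unlike the [^"'] body, this version accepts the other quote type inside a string, at the price of one lookahead per character.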

Anyway, problem solved. I hope it taught/showed you as much (or maybe even more) about regular expressions as it did me. ^_^

Thanks for reading (and replying)!

Now I'll integrate this into my...

Edited by Robinson1


Well, using back-reference is still fine using ungreedy quantifiers.

Dim $sStr = '";Comment"";Coment" pe fep fe; fkoepkfeopkfe'  & @CRLF & _
            ' "; wpoekwe''''m, we"";, " we0kew' & @CRLF & _
            ' ''; ko de; ''''l del deld e"" " "'' & @CRLF'
            
            
Dim $sPatt = '((?>(["'']).*?\2)+|;[^\r\n]*)'
ConsoleWrite(StringRegExpReplace($sStr, $sPatt, '<<< \1 >>>') & @LF)


#12 ·  Posted (edited)

Well, using back-reference is still fine using ungreedy quantifiers.

Hehe yeah, what a nice idea.

Instead of that flawed

((?>(["'])[^"']*\2)+) use that one:

((?>(["']).*?\2)+).

Btw, a nice, simple and practical example that has many elements of RE inside, such as

captured/non-captured groups, greedy & ungreedy quantifiers and back-references...

When reading a little more about advanced RE features, RE starts to look like a specialized programming language (like SQL). However, even now that I know more about its syntax, it is still a little cryptic.

Edited by Robinson1
