Remove <a ...> and </a> from text with StringRegExpReplace


String Regular Expressions are a mystical art that I can't quite wrap my head around. But I think they're the best thing to use for what I need.

I want to take a string of text and remove any instances of <a ...> and </a> tags from it. (The <a...> tags might be <a href=whatever> or <a id=whatever> tags.) I just want to replace any of those tags found with nothing, deleting them from the string of text.

I feel like StringRegExpReplace is likely the function I need to use, but I have no clue what incantations it takes to make it do this.

Any regexp gurus want to lend a hand?

Edited by TimRude

When I plug that into a StringRegExpReplace function, it strips out too much.

$sBefore = 'This is a string that contains <a id=something>multiple</a> instances of the <a href=another thing>codes</a> that I want to strip out.'

$sAfter = StringRegExpReplace($sBefore, '(</?a.*/?>)', '')

ConsoleWrite('Before:' & @TAB & $sBefore & @CRLF)
ConsoleWrite('After:' & @TAB & $sAfter & @CRLF)

What I want to end up with is this:

This is a string that contains multiple instances of the codes that I want to strip out.

But my ConsoleWrite output is this:

Before: This is a string that contains <a id=something>multiple</a> instances of the <a href=another thing>codes</a> that I want to strip out.
After:  This is a string that contains  that I want to strip out.

So it's grabbing everything from the first <a to the last > and removing it. The removed span is marked here with [[ ]]:

This is a string that contains [[<a id=something>multiple</a> instances of the <a href=another thing>codes</a>]] that I want to strip out.

Instead, I need to selectively remove just each tag while leaving the stuff that's not inside the tag brackets in place, like this:

This is a string that contains [[<a id=something>]]multiple[[</a>]] instances of the [[<a href=another thing>]]codes[[</a>]] that I want to strip out.

- AND -

It must not strip out any codes other than those that either (1) explicitly begin with '<a ' (there must be a space after the a) and end with '>', or (2) are exactly '</a>'.

I suppose I don't need StringRegExpReplace to take care of the </a> codes, since those are always exactly that and a simple pass with StringReplace will take care of them. But surgically getting rid of the <a ...> codes is my sticking point.

 

Edited by TimRude

17 minutes ago, thezlehman said:

(</?a.*?>)

That does get the individual codes that look like <a...> and those that look like </a>. So far so good.

However, I realized after the first post that it must not touch codes that begin with <a but do not have a space after the a. So a code like <anyother> must be left alone (since there's no space after the a), but a code like <a href=whatever> must be removed.

So after fiddling a bit with your initial pattern and some spelunking through the regexp info in the help, I came up with a pattern that searches for either the <a ...> codes (with a space following the a) or the </a> codes, and it seems to work.

$sBefore = 'Leave <anyother> in. This is a string that contains <a id=something>multiple</a> instances of the <a href=another thing>codes</a> that I want to strip out.'

$sAfter = StringRegExpReplace($sBefore, '(</?a[ ].*?>)|(</a>)', '')

ConsoleWrite('Before:' & @TAB & $sBefore & @CRLF)
ConsoleWrite('After:' & @TAB & $sAfter & @CRLF)

Console output is this:

Before: Leave <anyother> in. This is a string that contains <a id=something>multiple</a> instances of the <a href=another thing>codes</a> that I want to strip out.
After:  Leave <anyother> in. This is a string that contains multiple instances of the codes that I want to strip out.

Thanks for the help.


Here's another way to do it: https://regex101.com/r/kCDJ2E/2

I also like using this site (regex101) as it has some description of what's happening in the right toolbar, and common tokens and expressions as well. So it can explain the regex better than I can, but basically:

<\/?a\b[^>]*?\/?>

  • < - Matches <
  • \/? - Escapes "/" and matches it 0 or 1 times with "?"
  • a - Match your a tag
  • \b - Word boundary, meaning that it'll match only if the previous character (the a) is followed by a non-word character, like whitespace or > (letters, digits and _ count as word characters, so <abbr> or <a_b> won't match)
  • [^>]*? - [^>] matches any character that ISN'T (because of ^) in the character set, so anything except the ">" character. The * repeats it between 0 and unlimited times, and the ? makes it lazy, so it takes as few characters as possible
  • \/? - Match a tag like <a href="meow" /> with a trailing /. Escapes "/" and matches it 0 or 1 times with "?"
  • > - Ends our pattern by finally matching on the first ">" found.

I'm not the best at regex either, but I use it often. The information above is just my understanding of it; if you want more detail, I would definitely check out the description given on the regex101 page.
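
For anyone who wants to try the pattern outside regex101, here is a minimal sketch in AutoIt. The test string and the variable names are made up for illustration; only the pattern comes from the explanation above.

```autoit
; Sketch only: the test string is invented for this example.
Local $sBefore = 'Leave <anyother> in. Strip <a id=x>this</a> and <a href="meow" />that.'
Local $sPattern = '<\/?a\b[^>]*?\/?>'

; Replace every match with an empty string.
Local $sAfter = StringRegExpReplace($sBefore, $sPattern, '')

ConsoleWrite('Before:' & @TAB & $sBefore & @CRLF)
ConsoleWrite('After:' & @TAB & $sAfter & @CRLF)
; After:  Leave <anyother> in. Strip this and that.
```

<anyother> survives because there is no word boundary between the a and the n that follows it.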

 

Edit: Be sure to also check out the AutoIt options for RegEx, especially this one: (?i)

From the helpfile for (?i): Caseless: matching becomes case-insensitive from that point on.

This way you can match on <A> or <a>. So my provided pattern would look like this:

$sPattern = "(?i)<\/?a\b[^>]*?\/?>"
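
A quick way to see what (?i) changes is to run the same pattern with and without it. This is just a sketch; the mixed-case test string is an assumption made up for the comparison.

```autoit
; Sketch only: the test string is invented for this example.
Local $sText = 'Mixed case: <A HREF=x>one</A> and <a href=y>two</a>.'

; Without (?i) only the lowercase tags are matched.
Local $sSensitive = StringRegExpReplace($sText, "<\/?a\b[^>]*?\/?>", '')
; With (?i) the uppercase tags are matched as well.
Local $sCaseless = StringRegExpReplace($sText, "(?i)<\/?a\b[^>]*?\/?>", '')

ConsoleWrite('Case-sensitive:' & @TAB & $sSensitive & @CRLF)
ConsoleWrite('Caseless:' & @TAB & $sCaseless & @CRLF)
; Case-sensitive: Mixed case: <A HREF=x>one</A> and two.
; Caseless:       Mixed case: one and two.
```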

 

Edited by mistersquirrle
Updated RegEx for <a /> format

We ought not to misbehave, but we should look as though we could.


1 hour ago, mistersquirrle said:

(?i): Caseless: matching becomes case-insensitive from that point on.

Definitely a good bit to add to the front. Thanks!

1 hour ago, mistersquirrle said:

$sPattern = "(?i)<\/?a\b[^>]*?\/?>"

That's just voodoo. With what I had ended up with, I could just about make out what it was doing. With yours, it just looks like the curse words in the Sunday comics. No idea how that's working, but it does!


10 hours ago, TimRude said:

With what I had ended up with, I could just about make out what it was doing.

You already did great with your test, which is why you shouldn't give up: keep on testing simple patterns. It's the way to slowly improve your knowledge of RegEx. I constantly feel the same (and probably plenty of users here do too), because our knowledge of RegEx is very limited compared to some gurus on this Forum who can quickly construct powerful patterns.

If you don't mind, I would like to discuss your pattern, to eliminate from it what could be superfluous. Here are your original subject & pattern :

Leave <anyother> in. This is a string that contains <a id=something>multiple</a> instances of the <a href=another thing>codes</a> that I want to strip out.

(</?a[ ].*?>)|(</a>)

(In everything that follows, I'll surround with single quotes every piece of the pattern we're going to discuss, so it will be clearly visible, for example ' ' for a space or '<a ' for the 3 characters surrounded by the single quotes. Please think of these single quotes as 'Forum visual delimiters'; they're not part of the real pattern)

You used the pipe symbol '|' which means OR (i.e. alternation)
That's fine, because you're searching for '<a ...>' or '</a>'

So you wanna match the first '<' followed by 'a' followed by 1 space ' '
Then you can simply start your pattern like this, with 3 characters :

'<a '

Now you want to grab each and every character until the corresponding closing '>' is found. To do this :
=> The dot '.' matches any character (except newline characters by default, we won't discuss it here)
=> The star quantifier '*' repeats the preceding element 0 or more times.
=> And you want a closing '>' at the end of the match.

Then you think : "great, I'll construct my pattern easily ( the part before '|' ) just like this" :

'<a .*>'

Unfortunately, this didn't work, as you discovered by yourself (which is great for your knowledge) :

13 hours ago, TimRude said:

So it's grabbing everything from the first <a to the last >

This is because the star quantifier '*' is greedy by default : '.*' grabs as many characters as it can (up to the end of the line), then backs off one character at a time until a '>' can be matched. The first '>' it reaches that way is the very last one in the subject, so the match runs from the 1st '<a ' to the very last '>', which is not what you want at all.

There is a simple way to make the star '*' ungreedy (aka lazy) : it tells the engine to stop at the very first closing '>' it can reach. A question mark '?' placed just after '*' makes the star quantifier ungreedy (as you correctly did in your original pattern), so now the pattern on the left side of the alternation '|' is functional :

'<a .*?>'
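
To see the difference between greedy and lazy for yourself, here is a minimal sketch. The short test string and the variable names are made up for illustration; only the two patterns come from the discussion above.

```autoit
; Sketch only: the test string is invented for this example.
Local $sSubject = 'Keep <a id=x>this</a> and <a href=y>that</a> too.'

; Greedy: '.*' runs to the last '>' in the subject.
Local $sGreedy = StringRegExpReplace($sSubject, '<a .*>', '')
; Lazy: '.*?' stops at the first '>' after each '<a '.
Local $sLazy = StringRegExpReplace($sSubject, '<a .*?>', '')

ConsoleWrite('Greedy:' & @TAB & $sGreedy & @CRLF)
ConsoleWrite('Lazy:' & @TAB & $sLazy & @CRLF)
; Greedy: Keep  too.
; Lazy:   Keep this</a> and that</a> too.
```

The lazy version still leaves the '</a>' codes in place, which is what the alternation is for.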

The rest is easy as 1-2-3, a pipe symbol '|' followed by a simple '</a>'

'<a .*?>|</a>'

This is what this pattern returns (4 matches) before the Replace process :

[Screenshot: the 4 matches highlighted in the subject: '<a id=something>', '</a>', '<a href=another thing>', '</a>']
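
The matches can also be listed in the console before replacing anything. A small sketch, assuming the standard StringConstants.au3 include for the flag name (its value is 3, "return array of global matches"); the variable names are made up.

```autoit
#include <StringConstants.au3>

Local $sSubject = 'Leave <anyother> in. This is a string that contains <a id=something>multiple</a> instances of the <a href=another thing>codes</a> that I want to strip out.'

; Collect every match of the pattern into an array.
Local $aMatches = StringRegExp($sSubject, '<a .*?>|</a>', $STR_REGEXPARRAYGLOBALMATCH)

If @error Then
    ; @error is set when nothing matched (or the pattern is invalid).
    ConsoleWrite('No matches' & @CRLF)
Else
    For $i = 0 To UBound($aMatches) - 1
        ConsoleWrite(($i + 1) & ': ' & $aMatches[$i] & @CRLF)
    Next
EndIf
; 1: <a id=something>
; 2: </a>
; 3: <a href=another thing>
; 4: </a>
```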

Then you use it directly in the Replace function, replacing all matches with an empty string...

$sAfter = StringRegExpReplace($sBefore, '<a .*?>|</a>', '')

...which will correctly output like this :

Leave <anyother> in. This is a string that contains multiple instances of the codes that I want to strip out.

I hope it's a bit clearer now : there's no need for groups '(...)', for the character class '[ ]', or for the optional '/?' found in your original pattern.

Guys, if I wrote something incorrect, please don't hesitate to indicate it.
Thanks for reading.

Edited by pixelsearch
typo

@pixelsearch Thanks for that mini-tutorial. Makes it much clearer now. I'll also add the '(?i)' option at the beginning as recommended by @mistersquirrle to make it case insensitive to match <a> or <A>, resulting in:

$sAfter = StringRegExpReplace($sBefore, '(?i)<a .*?>|</a>', '')

And now I'll try to understand why '(?i)<\/?a\b[^>]*?\/?>' works.

Edited by TimRude

18 hours ago, mistersquirrle said:

$sPattern = "(?i)<\/?a\b[^>]*?\/?>"

and

18 hours ago, mistersquirrle said:

Here's another way to do it: https://regex101.com/r/kCDJ2E/2

You're right, that site is very helpful for explaining the voodoo that you do. And your explanation in the post, once I took the time to dissect it, makes sense.

However, one part of your pattern seems superfluous, but please correct me if I'm mistaken:

The '[^>]' character class says to match any character that is NOT a '>', and since it's followed by '*?' (0 or more, lazy) that means it will take either 0 or as many characters as there are up until it comes to the next '>' character. Since a '/' character would already be matched by that, I think the next '\/?' code to look specifically for a '/' isn't needed. And in fact, removing that '\/?' bit doesn't prevent it from matching your '<a href="meow" />' example.

So it seems this will suffice, unless I'm mistaken:

$sPattern = "(?i)<\/?a\b[^>]*?>"
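
A quick check of that claim, as a sketch: run both versions of the pattern over the same text and compare the results. The test string is made up for illustration and includes the '<a href="meow" />' form plus an <abbr> tag that should be left alone.

```autoit
; Sketch only: the test string is invented for this example.
Local $sText = 'Self-closing: <a href="meow" />x. Normal: <a id=y>z</a>. Untouched: <abbr>w</abbr>.'

; Pattern with the explicit '\/?' before the closing '>'.
Local $sWith = StringRegExpReplace($sText, "(?i)<\/?a\b[^>]*?\/?>", '')
; Shortened pattern without it.
Local $sWithout = StringRegExpReplace($sText, "(?i)<\/?a\b[^>]*?>", '')

ConsoleWrite('With \/?:' & @TAB & $sWith & @CRLF)
ConsoleWrite('Without:' & @TAB & $sWithout & @CRLF)
ConsoleWrite('Same result: ' & ($sWith == $sWithout) & @CRLF)
; Both lines read: Self-closing: x. Normal: z. Untouched: <abbr>w</abbr>.
```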

 


2 minutes ago, TimRude said:

So it seems this will suffice, unless I'm mistaken:

$sPattern = "(?i)<\/?a\b[^>]*?>"

That's correct, that's what I had originally before I edited it to add '\/?', but you're right, it isn't needed and I was slightly overthinking it (which is usually where unneeded complications come in :) ).
