How to remove from a string all between < and > pairs

Gianni · February 6, 2015

How would you remove everything between the < (less-than) and > (greater-than) brackets, brackets included, and leave only what's outside them?
For example, from the following piece of an HTML table, only the part marked in green should remain (that is, the content inside the table cell), while everything enclosed in < and > pairs should be removed.

Thanks for any solution.

Jos · February 6, 2015

StringRegExpReplace($YourString,"(?U)\<.*\>","")

Jos

MikahS · February 6, 2015

StringRegExpReplace($sString, "(?U)(<.*>)", "")
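
For a string that sits on a single line, the two patterns above give the same result. A minimal sketch, assuming a made-up sample cell ($sCell and $sText are names used only for this illustration, and the brackets are left unescaped):

```autoit
; Hypothetical sample: one table cell on a single line.
Local $sCell = '<TD align="left"><B>Hello</B> world</TD>'

; (?U) makes the quantifiers lazy, so .* stops at the first > after each <.
Local $sText = StringRegExpReplace($sCell, "(?U)<.*>", "")

ConsoleWrite($sText & @CRLF) ; Hello world
```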

Gianni · February 6, 2015

Wow!
It seems to work very well!

Thanks a lot, Jos.

Gianni · February 6, 2015

Wow,
your version works great as well, MikahS.
Thanks a lot to you too.

P.S.

I marked Jos's post as the solution because he was faster.
Many thanks to both.

Edited February 6, 2015 by Chimp

MikahS · February 6, 2015

My pleasure.

jdelaney · February 6, 2015

$oDOMObj.innertext

Edited February 6, 2015 by jdelaney

Gianni · February 6, 2015

Thanks jdelaney,

but I'm working on a table extractor that reads raw HTML, not a browser or DOM objects.

Thanks for the idea anyway.

Gianni · February 8, 2015

... I'm back on this.

The regexp above fails if the checked string spans several lines (it contains @CR or @CRLF)

and the opening and closing brackets are on different lines.

For example, the following string is not parsed correctly:

<TD> Hello
<IMG src="../images/icon.gif" alt=
"Hello pic">
</TD>

So instead of only the word Hello, the two lines below it also remain in the result.

Could someone tell me how to modify the regexp posted above so that it also catches and deletes text enclosed between < and > when the two brackets are on different lines?

Thanks a lot.

Edited February 8, 2015 by Chimp

iamtheky · February 8, 2015

StringRegExpReplace(stringstripws($YourString, 8),"(?U)\<.*\>","")

Gianni · February 8, 2015

Thanks boththose, but I do not want to remove the @CR characters if they are outside the < and >.

Done your way, all the @CR are removed, including those outside the < and > brackets.

For example:

<TD> Hello

Good morning

<IMG src="../images/icon.gif" alt=

"Hello pic">

</TD>

The @CR between Hello and Good morning should remain.

... Is there a way?

Here is a simple reproducer to show the problem:

Local $sHtml, $sHtml2

$sHtml = '<TD>Hello'
$sHtml &= @CRLF & 'Good morning'
$sHtml &= @CRLF & '<IMG src="../images/icon.gif" alt='
$sHtml &= @CRLF & '"Hello pic">'
$sHtml &= @CRLF & ' </TD>'

$sHtml2 = '<TD>Hello' & @CR & 'Good morning<IMG src="../images/icon.gif" alt="Hello pic"> </TD>'

MsgBox(0, "string with < and > on different lines", $sHtml)
MsgBox(0, "Parsed string", StringRegExpReplace($sHtml, "(?U)\<.*\>", "")) ; < and > on different lines, parse fails

MsgBox(0, "string with < and > on same line", $sHtml2)
MsgBox(0, "Parsed string", StringRegExpReplace($sHtml2, "(?U)\<.*\>", "")) ; < and > on same line, parse OK

Edited February 8, 2015 by Chimp

SmOke_N · February 8, 2015

Is this what you're looking for?

StringRegExpReplace($sHtml2, "(?s)(<.*?>)(.*?)(<\s*/.*?>)", "$2")

mikell · February 8, 2015

You must use (?s) to allow the dot to match newlines:

StringRegExpReplace($sHtml2, '(?s)<.*?>', "")
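
To see what (?s) changes, here is a small sketch that runs the same lazy pattern with and without the option. The sample string is built like the reproducer earlier in the thread; $sNoDotAll and $sDotAll are names made up for this illustration.

```autoit
; Sample modeled on the reproducer above: the IMG tag is split over two lines.
Local $sHtml = '<TD>Hello' & @CRLF & 'Good morning' & @CRLF & _
        '<IMG src="../images/icon.gif" alt=' & @CRLF & '"Hello pic">' & @CRLF & ' </TD>'

; Without (?s) the dot stops at a line break, so the split IMG tag is left in place.
Local $sNoDotAll = StringRegExpReplace($sHtml, '<.*?>', "")

; With (?s) the dot also matches @CR and @LF, so the tag is removed across the break.
Local $sDotAll = StringRegExpReplace($sHtml, '(?s)<.*?>', "")

; Square brackets make leading and trailing whitespace visible in the console.
ConsoleWrite("[" & $sNoDotAll & "]" & @CRLF)
ConsoleWrite("[" & $sDotAll & "]" & @CRLF)
```

In both results the line break between Hello and Good morning is kept, because it lies outside any tag.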

SmOke_N · February 8, 2015

Ahh, thought he wanted to keep the img one... btw, as demonstrated above... you don't have to escape the angle brackets.

mikell · February 8, 2015

May I add, I'm not a great fan of the (?U) option because it makes ALL the + and * quantifiers in the expression lazy
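
A related side effect: in PCRE, the regex engine AutoIt uses, (?U) inverts greediness rather than just turning it off, so a quantifier followed by ? becomes greedy again. A small sketch with a made-up string ($sSample is only for illustration):

```autoit
Local $sSample = '<b>bold</b> text'

; Inline lazy quantifier: each tag is matched on its own.
ConsoleWrite(StringRegExpReplace($sSample, '<.*?>', "") & @CRLF) ; bold text

; Same pattern under (?U): .*? is now greedy and runs up to the last >.
ConsoleWrite(StringRegExpReplace($sSample, '(?U)<.*?>', "") & @CRLF) ; " text" (leading space, "bold" is gone)
```

Marking only the quantifier that needs it with ? keeps the rest of the expression behaving as written.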

BTW the usual (and recommended) workaround is

StringRegExpReplace($sHtml2, '<[^>]+>', "")

[^>]+ means: one or more characters other than ">"

Jan Goyvaerts explains this:

In this case, there is a better option than making the plus lazy. We can use a greedy plus and a negated character class: <[^>]+>. The reason why this is better is because of the backtracking. When using the lazy plus, the engine has to backtrack for each character in the HTML tag that it is trying to match. When using the negated character class, no backtracking occurs at all when the string contains valid HTML code. Backtracking slows down the regex engine. You will not notice the difference when doing a single search in a text editor. But you will save plenty of CPU cycles when using such a regex repeatedly in a tight loop in a script that you are writing

http://www.regular-expressions.info/repeat.html
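
Applied to the multi-line case from this thread, the negated character class could look like the sketch below. The sample string follows the earlier reproducer, and $sText is a name chosen only for this example.

```autoit
; Sample modeled on the reproducer above: the IMG tag is split over two lines.
Local $sHtml = '<TD>Hello' & @CRLF & 'Good morning' & @CRLF & _
        '<IMG src="../images/icon.gif" alt=' & @CRLF & '"Hello pic">' & @CRLF & '</TD>'

; [^>] matches any character except >, including @CR and @LF,
; so no (?s) option is needed for tags that span several lines.
Local $sText = StringRegExpReplace($sHtml, '<[^>]+>', "")

; Result: 'Hello' & @CRLF & 'Good morning' & @CRLF & @CRLF
; The line breaks outside the tags are kept.
ConsoleWrite($sText)
```

If the leftover leading or trailing line breaks are not wanted, StringStripWS($sText, 3) should trim both ends without touching the breaks in between. Note that a literal > inside an attribute value would end the match early, so this is a sketch for well-formed table markup rather than a general HTML parser.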

Edited February 8, 2015 by mikell

Gianni · February 8, 2015

Thanks a lot mikell, it works great!

Thanks also for the explanation (... although I don't understand much of it).

Thanks again.
