Chimp

How to remove everything between < and > pairs from a string

16 posts in this topic

How would you remove everything between the < (less-than) and > (greater-than) angle brackets, brackets included, and leave only what's outside them?
For example, from the following piece of code from an HTML table, only the part marked in green should remain (that is, the content inside the table cell, Cell Two), while everything else enclosed in < and > pairs should be removed.

<td bgcolor="#d3d3d3" align="center" valign="middle" rowspan="2"><font size="2" color="#000000" face="verdana"><b>Cell Two</b></font></td>

Thanks for any solution.
 


small minds discuss people average minds discuss events great minds discuss ideas.... and use AutoIt....




StringRegExpReplace($sString, "(?U)(<.*>)", "")
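
As a minimal sketch of how this call behaves, the snippet below applies it to the table cell from the first post. `(?U)` makes `.*` lazy, so each match stops at the nearest `>`. The variable names `$sCell` and `$sText` are only illustrative and do not come from the thread.

```autoit
; Sample cell from the first post; $sCell and $sText are illustrative names.
Local $sCell = '<td bgcolor="#d3d3d3" align="center" valign="middle" rowspan="2">' & _
        '<font size="2" color="#000000" face="verdana"><b>Cell Two</b></font></td>'

; (?U) makes .* lazy, so each match ends at the nearest ">" instead of the last one.
Local $sText = StringRegExpReplace($sCell, "(?U)(<.*>)", "")

ConsoleWrite($sText & @CRLF) ; Cell Two
```

The capturing parentheses are not required here, since the replacement string never refers to the group.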

Snips & Scripts


My Snips: graphCPUTemp ~ getENVvars
My Scripts: Short-Order Encrypter - message and file encryption V1.6.1 ~ AuPad - Notepad written entirely in AutoIt V1.9.4

Feel free to use any of my code for your own use.                                                                                                                                                           Forum FAQ

 


Wow!
Seems to work very well!

Thanks a lot, Jos :)



#5 ·  Posted (edited)

Wow,
your version works great as well, MikahS.
Thanks a lot to you too.

P.S.

I marked Jos's post as the solution because he was faster.
Many thanks to both.
:)

Edited by Chimp


My pleasure. ;)



#7 ·  Posted (edited)

$oDOMObj.innertext

Edited by jdelaney

IEbyXPATH-Grab IE DOM objects by XPATH IEscriptRecord-Makings of an IE script recorder ExcelFromXML-Create Excel docs without excel installed GetAllWindowControls-Output all control data on a given window.



Thanks, jdelaney,

but I'm working on a table extractor that reads raw HTML, not a browser or DOM objects.

Thanks for the idea anyway.



#9 ·  Posted (edited)

... I'm back on this.

The regexp above fails if the string being checked spans multiple lines (it contains @CR or @CRLF)

and the opening and closing angle brackets are on different lines.

For example, the following is not parsed correctly:

<TD> Hello
 <IMG src="../images/icon.gif" alt=
      "Hello pic">
</TD>

So instead of only the word Hello, the two lines below it also remain in the result.

Could someone tell me how to modify the regexp posted above so that it also deletes text enclosed between < and > when the two brackets are on different lines?

Thanks a lot.

Edited by Chimp


StringRegExpReplace(StringStripWS($YourString, 8), "(?U)\<.*\>", "")



#11 ·  Posted (edited)


Thanks, boththose, but I do not want to remove the @CRs if they are outside the < and >.

Done your way, all @CRs are removed, including those outside the < and > brackets.

For example:

<TD> Hello

         Good morning

 <IMG src="../images/icon.gif" alt=

      "Hello pic">

</TD>

The @CR between Hello and Good morning should remain.

... Is there a way?

Here is a simple reproducer to show the problem:

Local $sHtml, $sHtml2

$sHtml = '<TD>Hello'
$sHtml &= @CRLF & 'Good morning'
$sHtml &= @CRLF & '<IMG src="../images/icon.gif" alt='
$sHtml &= @CRLF & '"Hello pic">'
$sHtml &= @CRLF & ' </TD>'

$sHtml2 = '<TD>Hello' & @CR & 'Good morning<IMG src="../images/icon.gif" alt="Hello pic"> </TD>'

MsgBox(0, "string with < and > on different lines", $sHtml)
MsgBox(0, "Parsed string", StringRegExpReplace($sHtml, "(?U)\<.*\>", "")) ; < and > on different lines, parse fails

MsgBox(0, "string with < and > on same line", $sHtml2)
MsgBox(0, "Parsed string", StringRegExpReplace($sHtml2, "(?U)\<.*\>", "")) ; < and > on same line, parse OK
Edited by Chimp


Is this what you're looking for?

StringRegExpReplace($sHtml2, "(?s)(<.*?>)(.*?)(<\s*/.*?>)", "$2")

Common sense plays a role in the basics of understanding AutoIt... If you're lacking in that, do us all a favor, and step away from the computer.


You must use (?s) to allow the dot to match newlines:

StringRegExpReplace($sHtml2, '(?s)<.*?>', "")
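
A small sketch of the effect, using a multi-line sample modelled on the reproducer above (the variable names `$sHtml` and `$sText` are illustrative). With `(?s)` the dot also matches line breaks, so a tag split across two lines is removed while the line breaks outside the tags stay in place.

```autoit
; Multi-line sample modelled on the earlier reproducer; names are illustrative.
Local $sHtml = '<TD>Hello' & @CRLF & 'Good morning' & @CRLF & _
        '<IMG src="../images/icon.gif" alt=' & @CRLF & '"Hello pic">' & @CRLF & '</TD>'

; (?s) lets the dot match line breaks, so the IMG tag spanning two lines is removed too.
Local $sText = StringRegExpReplace($sHtml, '(?s)<.*?>', "")

; Show the surviving line breaks as "|" to make them visible.
ConsoleWrite(StringReplace($sText, @CRLF, "|") & @CRLF) ; Hello|Good morning||
```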


Ahh, I thought he wanted to keep the img one... BTW, as demonstrated above, you don't have to escape the angle brackets.



#15 ·  Posted (edited)

May I add, I'm not a great fan of the (?U) option because it makes ALL the + and * quantifiers in the expression lazy.
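
To illustrate that point, here is a minimal sketch on a made-up input string (not from the thread). Under `(?U)` every quantifier flips: a plain `\d+` becomes lazy, and adding `?` makes it greedy again.

```autoit
; Without (?U), \d+ is greedy and takes the whole run of digits as one match.
ConsoleWrite(StringRegExpReplace("a123b", "\d+", "#") & @CRLF) ; a#b

; With (?U), \d+ turns lazy and matches one digit at a time.
ConsoleWrite(StringRegExpReplace("a123b", "(?U)\d+", "#") & @CRLF) ; a###b

; Under (?U), a trailing ? makes the quantifier greedy again.
ConsoleWrite(StringRegExpReplace("a123b", "(?U)\d+?", "#") & @CRLF) ; a#b
```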

BTW, the usual (and recommended) workaround is:

StringRegExpReplace($sHtml2, '<[^>]+>', "")

[^>]+ means: one or more non-">" characters.

Jan Goyvaerts explains this:

In this case, there is a better option than making the plus lazy. We can use a greedy plus and a negated character class: <[^>]+>. The reason why this is better is because of the backtracking. When using the lazy plus, the engine has to backtrack for each character in the HTML tag that it is trying to match. When using the negated character class, no backtracking occurs at all when the string contains valid HTML code. Backtracking slows down the regex engine. You will not notice the difference when doing a single search in a text editor. But you will save plenty of CPU cycles when using such a regex repeatedly in a tight loop in a script that you are writing

http://www.regular-expressions.info/repeat.html
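
As a rough sketch, the negated character class can be tried on a multi-line sample modelled on the earlier reproducer (variable names are illustrative). Because `[^>]` matches any character other than `>`, line breaks included, neither `(?s)` nor `(?U)` is needed.

```autoit
; Multi-line sample modelled on the earlier reproducer; names are illustrative.
Local $sHtml = '<TD>Hello' & @CRLF & 'Good morning' & @CRLF & _
        '<IMG src="../images/icon.gif" alt=' & @CRLF & '"Hello pic">' & @CRLF & '</TD>'

; [^>] matches any character except ">", line breaks included, so no (?s) is needed.
Local $sText = StringRegExpReplace($sHtml, '<[^>]+>', "")

; Show the surviving line breaks as "|" to make them visible.
ConsoleWrite(StringReplace($sText, @CRLF, "|") & @CRLF) ; Hello|Good morning||
```

One caveat with this pattern: a literal `>` inside an attribute value would end the match early, so it assumes the tags themselves contain no `>`.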

Edited by mikell


Thanks a lot, mikell, it works great!

Thanks also for the explanation (... although I don't understand much of what you're talking about :huh2: )

Thanks again :)

