Removing Characters Contained in HTML Tags

Kevitto

Good afternoon,

I was wondering if there was a simple way of removing everything that is contained between the '<' and '>' characters in a string.

I'm using AutoIT to pull information from HTML files and I need any tags removed.

Example:

<br /><span style="font-size: 14pt; font-weight: normal; font-style: italic;">(téléchargement manuel ou guide de référence)</span> 

I have to strip out everything contained in <> tags.

But the StringReplace function can't help me, because the tags are different depending on the content.  I run about 4,000 files through the script.

Any help is appreciated!

PS:  I'm not including my full code because there are waaaaaaay too many functions in there not relating to this.  I just need to find a way to strip the strings of all tags and their content.

Melba23

Kevitto,

Just what RegExes are designed for: ;)

$sString = '<br /><span style="font-size: 14pt; font-weight: normal; font-style: italic;">(téléchargement manuel ou guide de référence)</span> '
$sStripped = StringRegExpReplace($sString, "(?U)(<.*>)", "")
ConsoleWrite($sStripped & @CRLF)

Decode:

(?U)     - Not greedy - look for smallest match
(<.*>)   - Look for anything between < and >, brackets included
""       - Replace any found strings with an empty string

All clear? :)
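
To make the "not greedy" part of the decode concrete, here is a minimal sketch that runs the same sample string through three patterns. The variable names and the negated-class pattern `<[^>]*>` are additions for illustration and are not part of the snippet above. It is only meant to show how the matches differ, not to be a full HTML parser.

```autoit
; Sample line taken from the original question
Local $sString = '<br /><span style="font-size: 14pt; font-weight: normal; font-style: italic;">(téléchargement manuel ou guide de référence)</span> '

; Ungreedy match via the (?U) option: each tag is matched on its own
Local $sLazy = StringRegExpReplace($sString, "(?U)<.*>", "")

; Negated character class: "<", then any characters except ">", then ">"
Local $sNegated = StringRegExpReplace($sString, "<[^>]*>", "")

; Greedy match for comparison: runs from the first "<" to the last ">" on the line,
; so the text between the tags is removed as well
Local $sGreedy = StringRegExpReplace($sString, "<.*>", "")

ConsoleWrite("Lazy:    [" & $sLazy & "]" & @CRLF)
ConsoleWrite("Negated: [" & $sNegated & "]" & @CRLF)
ConsoleWrite("Greedy:  [" & $sGreedy & "]" & @CRLF)
```

On this single-line sample the first two patterns should give the same result. One difference worth knowing: by default `.` does not match a newline, so the `(?U)<.*>` form will miss a tag that is split across lines, while `<[^>]*>` will still catch it.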

M23


Any of my own code posted anywhere on the forum is available for use by others without any restriction of any kind.

Kevitto

Sir, you are a gentleman and a scholar. 

I've spent a lot of time trying to understand Regexp properly and it always eludes me.

Thank you so much!  Marking as Solved.

MaxG

Kevitto,

I found regular expressions rather cryptic and this site helped me understand them better than any other:

http://regexone.com/

It is interactive, progresses from basic to complicated smoothly, and made all the difference to my understanding.

Kevitto

Thank you, MaxG!  Will definitely check it out.

Jury

As someone else once pointed out, why not use this example from the help file?

; Open a browser with the basic example, read the body text
; (the content with all HTML tags removed) and write it to the console

#include <IE.au3>

Local $oIE = _IECreate("http://www.pri.org/about-pri")
Local $sText = _IEBodyReadText($oIE)
ConsoleWrite($sText & @CRLF)
_IEQuit($oIE)

Or am I missing something?

Kevitto

In case you were wondering, I was only after specific parts of each file. I was reading the file line by line with FileReadLine and searching for specific tags with StringInStr. I just wanted to strip all the tags so I could convert the result to a _Date() format and check it against the current date.

So getting the whole body wouldn't have helped :P.
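
For anyone reading later, the line-by-line approach described above could look roughly like the sketch below. The file name, the marker tag and the `_StripTags` helper are all made up for illustration; they are not from the actual script, and the date conversion step is left out.

```autoit
#include <FileConstants.au3>

; Hypothetical file name and marker tag, chosen only for illustration
Local $sFile = @ScriptDir & "\example.html"
Local $sMarker = "<span"

Local $sLine
Local $hFile = FileOpen($sFile, $FO_READ)
If $hFile = -1 Then
    ConsoleWrite("Could not open " & $sFile & @CRLF)
Else
    While 1
        $sLine = FileReadLine($hFile)
        If @error Then ExitLoop ; end of file (or a read error)

        ; Only handle lines that contain the tag of interest
        If StringInStr($sLine, $sMarker) Then
            ConsoleWrite(_StripTags($sLine) & @CRLF)
        EndIf
    WEnd
    FileClose($hFile)
EndIf

; Hypothetical helper: remove every <...> tag, then trim leading and trailing whitespace
Func _StripTags($sHtml)
    Local $sText = StringRegExpReplace($sHtml, "<[^>]*>", "")
    Return StringStripWS($sText, 3) ; 3 = strip leading (1) + trailing (2)
EndFunc
```

Since each line is handled separately, a tag that starts on one line and ends on the next would not be removed by this sketch.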
