StringRegExp issue

youtuber

I'm not sure whether my regex attempt is right for extracting site titles. From the sample below, I want title1, title2, title3:

<title>title1 | wb1</title>
<title>title2 &#8211; wb2</title>
<title>title3 - wb3</title>

<title>title4 _ wb4</title>

#include <Array.au3>

$sSource = "<title>title1 | wb1</title>" & @CRLF & _
"<title>title2 &#8211 ; wb2</title>" & @CRLF & _
"<title>title3 – wb3</title>" & @CRLF & _
"<title>title4 _ wb4</title>"

$aRegEx = StringRegExp($sSource, '<title>?([^|\-\–\_]+)', 3)
_ArrayDisplay($aRegEx)

 

Edited by youtuber

ripdad

?

$aRegEx = StringRegExp($sSource, '<title>(.*?)\W.*</title>', 3)

 

"The mediocre teacher tells. The Good teacher explains. The superior teacher demonstrates. The great teacher inspires." -William Arthur Ward

Simpel

Or a pattern like this: "<title>([a-zA-Z0-9]+).+<\/title>"

Regards, Conrad 

SciTE4AutoIt = 3.7.3.0   AutoIt = 3.3.14.2   AutoItX64 = 0   OS = Win7Pro SP1   OSArch = X64   Language = 0407/german
H:\...\AutoIt3\SciTE     H:\...\AutoIt3      H:\...\AutoIt3\Include     (H:\ = Network Drive)

Any of my own code posted anywhere on the forum is available for use by others without any restriction of any kind.

TheXman

Or

StringRegExp($sSource, "(?i)<title>\s*([^ ]*)", 3)

Basically it says: skipping any initial whitespace, capture everything after <title> until you encounter a space.  The (?i) makes the search case-insensitive.  If you know that <title> will ALWAYS be lowercase, then you can remove it from the regular expression.  The same is true for the \s*: if you are sure that there will never be whitespace between <title> and the title, then you can remove that too.  :)
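A quick way to see what this pattern does with a multi-word title is to run it against one. The snippet below is only a sketch: $sSample and its second line are made-up test data, not something the OP posted.

```autoit
#include <Array.au3>

; Hypothetical sample data: one single-word title, one multi-word title with leading spaces
Local $sSample = "<title>title1 | wb1</title>" & @CRLF & _
        "<title>  The quick brown fox - wb1</title>"

; \s* skips the leading whitespace; ([^ ]*) then captures up to the first space
Local $aFirstWords = StringRegExp($sSample, "(?i)<title>\s*([^ ]*)", 3)
If @error Then
    ConsoleWrite("No match or bad pattern, @error = " & @error & @CRLF)
Else
    ; The array holds "title1" and "The": only the first word of each title
    ConsoleWrite(_ArrayToString($aFirstWords, @CRLF) & @CRLF)
EndIf
```

So the pattern does not fail on a title with spaces, but it only returns the first word. Note also that [^ ] matches "<" and line breaks, so a title without any space in it would let the capture run past </title>.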

Edited by TheXman

ripdad

They will probably all fail since titles generally have spaces in them.
e.g.: The quick brown fox - wb1

Edited by ripdad

TheXman
8 minutes ago, ripdad said:

They will probably all fail since titles generally have spaces in them.
e.g.: The quick brown fox - wb1
 

No, I don't think you understand the regular expression.  I said that it will skip any INITIAL whitespace.  If you test it, I think you will see that it will not fail.  The OP said he wanted just title1, title2, title3...  He didn't say that everything between the <title> tags was wanted.

 

If everything between <title> and </title> is required, then it is even simpler:

StringRegExp($sSource, "(?i)<title>(.*?)</title>", 3)

:)

Edited by TheXman

ripdad
8 minutes ago, TheXman said:

The OP said he wanted just title1, title2, title3...  He didn't say that everything between the <title> tags was wanted.

I know.

But, like all things with SRE, you have to think of what you overlooked. He wants the titles, but not anything extra in the title. And since titles are generally more than one word separated by spaces, they (the previous codes) will fail.

Except for the last one you posted, which is not what he asked for.

 


TheXman
1 hour ago, youtuber said:

I want title1, title2, title3

I am taking what the OP asked for literally which is all one can do unless he/she adds more specificity.  I think you are making assumptions that were not detailed in the request, that being, more than just title1, title2, title3 is what is wanted.  You could be right or I could be right.  I'm just working off of what was requested, not what I think the OP meant.

Edited by TheXman

ripdad

Okay.


OldGuyWalking

YouTuber -

Thanks for asking the question.  Your example helped me understand the $STR_REGEXPARRAYGLOBALMATCH (3) flag better.  That will come in handy for doing one-off extraction of data from XML and NZB files.

I love accidental learning situations.
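For anyone else reading along, the difference between flag 1 and flag 3 can be seen with a small comparison. This is just an illustrative sketch; $sSample is made-up data, and the constants come from StringConstants.au3.

```autoit
#include <StringConstants.au3>

; Hypothetical two-line sample
Local $sSample = "<title>title1 | wb1</title>" & @CRLF & _
        "<title>title2 - wb2</title>"

; Flag 1 ($STR_REGEXPARRAYMATCH): stops after the first match
Local $aFirst = StringRegExp($sSample, "<title>(\w+)", $STR_REGEXPARRAYMATCH)
ConsoleWrite("Flag 1 elements: " & UBound($aFirst) & @CRLF)

; Flag 3 ($STR_REGEXPARRAYGLOBALMATCH): keeps scanning and collects every match
Local $aAll = StringRegExp($sSample, "<title>(\w+)", $STR_REGEXPARRAYGLOBALMATCH)
ConsoleWrite("Flag 3 elements: " & UBound($aAll) & @CRLF)
```

With flag 1 the array holds only the capture from the first <title>; with flag 3 it holds one capture per <title> found.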

 

TheXman

Although I don't like to make assumptions, assuming @ripdad is correct in terms of the OP wanting everything between the <title> tags except certain characters or strings, here's a possible solution.  It is not meant to be exhaustive, just an example of one way to achieve the goal.  :D

 

#include <Array.au3>
#include <Constants.au3>

test()

;==========================================================================
; test() - Extract each <title> value, then strip separator characters
;==========================================================================
Func test()

    Const $kSource = "<title>title1 | wb1</title>" & @CRLF & _
                     "<title>title2 &#8211 ; wb2</title>" & @CRLF & _
                     "<title>title3 – wb3</title>" & @CRLF & _
                     "<title>title4 _ wb4</title>"

    Local $aTitles


    ;Parse titles into an array
    $aTitles = StringRegExp($kSource, "(?i)<title>(.*?)</title>", 3)
    Switch @error
        Case 1 ; No matches found
            MsgBox($MB_ICONWARNING, "Test", "No matches found - check regular expression")
            Exit 1
        Case 2 ; Invalid regex
            MsgBox($MB_ICONERROR, "Test", StringFormat("Invalid regular expression - error at position %s", @extended))
            Exit 1
    EndSwitch
    _ArrayDisplay($aTitles, "Raw Titles")

    ;Remove unwanted characters from each title in the array
    For $i = 0 To UBound($aTitles) - 1
        $aTitles[$i] = StringReplace($aTitles[$i], "-", "")
        $aTitles[$i] = StringReplace($aTitles[$i], "–", "")
        $aTitles[$i] = StringReplace($aTitles[$i], "|", "")
        $aTitles[$i] = StringReplace($aTitles[$i], "_", "")
        $aTitles[$i] = StringReplace($aTitles[$i], ";", "")
        $aTitles[$i] = StringRegExpReplace($aTitles[$i], "&#\d{4}", "")
        $aTitles[$i] = StringRegExpReplace($aTitles[$i], " +", " ") ;remove extra spaces
    Next
    _ArrayDisplay($aTitles, "Scrubbed Titles")

EndFunc

 

Edited by TheXman
Added error checking example and comments

iamtheky

why so complex for:

#include <Array.au3>

$sSource = "<title>title1 | wb1</title>" & @CRLF & _
"<title>title22 &#8211 ; wb2</title>" & @CRLF & _
"<title>title333 – wb3</title>" & @CRLF & _
"<title>title4444 _ wb4</title>"

;just titlenumber
$aRegEx = StringRegExp($sSource, '<title>(\w+)', 3)
_ArrayDisplay($aRegEx)
;everything
$aRegEx = StringRegExp($sSource, '<title>(.*?)<', 3)
_ArrayDisplay($aRegEx)

 


TheXman
2 minutes ago, iamtheky said:

why so complex for:

Because there is a question as to what the OP actually wants.  If it is everything between the <title> tags except for certain characters (as it appears he/she may have been trying to do in their example), then your snippet would not achieve that goal.  If it is just pulling out the title, then there are a few examples of that too, including yours.

iamtheky

my snippet does both things...

the point was how few words it uses to show both examples rather than having a long ass discussion about the two things it could mean.

i see now multiple similar examples scattered throughout, still a wordy af thread.

 

Edited by iamtheky

TheXman
1 minute ago, iamtheky said:

my snippet does both things...

Are you sure about that?  Yours appears to have all of the characters that the OP appeared to be trying to exclude in their regex. 

Also, I wasn't aware that we were having a contest on who could use the fewest lines of code.

 

ripdad

This is as close as I can get without making my brain bleed. You will have to check for blanks in a For Loop.

e.g.: If $aRegEx[$i] <> '' Then....

#include <Array.au3>

$sSource = "<title>title1 | wb1</title>" & @CRLF & _
"<title>the quick brown fox</title>" & @CRLF & _
"<title>title3 – wb3</title>" & @CRLF & _
"<title>title4 _ wb4</title>"

$aRegEx = StringRegExp($sSource, '<title>(.*?)[\|\-\_\&\#\–].*</title>|<title>(.*?)</title>', 3)
_ArrayDisplay($aRegEx)

I'm sure an SRE guru will come along with a simpler solution.
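The blank check mentioned above could look roughly like the sketch below. It reuses the same pattern; $sSample, $aRaw, $aClean and $iCount are names chosen for illustration, and the trailing-space trim via StringStripWS is an extra assumption about what is wanted. Because the pattern contains an en dash, the script file needs to be saved in a Unicode encoding.

```autoit
#include <Array.au3>

; Hypothetical sample: one title with a separator, one without
Local $sSample = "<title>title1 | wb1</title>" & @CRLF & _
        "<title>the quick brown fox</title>"

Local $aRaw = StringRegExp($sSample, '<title>(.*?)[\|\-\_\&\#\–].*</title>|<title>(.*?)</title>', 3)
If @error Then Exit 1

; Copy only the non-empty captures into a second array
Local $aClean[UBound($aRaw)], $iCount = 0
For $i = 0 To UBound($aRaw) - 1
    If $aRaw[$i] <> '' Then
        ; Flag 3 strips leading and trailing whitespace (e.g. the space before "|")
        $aClean[$iCount] = StringStripWS($aRaw[$i], 3)
        $iCount += 1
    EndIf
Next
If $iCount = 0 Then Exit 1

; Shrink the array to the number of titles actually kept
ReDim $aClean[$iCount]
_ArrayDisplay($aClean, "Titles without blanks")
```

The blanks come from the alternation: when the second branch matches, the first capture group has nothing in it, so an empty element ends up in the result.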

 


kylomas

ripdad,

?

#include <array.au3>

local $str = '<title>title1 | wb1</title>' & @CRLF & _
    '<title>title2 &#8211; wb2</title>' & @CRLF & _
    '<title>title3 - wb3</title>' & @CRLF & _
    '<title>title4 _ wb4</title>'

;msgbox(0,'',$str)

msgbox(0,'',_arraytostring(stringregexp($str,'title>([^<]+).*',3),@CRLF))

kylomas


"I like pigs.  Dogs look up to us.  Cats look down on us.  Pigs treat us as equals."

- Sir Winston Churchill

ripdad

kylomas,

I wish it were that simple.
If you study the code in Post #1, you will see he is trying to remove everything after certain characters within the title.

He just wants the initial title.


OldGuyWalking

My take on this: a slight tweak to the original, and it returns:

title1
title2
title3
title4

#include <Array.au3>

$sSource = "<title>title1 | wb1</title>" & @CRLF & _
"<title>title2 &#8211 ; wb2</title>" & @CRLF & _
"<title>title3 – wb3</title>" & @CRLF & _
"<title>title4 _ wb4</title>"

$aRegEx = StringRegExp($sSource, '<title>?(\w[^|\-\–\_\s]+)', 3)
_ArrayDisplay($aRegEx)

 

youtuber

Obviously, I don't want the special (non-alphanumeric) characters on the left side of the title.

What I need is
title1
title2
title3
title4
title5
title6

#include <Array.au3>

$sSource = "<title>»title1 | wb1</title>" & @CRLF & _
"<title>®☺title2 &#8211 ; wb2</title>" & @CRLF & _
"<title>●title3 – wb3</title>" & @CRLF & _
"<title>-title4 _ wb4</title>" & @CRLF & _
"<title>_title5 _ wb5</title>" & @CRLF & _
"<title> _ title6 _ wb6</title>"

$aRegEx = StringRegExp($sSource, "<title>.*?([a-zA-Z0-9]+).+<\/title>", 3)
_ArrayDisplay($aRegEx)

 

And what if this is what I want:

title-1
tit-le2
tit@le3
title_4
tit-le5
ti_tle6

#include <Array.au3>

$sSource = "<title>»title-1 | wb1</title>" & @CRLF & _
"<title>®☺tit-le2 &#8211 ; wb2</title>" & @CRLF & _
"<title>●tit@le3 – wb3</title>" & @CRLF & _
"<title>-title_4 _ wb4</title>" & @CRLF & _
"<title>_tit-le5 _ wb5</title>" & @CRLF & _
"<title> _ ti_tle6 _ wb6</title>"

$aRegEx = StringRegExp($sSource, "<title>.*?([a-zA-Z0-9]+).+<\/title>", 3)
_ArrayDisplay($aRegEx)


But I do not know what the pattern for that would look like.
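One possible pattern, offered only as a sketch: skip everything after <title> that is not a letter or digit, then capture up to the next whitespace. It assumes the wanted title contains no spaces and that every title has at least one letter or digit. $sSample below is an ASCII-only subset of the lines above, so the snippet does not depend on the script's file encoding.

```autoit
#include <Array.au3>

; Subset of the sample lines above (ASCII only)
Local $sSample = "<title>-title_4 _ wb4</title>" & @CRLF & _
        "<title>_tit-le5 _ wb5</title>" & @CRLF & _
        "<title> _ ti_tle6 _ wb6</title>"

; [^a-zA-Z0-9<]*  skips leading junk up to the first letter or digit
; ([^\s<]+)       captures from there to the next whitespace (or the next tag)
Local $aTitles = StringRegExp($sSample, "<title>[^a-zA-Z0-9<]*([^\s<]+)", 3)
If Not @error Then _ArrayDisplay($aTitles)
```

On these three lines it gives title_4, tit-le5 and ti_tle6. It should treat the title1 to title6 sample the same way, since those titles also start at the first letter and end at the first space.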

 

Edited by youtuber
