StringRegExp issue


youtuber

I'm not sure whether my regex attempt for extracting the site title is right. I want title1, title2, title3 from:

<title>title1 | wb1</title>
<title>title2 &#8211; wb2</title>
<title>title3 - wb3</title>

<title>title4 _ wb4</title>

#include <Array.au3>

$sSource = "<title>title1 | wb1</title>" & @CRLF & _
"<title>title2 &#8211 ; wb2</title>" & @CRLF & _
"<title>title3 – wb3</title>" & @CRLF & _
"<title>title4 _ wb4</title>"

$aRegEx = StringRegExp($sSource, '<title>?([^|\-\–\_]+)', 3)
_ArrayDisplay($aRegEx)

 

Edited by youtuber

Or a pattern like this: "<title>([a-zA-Z0-9]+).+<\/title>"

Regards, Conrad 


Any of my own code posted anywhere on the forum is available for use by others without any restriction of any kind.


Or

StringRegExp($sSource, "(?i)<title>\s*([^ ]*)", 3)

Basically it says: skip any initial whitespace, then capture everything after <title> until you encounter a space.  The (?i) makes the search case-insensitive.  If you know that <title> will ALWAYS be lowercase, then you can remove it from the regular expression.  The same is true for the \s*: if you are sure that there will never be any whitespace between <title> and the title, then you can remove it also.  :)

Edited by TheXman

They will probably all fail since titles generally have spaces in them.
ie: The quick brown fox - wb1
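
A quick way to check this is to run the pattern from the previous post against a multi-word title. This is only a sketch; the sample string is hypothetical, built from the example above.

```autoit
; Hypothetical multi-word title, based on the example above
Local $sSource = "<title>The quick brown fox - wb1</title>"

; Pattern from the previous post: capture everything up to the first space
Local $aRegEx = StringRegExp($sSource, "(?i)<title>\s*([^ ]*)", 3)

; The capture stops at the first space, so only the first word is returned
If Not @error Then ConsoleWrite($aRegEx[0] & @CRLF) ; writes "The"
```

So the pattern does not error out, but it only returns the first word of a multi-word title.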

Edited by ripdad

"The mediocre teacher tells. The Good teacher explains. The superior teacher demonstrates. The great teacher inspires." -William Arthur Ward


8 minutes ago, ripdad said:

They will probably all fail since titles generally have spaces in them.
ie: The quick brown fox - wb1
 

No, I don't think you understand the regular expression.  I said that it will remove any INITIAL whitespace.  If you test it, I think you will see that it will not fail.  The OP said he wanted just title1, title2, title3...  He didn't say that everything between the <title> tags was wanted.

 

If everything between <title> and </title> is required, then it is even simpler:

StringRegExp($sSource, "(?i)<title>(.*?)</title>", 3)

:)

Edited by TheXman

8 minutes ago, TheXman said:

The OP said he wanted just title1, title2, title3...  He didn't say that everything between the <title> tags was wanted.

I know.

But, like all things with SRE, you have to think of what you overlooked. He wants the titles, but not anything extra in the title. And since titles are generally more than one word separated by spaces, they (the previous patterns) will fail.

Except for the last one you posted, which is not what he asked for.

 


1 hour ago, youtuber said:

I want title1, title2, title3

I am taking what the OP asked for literally, which is all one can do unless he/she adds more specifics.  I think you are making an assumption that was not stated in the request, namely that more than just title1, title2, title3 is wanted.  You could be right or I could be right.  I'm just working off of what was requested, not what I think the OP meant.

Edited by TheXman

Although I don't like to make assumptions, assuming @ripdad is correct in terms of the OP wanting everything between the <title> tags except certain characters or strings, here's a possible solution.  It is not meant to be exhaustive, just an example of one way to achieve the goal.  :D

 

#include <Array.au3>
#include <Constants.au3>

test()

;==========================================================================
; test() - Parse the contents of each <title> element, then scrub separator characters
;==========================================================================
Func test()

    Const $kSource = "<title>title1 | wb1</title>" & @CRLF & _
                     "<title>title2 &#8211 ; wb2</title>" & @CRLF & _
                     "<title>title3 – wb3</title>" & @CRLF & _
                     "<title>title4 _ wb4</title>"

    Local $aTitles


    ;Parse titles into an array
    $aTitles = StringRegExp($kSource, "(?i)<title>(.*?)</title>", 3)
    Switch @error
        Case 1 ; No matches found
            MsgBox($MB_ICONWARNING, "Test", "No matches found - check regular expression")
            Exit 1
        Case 2 ; Invalid regex
            MsgBox($MB_ICONERROR, "Test", StringFormat("Invalid regular expression - error at position %s", @extended))
            Exit 1
    EndSwitch
    _ArrayDisplay($aTitles, "Raw Titles")

    ;Remove unwanted characters from each title in the array
    For $i = 0 To UBound($aTitles) - 1
        $aTitles[$i] = StringReplace($aTitles[$i], "-", "")
        $aTitles[$i] = StringReplace($aTitles[$i], "–", "")
        $aTitles[$i] = StringReplace($aTitles[$i], "|", "")
        $aTitles[$i] = StringReplace($aTitles[$i], "_", "")
        $aTitles[$i] = StringReplace($aTitles[$i], ";", "")
        $aTitles[$i] = StringRegExpReplace($aTitles[$i], "&#\d{4}", "")
        $aTitles[$i] = StringRegExpReplace($aTitles[$i], " +", " ") ;remove extra spaces
    Next
    _ArrayDisplay($aTitles, "Scrubbed Titles")

EndFunc

 

Edited by TheXman
Added error checking example and comments

why so complex for:

#include <Array.au3>

$sSource = "<title>title1 | wb1</title>" & @CRLF & _
"<title>title22 &#8211 ; wb2</title>" & @CRLF & _
"<title>title333 – wb3</title>" & @CRLF & _
"<title>title4444 _ wb4</title>"

;just titlenumber
$aRegEx = StringRegExp($sSource, '<title>(\w+)', 3)
_ArrayDisplay($aRegEx)
;everything
$aRegEx = StringRegExp($sSource, '<title>(.*?)<', 3)
_ArrayDisplay($aRegEx)

 

iamtheky


2 minutes ago, iamtheky said:

why so complex for:

Because there is a question as to what the OP actually wants.  If it is everything between the <title> tags except for certain characters (as it appears he/she may have been trying to do in their example), then your snippet would not achieve that goal.  If it is just pulling out the title, then there are a few examples of that too, including yours.


my snippet does both things...

the point was how few words it uses to show both examples rather than having a long ass discussion about the two things it could mean.

i see now multiple similar examples scattered throughout, still a wordy af thread.

 

Edited by iamtheky


1 minute ago, iamtheky said:

my snippet does both things...

Are you sure about that?  Yours appears to have all of the characters that the OP appeared to be trying to exclude in their regex. 

Also, I wasn't aware that we were having a contest on who could use the fewest lines of code.

 


This is as close as I can get without making my brain bleed. You will have to check for blanks in a For Loop.

ie: If $aRegEx[$i] <> '' Then....

#include <Array.au3>

$sSource = "<title>title1 | wb1</title>" & @CRLF & _
"<title>the quick brown fox</title>" & @CRLF & _
"<title>title3 – wb3</title>" & @CRLF & _
"<title>title4 _ wb4</title>"

$aRegEx = StringRegExp($sSource, '<title>(.*?)[\|\-\_\&\#\–].*</title>|<title>(.*?)</title>', 3)
_ArrayDisplay($aRegEx)
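
The blank check mentioned above could look something like the following. It is only a sketch: the blanks appear to come from the alternation, because a title without a separator is matched by the second alternative, which leaves the first capture group as an empty element in the array. The names $sTitle and $sTitles are made up for this example, and StringStripWS is used on the assumption that the space left before the separator is also unwanted.

```autoit
; Same sample input as in the snippet above
Local $sSource = "<title>title1 | wb1</title>" & @CRLF & _
        "<title>the quick brown fox</title>" & @CRLF & _
        "<title>title3 – wb3</title>" & @CRLF & _
        "<title>title4 _ wb4</title>"

Local $aRegEx = StringRegExp($sSource, '<title>(.*?)[\|\-\_\&\#\–].*</title>|<title>(.*?)</title>', 3)
If @error Then Exit 1 ; no match or invalid pattern

Local $sTitle, $sTitles = ""
For $i = 0 To UBound($aRegEx) - 1
    ; Flag 3 = strip leading and trailing whitespace
    $sTitle = StringStripWS($aRegEx[$i], 3)
    ; Skip the empty elements left by the capture group that did not take part in the match
    If $sTitle <> '' Then $sTitles &= $sTitle & @CRLF
Next

ConsoleWrite($sTitles)
```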

I'm sure an SRE guru will come along with a simpler solution.

 


ripdad,

?

#include <array.au3>

local $str = '<title>title1 | wb1</title>' & @CRLF & _
    '<title>title2 &#8211; wb2</title>' & @CRLF & _
    '<title>title3 - wb3</title>' & @CRLF & _
    '<title>title4 _ wb4</title>'

;msgbox(0,'',$str)

msgbox(0,'',_arraytostring(stringregexp($str,'title>([^<]+).*',3),@CRLF))

kylomas


"I like pigs.  Dogs look up to us.  Cats look down on us.  Pigs treat us as equals."

- Sir Winston Churchill


kylomas,

I wish it were that simple.
If you study the code in Post #1, you will see he is trying to remove everything after certain characters within the title.

He just wants the initial title.


My take on this: a slight tweak to the original, and it returns:

title1
title2
title3
title4

#include <Array.au3>

$sSource = "<title>title1 | wb1</title>" & @CRLF & _
"<title>title2 &#8211 ; wb2</title>" & @CRLF & _
"<title>title3 – wb3</title>" & @CRLF & _
"<title>title4 _ wb4</title>"

$aRegEx = StringRegExp($sSource, '<title>?(\w[^|\-\–\_\s]+)', 3)
_ArrayDisplay($aRegEx)

 


Obviously, I don't want the special characters on the left side.

What I need is
title1
title2
title3
title4
title5
title6

#include <Array.au3>

$sSource = "<title>»title1 | wb1</title>" & @CRLF & _
"<title>®☺title2 &#8211 ; wb2</title>" & @CRLF & _
"<title>●title3 – wb3</title>" & @CRLF & _
"<title>-title4 _ wb4</title>" & @CRLF & _
"<title>_title5 _ wb5</title>" & @CRLF & _
"<title> _ title6 _ wb6</title>"

$aRegEx = StringRegExp($sSource, "<title>.*?([a-zA-Z0-9]+).+<\/title>", 3)
_ArrayDisplay($aRegEx)

 

And what if this is what I want:

title-1
tit-le2
tit@le3
title_4
tit-le5
ti_tle6

#include <Array.au3>

$sSource = "<title>»title-1 | wb1</title>" & @CRLF & _
"<title>®☺tit-le2 &#8211 ; wb2</title>" & @CRLF & _
"<title>●tit@le3 – wb3</title>" & @CRLF & _
"<title>-title_4 _ wb4</title>" & @CRLF & _
"<title>_tit-le5 _ wb5</title>" & @CRLF & _
"<title> _ ti_tle6 _ wb6</title>"

$aRegEx = StringRegExp($sSource, "<title>.*?([a-zA-Z0-9]+).+<\/title>", 3)
_ArrayDisplay($aRegEx)


but I do not know what the pattern would look like. :think:
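
One possible pattern for this second case, offered only as a sketch: skip any leading characters that are not ASCII letters or digits, then capture everything up to the next whitespace or "<". It assumes that each title starts with a letter or digit, contains no spaces, and is followed by a space or the closing tag. None of that is confirmed in the thread, so treat the pattern as a starting point.

```autoit
#include <Array.au3>

; Sample input copied from the post above
Local $sSource = "<title>»title-1 | wb1</title>" & @CRLF & _
        "<title>®☺tit-le2 &#8211 ; wb2</title>" & @CRLF & _
        "<title>●tit@le3 – wb3</title>" & @CRLF & _
        "<title>-title_4 _ wb4</title>" & @CRLF & _
        "<title>_tit-le5 _ wb5</title>" & @CRLF & _
        "<title> _ ti_tle6 _ wb6</title>"

; [^a-zA-Z0-9<]*  skips leading symbols, spaces, dashes and underscores
; ([^\s<]+)       captures the title up to the next whitespace or "<"
Local $aRegEx = StringRegExp($sSource, "<title>[^a-zA-Z0-9<]*([^\s<]+)", 3)
If @error Then
    ConsoleWrite("No match or invalid pattern" & @CRLF)
Else
    _ArrayDisplay($aRegEx)
EndIf
```

With the sample input this should return title-1, tit-le2, tit@le3, title_4, tit-le5 and ti_tle6. The same pattern should also cover the first case (title1 to title6), since those titles contain no spaces either.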

 

Edited by youtuber