StringRegExp issue


youtuber

Then this will match:

#include <Array.au3>

$sSource = "<title>»title-1 | wb1</title>" & @CRLF & _
"<title>®?tit-le2 &#8211 ; wb2</title>" & @CRLF & _
"<title>?tit@le3 – wb3</title>" & @CRLF & _
"<title>-title_4 _ wb4</title>" & @CRLF & _
"<title>_tit-le5 _ wb5</title>" & @CRLF & _
"<title> _ ti_tle6 _ wb6</title>"

$aRegEx = StringRegExp($sSource, "<title>.*?([a-zA-Z][a-zA-Z0-9\-_@]+).+<\/title>", 3)
_ArrayDisplay($aRegEx)

You have to list the allowed symbols such as _, @ and - inside the character class (the - has to be escaped with \, hence \-). But you don't want these extra symbols at the beginning of the match. Therefore the capture first has to match a single [a-zA-Z].
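To illustrate why the leading [a-zA-Z] matters, here is a small sketch that runs one of the sample titles through the pattern above and through a hypothetical variant without the leading letter (the variant is my own, only for comparison):

```autoit
; Sketch: compare the pattern from this post with a hypothetical
; variant that lacks the leading [a-zA-Z].
Local $sLine = "<title>-title_4 _ wb4</title>"

; Hypothetical variant: the capture may start with any allowed symbol.
Local $aLoose = StringRegExp($sLine, "<title>.*?([a-zA-Z0-9\-_@]+).+<\/title>", 3)

; Pattern from the post: the capture must start with a letter.
Local $aStrict = StringRegExp($sLine, "<title>.*?([a-zA-Z][a-zA-Z0-9\-_@]+).+<\/title>", 3)

ConsoleWrite("without leading letter: " & $aLoose[0] & @CRLF)  ; -title_4
ConsoleWrite("with leading letter:    " & $aStrict[0] & @CRLF) ; title_4
```

Without the leading letter the lazy .*? has no reason to skip the "-", so it ends up inside the capture.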

Conrad

SciTE4AutoIt = 3.7.3.0   AutoIt = 3.3.14.2   AutoItX64 = 0   OS = Win_10   Build = 19044   OSArch = X64   Language = 0407/german
H:\...\AutoIt3\SciTE     H:\...\AutoIt3      H:\...\AutoIt3\Include     (H:\ = Network Drive)

Any of my own code posted anywhere on the forum is available for use by others without any restriction of any kind.


This one works too, at least for the moment, thanks to POSIX classes:

#include <Array.au3>

$sSource = "<title>»title-1 | wb1</title>" & @CRLF & _
"<title>®?tit-le2 &#8211 ; wb2</title>" & @CRLF & _
"<title>?tit@le3 – wb3</title>" & @CRLF & _
"<title>-title_4 _ wb4</title>" & @CRLF & _
"<title>_tit-le5 _ wb5</title>" & @CRLF & _
"<title> _ ti_tle6 _ wb6</title>"

$aRegEx = StringRegExp($sSource, '<title>[^[:alnum:]]*(.*?)\s[^[:alnum:]]', 3)

_ArrayDisplay($aRegEx)

But if you add one more requirement, it will become hazardous, if not nearly impossible   :)
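A quick sketch of what this pattern depends on, using two made-up titles that are not from the thread: the capture only ends at whitespace followed by a non-alphanumeric character, so a title without such a separator gives no match at all.

```autoit
; Sketch with made-up input: the POSIX-class pattern needs
; "whitespace + non-alphanumeric" after the title to stop the capture.
Local $sPattern = '<title>[^[:alnum:]]*(.*?)\s[^[:alnum:]]'

; A space inside the title is fine, because " t" is not a valid end.
Local $aHit = StringRegExp("<title>-my title | wb1</title>", $sPattern, 3)
ConsoleWrite($aHit[0] & @CRLF) ; my title

; No whitespace at all: nothing can end the capture, so no match.
Local $aMiss = StringRegExp("<title>title7</title>", $sPattern, 3)
Local $iError = @error
ConsoleWrite("@error = " & $iError & @CRLF) ; @error = 1 (no match)
```

So each new kind of title or separator can require another adjustment of the pattern.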


32 minutes ago, mikell said:

The magic universal regex which gets anything you want whatever your requirements doesn't exist

blasphemy! stone him!

Also, apologies for claiming a false trophy. I thought I had manufactured the most ridiculous edge cases, but I stand so very corrected by every single post in this thread. I'm not even top 5.

Edited by iamtheky


#include <Array.au3>

$sSource = "<title>»title-1 | wb1</title>" & @CRLF & _
"<title>®?tit-le2 &#8211 ; wb2</title>" & @CRLF & _
"<title>?tit@le3 – wb3</title>" & @CRLF & _
"<title>-title_4 _ wb4</title>" & @CRLF & _
"<title>_tit-le5 _ wb5</title>" & @CRLF & _
"<title> _ ti_tle6 _ wb6</title>" & @CRLF & _
"<title>title7 &#8211 ; wb2</title>" & @CRLF & _
"<title>title8 – wb3</title>" & @CRLF & _
"<title>title9 _ wb4</title>"

;$aRegEx = StringRegExp($sSource, "<title>.*?([a-zA-Z0-9]+).+<\/title>", 3)
;_ArrayDisplay($aRegEx)

$aRegEx = StringRegExp($sSource, '<title>[\W\s\_]*([\S]+)', 3)
_ArrayDisplay($aRegEx)

I threw in the data you'd used earlier as well to show that this works for both.

The first half of the pattern:  <title>[\W\s\_]*

Match (and then discard) any run of \W non-word characters (that is, [^A-Za-z0-9_]), \s whitespace, and underscores after the <title> tag. The underscore has to be listed explicitly: it counts as a \w word character, so without it an underscore would start the capture in the second half. The * makes this part optional (zero or more characters).

The second half of the pattern:  ([\S]+)
Capture a run of \S non-whitespace characters, stopping at the first \s whitespace character.

If there is a space in the data you want to capture, you'll only get the characters up to the space.
Say you wanted 'ti tle6' captured from "<title> _ ti tle6 _ wb6</title>": you would only get the 'ti' part.
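A minimal sketch of that limitation. The first input is the example just mentioned; the second one (a title with no whitespace at all) is an extra case I made up to show that \S+ then runs on into the closing tag:

```autoit
; Sketch: where the ([\S]+) capture stops.
Local $sPattern = '<title>[\W\s\_]*([\S]+)'

; Space inside the wanted text: the capture ends at the space.
Local $aSpace = StringRegExp("<title> _ ti tle6 _ wb6</title>", $sPattern, 3)
ConsoleWrite($aSpace[0] & @CRLF) ; ti

; Made-up case without any whitespace: the capture keeps going.
Local $aNoSpace = StringRegExp("<title>title7</title>", $sPattern, 3)
ConsoleWrite($aNoSpace[0] & @CRLF) ; title7</title>
```

So the pattern assumes the title is a single word followed by whitespace, as in the sample data.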

For this set of data I got:
title-1
tit-le2
tit@le3
title_4
tit-le5
ti_tle6
title7
title8
title9

Does this work for you?

One of the more powerful features of regular expressions is the character classes, which can shorten a pattern (depending on what you're trying to do, of course).  Below is a short list. There are times when a single character class can replace a lot of code. Not always, but it's a good idea to use them when possible.

Character Classes
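\s White space
\S Not white space
\d Digit
\D Not digit
\w Word character (letter, digit or underscore)
\W Not a word character
[[:cntrl:]] Control character (in PCRE, \c is not a class; \cx matches the single control character Ctrl+x)
[[:xdigit:]] Hexadecimal digit (in PCRE, \x is not a class; \xhh matches the character with hex code hh)
[0-7] Octal digit (PCRE has no \O class)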
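
As a small illustration of a class shortening a pattern, here is a sketch with a made-up input string; the spelled-out range and the \d shorthand should return the same matches:

```autoit
#include <Array.au3>

; Sketch with made-up input: [0-9] and \d find the same digit runs.
Local $sText = "wb1 wb22 wb333"

Local $aLong = StringRegExp($sText, "[0-9]+", 3)
Local $aShort = StringRegExp($sText, "\d+", 3)

ConsoleWrite(_ArrayToString($aLong, ",") & @CRLF)  ; 1,22,333
ConsoleWrite(_ArrayToString($aShort, ",") & @CRLF) ; 1,22,333
```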

Edited by OldGuyWalking
Minor edit. Added list of some RegEx character classes.