
Using RegEx to remove the first two characters


Go to solution Solved by TheXman,


I have data that follows a pattern.  Within that pattern, I want to trim the first two digits when they are "10" and keep the rest of the string.  For example, if the data is this:

Input      ==>  Desired Output

1019002LP  ==>  19002LP
1024001LW  ==>  24001LW
1040001LP  ==>  40001LP
1019001LP  ==>  19001LP
1150001LP  ==>  1150001LP

The string is always 9 characters long.  The first seven characters are always digits and the last two are always letters.  Below is my starting code.  Thanks

 

Const $TEST_DATA = "Lorem ipsum dolor sit amet, consectetur adipiscing 1019002LP elit, " & _
        "sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut 1024001LW enim ad minim veniam," & _
        "quis nostrud exercitation ullamco 1040001LP laboris nisi ut aliquip 1019001LP ex ea commodo consequat. Duis aute " & _
        "irure dolor in reprehenderit in voluptate velit 1150001LP esse cillum dolore eu fugiat nulla pariatur. "

example()

Func example()
    Local $sPattern = "(?i)\h+\b(10\d{7}\w{2})\b"
    Local $sOutPut = StringRegExpReplace($TEST_DATA, $sPattern, @TAB & "\1") ;Find the pattern and trim the first two digits, if those digits are 10
    ConsoleWrite($sOutPut & @CRLF)

EndFunc   ;==>example

 

  • Solution

Maybe something like this?

Const $TEST_DATA = _
        "Lorem ipsum dolor sit amet, consectetur adipiscing 1019002LP elit, " & _
        "sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut 1024001LW enim ad minim veniam," & _
        "quis nostrud exercitation ullamco 1040001LP laboris nisi ut aliquip 1019001LP ex ea commodo consequat. Duis aute " & _
        "irure dolor in reprehenderit in voluptate velit 1150001LP esse cillum dolore eu fugiat nulla pariatur. "

example()

Func example()
    Local $sPattern = "(?i)\b10(\d{5}[a-z]{2})\b"
    Local $sOutPut  = StringRegExpReplace($TEST_DATA, $sPattern, "\1")
    ConsoleWrite($sOutPut & @CRLF)
EndFunc   ;==>example

Console output:

Lorem ipsum dolor sit amet, consectetur adipiscing 19002LP elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut 24001LW enim ad minim veniam,quis nostrud exercitation ullamco 40001LP laboris nisi ut aliquip 19001LP ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit 1150001LP esse cillum dolore eu fugiat nulla pariatur.

Regular Expression Description Below:

Spoiler

(?i)\b10(\d{5}[a-z]{2})\b
---------------------------------

  • Use these options for the whole regular expression «(?i)»
    • Case insensitive «i»
  • Assert position at a word boundary (position preceded or followed—but not both—by an ASCII letter, digit, or underscore) «\b»
  • Match the character string “10” literally «10»
  • Match the regex below and capture its match into backreference number 1 «(\d{5}[a-z]{2})»
    • Match a single character that is a “digit” (ASCII 0–9 only) «\d{5}»
      • Exactly 5 times «{5}»
    • Match a single character in the range between “a” and “z” (case insensitive) «[a-z]{2}»
      • Exactly 2 times «{2}»
  • Assert position at a word boundary (position preceded or followed—but not both—by an ASCII letter, digit, or underscore) «\b»

Created with RegexBuddy

 


Your regular expression had a few issues:

"(?i)\h+\b(10\d{7}\w{2})\b"
  1. Not sure why you have "\h+" in your regular expression.  You didn't specify that you wanted to trim the leading spaces.
  2. If you exclude "10", then there would only be 5 remaining digits in the capture group, not 7.
  3. "10" should not be inside your capture group.  You said that you only want to capture the remaining 5 digits and the 2 letters at the end.
  4. "\w" includes [A-Za-z0-9_] which means it would have also included strings with numeric digits and/or underscores at the end, not just letters.
  5. Not a significant issue but, your regular expression did not need "(?i)" because you didn't specify letters within the expression.  So case was not an issue.
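
To see the effect of points 2 and 3, here is a minimal sketch that runs both patterns against a single token. The token and variable names are only illustrative; `@extended` holds the number of replacements that `StringRegExpReplace` performed.

```autoit
; Illustrative token: one of the 9-character codes with a space on each side.
Local $sToken = " 1019002LP "

; Original pattern: needs "10" + 7 digits + 2 word characters (11 characters),
; so it cannot match a 9-character code.
Local $sOld = StringRegExpReplace($sToken, "(?i)\h+\b(10\d{7}\w{2})\b", @TAB & "\1")
Local $iOldCount = @extended
ConsoleWrite("Original pattern replacements:  " & $iOldCount & @CRLF)

; Corrected pattern: "10" sits outside the group, which holds 5 digits + 2 letters.
Local $sNew = StringRegExpReplace($sToken, "(?i)\b10(\d{5}[a-z]{2})\b", "\1")
Local $iNewCount = @extended
ConsoleWrite("Corrected pattern replacements: " & $iNewCount & " -> [" & $sNew & "]" & @CRLF)
```

The first call should report zero replacements and the second should report one, leaving the leading and trailing spaces untouched.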
Edited by TheXman
1 hour ago, TheXman said:

"\w" includes [A-Za-z0-9_] which means it would have also included strings with numeric digits and/or underscores at the end, not just letters.

Shouldn't a pattern be as explicit as possible ? (IMHO)
"\b10(\d{5}L[PW])\b"

22 hours ago, mikell said:

Shouldn't a pattern be as explicit as possible ? (IMHO)
"\b10(\d{5}L[PW])\b"

 

23 hours ago, zuladabef said:

The string is always 9 characters long.  The first seven characters are always digits and the last two are always letters.

It should only be as explicit as the requester states that it should be (IMHO).  There is no explicit mention of "LP" or "LW" being a requirement.  That is only your assumption based on the data that was supplied.  As a matter of fact, it only states that the last 2 characters "are always letters".  Your assumption may end up being correct, but it doesn't match the stated requirements in the original post.

 

Maybe you are getting this request mixed up with a different request that was made in a previous topic:

 

Edited by TheXman

@TheXman

This is a super helpful conversation!

 

2 hours ago, TheXman said:

Your regular expression had a few issues:

"(?i)\h+\b(10\d{7}\w{2})\b"
  1. Not sure why you have "\h+" in your regular expression.  You didn't specify that you wanted to trim the leading spaces.
  2. If you exclude "10", then there would only be 5 remaining digits in the capture group, not 7.
  3. "10" should not be inside your capture group.  You said that you only want to capture the remaining 5 digits and the 2 letters at the end.
  4. "\w" includes [A-Za-z0-9_] which means it would have also included strings with numeric digits and/or underscores at the end, not just letters.
  5. Not a significant issue but, your regular expression did not need "(?i)" because you didn't specify letters within the expression.  So case was not an issue.

 

  1. You have a fair point that I didn't mention needing \h+, but I do want it in there.
  2. Still trying to understand this, but I think point 3 will clarify
  3. So this is where I am not clear on the syntax.  I am telling RegEx to find a ten character string, so don't I need the 10 in there?  Or, are you saying that the 10 is signifying that I want the output to be 10 characters long? 
  4. That is a great clarifier.  I just assumed "w" means word, which implies only letters.
  5. Gotcha, so I can take it out.  Case is not important since the data is always all caps.
  6. You were right about the last two characters.  The supplied data was only two instances, but it could be any combination of two letters.

 

 

15 hours ago, zuladabef said:
  • So this is where I am not clear on the syntax.  I am telling RegEx to find a ten character string, so don't I need the 10 in there?  Or, are you saying that the 10 is signifying that I want the output to be 10 characters long? 

You said that the 9-character string starts with 7 numeric digits followed by 2 letters.  You also said that the 9-character strings of interest start with "10" .  You then stated that the result should trim "10" from the beginning and leave the remaining 7 characters (5 digits + 2 letters).  So #2 is saying that your regular expression was looking for ...(10\d{7}\w{2})...  That would mean "10" + 7 digits + 2 word-chars.  That would be a total of 11 characters, which is not what you wanted.  You want to find: "10" + 5 digits + 2 letters.  Does that make sense?

Capture groups are surrounded by parens "()".  You said you want to strip the leading "10" from the 9-character string and only keep the last 7 characters (5 digits + 2 letters).  Therefore, "10" should not be inside the capture group.  You had "...(10\d{7}\w{2})...".  I suggested "...10(\d{5}[a-z]{2})...", because the "10" is not wanted in the result and capture group 1 should only be the last 5 digits + 2 letters (not the "10" + 7 digits + 2 word-chars).
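
As a small sketch of that difference, the snippet below asks `StringRegExp` (flag 1, which returns an array of the captured groups) what group 1 holds when "10" is inside the parentheses and when it is outside. The token and variable names are only illustrative, and both patterns use `\d{5}` so that each one matches.

```autoit
; Illustrative token taken from the sample data.
Local $sToken = "1019002LP"

; "10" inside the parentheses: group 1 includes the prefix.
Local $aInside = StringRegExp($sToken, "(?i)\b(10\d{5}[a-z]{2})\b", 1)
If Not @error Then ConsoleWrite("10 inside the group:  " & $aInside[0] & @CRLF)

; "10" outside the parentheses: it is matched but not captured.
Local $aOutside = StringRegExp($sToken, "(?i)\b10(\d{5}[a-z]{2})\b", 1)
If Not @error Then ConsoleWrite("10 outside the group: " & $aOutside[0] & @CRLF)
```

Because the replacement string "\1" writes back only what group 1 captured, the second form is the one that drops the "10".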

15 hours ago, zuladabef said:

Gotcha, so I can take it out.  Case is not important since the data is always all caps.

Case wasn't necessary in the regular expression that you supplied.  However, it is required in mine because of "[a-z]".  So as the detailed description shows behind the "Reveal hidden contents" banner, "(?i)" at the beginning of the regex means everything after this is not case-sensitive, including [a-z].  Regular expressions are literal and case sensitive, unless modifiers are used.  Therefore, without "(?i)", my regular expression would only match if it found lower case letters from a-z.  Try removing "(?i)" from my regular expression and rerunning the script.  You will see that nothing is replaced because all of the strings of interest have uppercase letters at the end. 
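
The experiment described above can be sketched as follows. The token and variable names are only illustrative; the third call is an added variation that uses an explicit uppercase class instead of the "(?i)" option.

```autoit
; Illustrative token with uppercase letters, as in the sample data.
Local $sToken = "1019002LP"

; With (?i): [a-z] also matches uppercase letters.
Local $sWith = StringRegExpReplace($sToken, "(?i)\b10(\d{5}[a-z]{2})\b", "\1")
Local $iWith = @extended

; Without (?i): [a-z] matches lowercase letters only, so "LP" does not match.
Local $sWithout = StringRegExpReplace($sToken, "\b10(\d{5}[a-z]{2})\b", "\1")
Local $iWithout = @extended

; Variation: an explicit uppercase class needs no option at all.
Local $sUpper = StringRegExpReplace($sToken, "\b10(\d{5}[A-Z]{2})\b", "\1")
Local $iUpper = @extended

ConsoleWrite("(?i) + [a-z]: " & $iWith & " replacement(s) -> " & $sWith & @CRLF)
ConsoleWrite("[a-z] only:   " & $iWithout & " replacement(s) -> " & $sWithout & @CRLF)
ConsoleWrite("[A-Z] only:   " & $iUpper & " replacement(s) -> " & $sUpper & @CRLF)
```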

Did you click on the "Reveal hidden contents" banner in my previous post?  It gave a detailed description of the entire regular expression that I suggested.  It should have answered some of the questions that you had above.

Edited by TheXman
17 hours ago, TheXman said:

As a matter of fact, it only states that the last 2 characters "are always letters"

I once learned from a regex master (jchd) that the more accurate the pattern, the more reliable the result.
So it should be  [[:alpha:]] or [A-Z]  (as these last 2 letters are always uppercase in the OP's various posts)  - with the case insensitivity option if needed.
Of course \w  works, but as you mentioned previously it also matches digits and underscores, which seems improper. If a generic term is allowed, then the dot works too  :idiot:
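
A short sketch of why the looser class can over-match. The second and third tokens are invented for illustration and do not come from the OP's data.

```autoit
; First token is from the sample data; the other two are made-up counterexamples.
Local $aTokens[3] = ["1019002LP", "1019002_9", "101900299"]

For $i = 0 To UBound($aTokens) - 1
    ; With the default flag, StringRegExp returns 1 on a match and 0 otherwise.
    Local $iLoose = StringRegExp($aTokens[$i], "\b10\d{5}\w{2}\b")
    Local $iStrict = StringRegExp($aTokens[$i], "\b10\d{5}[A-Z]{2}\b")
    ConsoleWrite($aTokens[$i] & "   \w{2}: " & $iLoose & "   [A-Z]{2}: " & $iStrict & @CRLF)
Next
```

Only the first token should pass the [A-Z]{2} version, while all three should pass the \w{2} version.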


@TheXman 

This is great!  Thanks for the further clarification, I am trying to learn this material and your explanations have been very helpful.  I did read the hidden contents originally, but I think I just needed further clarification to help me absorb it some more, so thank you for taking the time to do that for me.

The data looks much cleaner now.  So now, the data has a set of "00" in the middle of the string that can be replaced with a "-".  I thought I understood how to do that, but it's failing.  I thought something like this might do the trick, so maybe you can show me where I went wrong.

 

Const $TEST_DATA = _
        "Lorem ipsum dolor sit amet, consectetur adipiscing 1019002LP elit, " & _
        "sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut 1024001LW enim ad minim veniam," & _
        "quis nostrud exercitation ullamco 1040001LP laboris nisi ut aliquip 1019001LP ex ea commodo consequat. Duis aute " & _
        "irure dolor in reprehenderit in voluptate velit 1150001LP esse cillum dolore eu fugiat nulla pariatur. "


example()

Func example()
    ;Find the first pattern | replace the 10 with nothing
    Local $sPattern = "(?i)\b10(\d{5}[a-z]{2})\b" 
    Local $sOutPut = StringRegExpReplace($TEST_DATA, $sPattern, "\1")
    ;Find the second pattern | replace the 11 with 1
    $sPattern = "(?i)\b11(\d{5}[a-z]{2})\b"  
    $sOutPut = StringRegExpReplace($sOutPut, $sPattern, "1\1")
    ;Find the third pattern | replace the 12 with 2
    $sPattern = "(?i)\b22(\d{5}[a-z]{2})\b" 
    $sOutPut = StringRegExpReplace($sOutPut, $sPattern, "2\1")
     ;Find the fourth pattern | replaced the 00 with -
    $sPattern = "(?i)\b(\d{5})00([a-z]{2})\b"
    $sOutPut = StringRegExpReplace($sOutPut, , "-\1")
    ConsoleWrite($sOutPut & @CRLF)
EndFunc   ;==>example

Here are examples of what else can be found in the data:

142001LP
250001LP
17003UP
19002LP
02002LP
19003LP
21002LP
17003LP
07001LP
06003LP
12002LP
25001LP
07002LP
24001LW
01001LW
28004LP
40001LP
19004LP
15003LW
 

 

40 minutes ago, zuladabef said:

So now, the data has a set of "00" in the middle of the string that can be replaced with a "-".  I thought I understood how to do that, but it's failing.  I thought something like this might do the trick, so maybe you can show me where I went wrong.

Guess the regex isn't clear to you yet, as your newly defined regex (\d{5})00([a-z]{2}) expects:

  • Capture group1: 5 digits => \d{5}
  • 00  => 00
  • Capture group2: 2 Letters =>[a-z]{2}

Output: the replacement string is missing the second capture group.
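
For illustration only, here is a sketch of a replacement that uses both capture groups. It rests on two assumptions the thread does not confirm: that after the earlier steps a code looks like 2 digits + "00" + 1 digit + 2 letters (for example 19002LP), and that the wanted result is 19-2LP. The sample string is hypothetical.

```autoit
; ASSUMPTION: codes look like 2 digits + "00" + 1 digit + 2 letters (e.g. 19002LP)
; and the wanted result is 19-2LP. Neither point is confirmed in the thread.
Local $sData = "adipiscing 19002LP elit 24001LW enim 1150001LP esse"

; Group 1 = the two digits before "00"; group 2 = the digit and two letters after it.
Local $sPattern = "(?i)\b(\d{2})00(\d[a-z]{2})\b"

; Both groups appear in the replacement; the "00" between them becomes "-".
Local $sOutPut = StringRegExpReplace($sData, $sPattern, "\1-\2")
ConsoleWrite($sOutPut & @CRLF)
```

Under those assumptions the 9-character code 1150001LP is left alone, because its third and fourth characters are not "00".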

So try to understand what you are doing and try again. Use this website to help you build the proper regex: https://regex101.com/

Jos

 

Edited by Jos


I have no patience for chasing moving targets.  If you have an ultimate goal in mind, then explain the whole thing in detail, not a piece at a time.  Explaining it a piece at a time makes me think that you don't have a well thought out plan and that you are doing whatever it is you are trying to do on-the-fly.  It's also a tactic sometimes used by lazy people that are trying to not-so-slyly get others to do their work for them.   In either case, as far as I'm concerned, that would be a waste of my time and theirs.  If you would rather do it a piece-at-a-time, that's fine.  It just won't be with my continued help because I won't willingly allow anyone to waste my time.

Also, as I suggested previously, you should provide comprehensive test data and you should also show what the expected results should be for the data that you supply.  Doing so removes some of the guess-work that exists when you don't fully and accurately explain the help that you need.  It also forces you to better understand and describe what the problem is, the help that you require, and maybe even catch errors before asking for help. 

First your test data was 9 bytes long.  Now your test data is 7 & 8 bytes long, with no 9 byte strings.   Some of the patterns that you say you are looking for don't even exist within the supplied test data, e.g. "Replace leading 12 with 2".   There are no strings with a leading "12".  You also say that you want to "replace 12 with 2" but your regex is looking for "22" and not "12".  The last StringRegExpReplace() doesn't even include the pattern!  It doesn't even look like you are trying.  :no:

 

Edited by TheXman

@TheXman

My apologies for offending you.  It is not my intention to take advantage of your kindness.  I simply am trying to understand and I truly appreciate the time you have taken to explain these complex matters.  I hope you can understand that while a lot of this syntax seems simple to you, I am struggling to learn it.  I am not a professional working in IT; I am an elementary school teacher.  I am trying to learn how to become more proficient at this kind of stuff, so I can spend more time teaching and less time copying and pasting.   I am not trying to get someone else to do the work for me.  I genuinely want to understand. 

While I tried my best to plan out my needs in advance, other ideas came to mind as I continued to dig around.  Mostly, I did not know how powerful RegEx could be, so I didn't even think most of this could even be accomplished.  As I read more of your posts and looked through articles online, I began to see that RegEx could accomplish even more than I knew and so I set my sights higher.  

Again, my apologies for upsetting you.  I hope this message finds you well.

Edited by zuladabef

I can only second @TheXman's advice to first lay out all of your precise requirements in unambiguous plain English, being sure to cover all your real-world cases. Providing an exhaustive set of examples, data-in and expected data-out, is a must-have, of course.

Yes, Regex is a very powerful language by itself, even if it uses a rather uncommon syntax. The flavor used in AutoIt is even Turing-complete, which means it could (in theory) solve any problem solvable in finite time, albeit with absolutely unmanageable complexity.

Staying within relatively simple tasks, you should first examine each different form of input data and decide what processing each must go through. Lay that out and things will magically become clearer.
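
One way to put that advice into practice is a small table of inputs and expected outputs that is checked in a loop. The sketch below uses the five pairs from the first post and the pattern from the accepted answer; the array name and the output layout are only illustrative.

```autoit
; Each row is [input, expected output], copied from the first post.
Local $aCases[5][2] = [ _
        ["1019002LP", "19002LP"], _
        ["1024001LW", "24001LW"], _
        ["1040001LP", "40001LP"], _
        ["1019001LP", "19001LP"], _
        ["1150001LP", "1150001LP"]]

For $i = 0 To UBound($aCases) - 1
    Local $sActual = StringRegExpReplace($aCases[$i][0], "(?i)\b10(\d{5}[a-z]{2})\b", "\1")
    ; == is AutoIt's case-sensitive string comparison.
    Local $sStatus = ($sActual == $aCases[$i][1]) ? "OK  " : "FAIL"
    ConsoleWrite($sStatus & " " & $aCases[$i][0] & " -> " & $sActual & @CRLF)
Next
```

New requirements can then be added as new rows before the pattern is touched, which makes a changed target visible straight away.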

In fact it's common to see neophytes explain only part of their regex problem, only to discover later that other cases are not covered by their initial request. It's hard for those people to be both precise and exhaustive in the first place. That's how we all learn. OTOH, it's a little painful for helpers to try to follow what we call such a "moving target".


I totally agree with everything @jchd said.  The only thing that I would add is that there's nothing wrong with asking for help in learning & understanding regular expressions (or any other directly or indirectly related technical subject).  But if that is your goal, then it is better to simply state it, explicitly, up front so that those who choose to engage do so knowingly and willingly as opposed to trying to do it under the guise of solving a real world problem or task that keeps changing.  I enjoy helping others learn new concepts, skills, and techniques.  And as it says under my avatar, "Semper volens auxilium" (Always willing to help).  If I had known up front that you were trying to learn as opposed to appearing to play whack-a-mole, I wouldn't have gotten so frustrated.  🙂

Edited by TheXman

@JockoDundee

I don't help people for "Likes".  I couldn't care less about receiving "Likes" or even being liked.  ;)  However, it is nice to know that one's efforts are appreciated.  Therefore, a simple written thanks (not the icon) will suffice.  Please don't misunderstand, I gladly accept "Likes" or "Thanks" as a token of appreciation.  It's just that to me, there's no difference between a "Like" and someone simply writing thanks.  For some, how many "Likes" they can rack up is a competition or some sort of badge of honor.  My point is that I'm not one of those people.

Edited by TheXman
Refined my statements for Jocko :)
18 minutes ago, TheXman said:

I don't care about or help people for "Likes".  I couldn't care less about "Likes"

Yes, but even so you care enough to give one, not because you want one, but because it’s a way to show appreciation.  

Edited by JockoDundee


@TheXman

I totally appreciate your position.  While I am trying to learn, I am also solving a real world problem.  Or maybe better stated, trying to solve a real world problem is forcing me to learn something new (RegEx) and I am trying to take on that challenge.  Since my profession is teaching, I probably forget how it can be frustrating for others when there is a moving target; whereas for me, that is my entire existence as a teacher.  Even though the subject matter is the same every year, when you're faced with a classroom of 30 eight-year olds, I have to find multiple ways to explain the subject matter because every kid has a different starting point and their minds work differently.  Some understand it quicker than others.  And sometimes the subject matter takes us to unexpected places.  When we were talking about dying stars, one student asked me, "If all stars die, then why do people tell us to wish upon a star?"

All of that to say, sorry if I did not anticipate how frustrating it could be to try and solve a moving target.  Furthermore, I understand your position in "not doing it for the likes."  As a teacher you don't get into this job for the likes, because they are few.  And you certainly don't get into teaching for the money, because there is none.  You get into teaching, because you hope you can make a better future for the children in front of you.

Thank you again for your time and patience.


@zuladabef

No worries.  I wasn't offended.  I was just getting a little frustrated by the incremental additional requirements that were popping up and the obvious issues with the script, the changing data, and the lack of expected results.

In my signature, you will find links to a couple of articles that discuss how to get better answers to technical questions.  I think that the articles contain some useful information, suggestions, and tips.  You may find some of those tips relevant and useful too.

 

1 hour ago, zuladabef said:

Since my profession is teaching, I probably forget how it can be frustrating for others when there is a moving target; whereas for me, that is my entire existence as a teacher.  Even though the subject matter is the same every year

Here's an analogy from my point of view.  As a teacher, I'm sure you have an overall goal that defines what your students should learn by the end of the school year or session.  That goal is defined by some "authority".  Now imagine that throughout the school year that "authority" keeps adding additional learning goals to that list of things that your students need to know by the end of the year or session.  Not only do they keep adding additional learning goals, but they don't even clearly define those new goals, which makes it even harder for you, the teacher, to meet those new goals.  I'm sure in that scenario you, as the teacher, would probably start to get a little frustrated too.  :)

One other point: one of the articles that I referred to above talks about concentrating on the problem, not your solution (#7 Describe the goal, not your attempted solution).  The problem you are trying to solve is cleaning up your data.  Don't get caught up with a particular solution (regular expressions) when there are many ways to solve the problem.  If learning regular expressions is your main goal, then forge ahead with regular expressions.  If cleaning up the data is your main goal, then state exactly how the data needs to be cleaned up and then you may get replies with multiple ways to do it (including regular expressions).  Basically, I'm saying don't limit yourself to any given solution.  Regular expressions are powerful but they are rarely the only solution.  In my opinion, the best solution is one that you understand and can maintain.  Those solutions may not be the most efficient or the fastest, but they are the best because they get the job done, you understand them, and you can modify them without any assistance.  You can always learn new things, like regular expressions, and then go back and implement what you've learned to make the process more efficient.  :)
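
As one example of a non-regex route, the sketch below trims a leading "10" from a single code using only AutoIt's built-in string functions. The helper name `_TrimLeading10` is hypothetical, and the sketch assumes each code has already been isolated from the surrounding text.

```autoit
; Hypothetical helper: strips a leading "10" from one 9-character code without a regex.
; It follows the rules stated in the first post: 7 digits followed by 2 letters.
Func _TrimLeading10($sCode)
    If StringLen($sCode) <> 9 Then Return $sCode ; wrong length: leave as-is
    If StringLeft($sCode, 2) <> "10" Then Return $sCode ; does not start with "10"
    If Not StringIsDigit(StringLeft($sCode, 7)) Then Return $sCode ; first 7 must be digits
    If Not StringIsAlpha(StringRight($sCode, 2)) Then Return $sCode ; last 2 must be letters
    Return StringTrimLeft($sCode, 2) ; drop the "10"
EndFunc   ;==>_TrimLeading10

ConsoleWrite(_TrimLeading10("1019002LP") & @CRLF)
ConsoleWrite(_TrimLeading10("1150001LP") & @CRLF)
```

It is longer than the one-line regex, but every step can be read and changed without knowing any pattern syntax.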

Edited by TheXman

@TheXman

That is a perfect analogy and that is exactly what happens in education on a regular basis.  New curriculum, new textbooks, new research on brain development, new bosses, new parents, new exams, etc.  It is a constant source of frustration, for sure.  So, I get your point.

It's a great point you make about cleaning up the data versus learning.  It is always a balance of making a pragmatic choice of just getting the task done (short-term) versus putting in a little bit more time to understand and learn something that may be more powerful (long-term).  It's tough to know how to balance that.  It's also tough to know as the person learning it, how long will it take to learn this?  Obviously, I don't know that answer until after I've learned it.  At first, it seems simple.  A few lines of code, a few parameters and so I say, let's give this a try.  Then it ends up going down a rabbit hole and you're more confused in the end than you were in the beginning, but you've learned a few cool tricks along the way.  I recognize your point to be true and valid.  It's just tough to know ahead of time.

I would like to comment on my experience learning so far.    Sometimes when I ask for help, the person that does the helping is quick to give a final solution.  And, I can learn from that because you see a model of how to do it correctly and can make some inferences about it.  Other times though, the person helping doesn't want to give you the final solution.  They expect you, as the person asking the question, to learn on your own how to do it, but they will give you a nudge in the right direction.  Both styles of helpers are useful.  Both are necessary.  Both provide something different to the end-user (the learner).  What is tough as the person asking the initial question (the learner), is you don't know in a forum, what kind of person is going to answer you.  In a formal classroom, you get to know the teacher and know their expectations and how they may/may not help you.  In a forum, that's different.  You don't know who is going to help.  I love the AutoIt forums, because the community is so willing to help, even if the questions extend past strictly AutoIt types of questions.  But still, the uncertainty for the learner still remains. 

That makes it tough for the learner, because how then do they know how to ask the question?  Should they ask a short and tidy question that someone might be more willing to answer?  Or, do they lay out the entire scope of the project (the best they know how) and risk not getting any/little help at all, because someone will interpret it as them wanting to have someone else do all the work for them?  All of that to say, even if our intentions are the best, it may still be interpreted as lazy, incompetent, or trying to take advantage of someone else.  I can't speak for everybody, only myself.  But, I want to learn, get the task(s) done, and be a viable member of this lovely community.  Thanks again for everything.
