need help with regex

Burgaud · July 17, 2016

I would like to OCR-scan JPG files and extract the numbers from them.

This seems to do the trick:

StringRegExp($temp, '[0-9\.]+', 3)

However, I realized that the OCR app often reads capital I and lowercase L (l) as 1, so words like "Will" frequently end up giving a match of "11". I noticed that real numbers are always preceded by +, - or a space. So I would like to match numbers only if they are preceded by one of these three characters [+- ], without having that character as part of the regex result. How do I do that?

Sorry, my regex knowledge is too basic to work this out.

Melba23 · July 17, 2016

Burgaud,

Use a "lookbehind" to check whether the match is preceded by one of those 3 characters:

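; Sample strings: only those with a space, + or - before the digits should match
Global $aList[] = ["+11", "i11", "-11", "a11", " 11"]

For $i = 0 To UBound($aList) - 1
    ; (?<=[ +-]) is a lookbehind: the preceding character must be a space, + or -,
    ; but it does not become part of the match
    $fMatch = False
    If StringRegExp($aList[$i], "(?<=[ +-])(\d+)") Then
        $fMatch = True
    EndIf
    ConsoleWrite($aList[$i] & " : " & $fMatch & @CRLF)
Next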

M23
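For the original goal of extracting the numbers rather than only testing for a match, the same lookbehind can be combined with flag 3, as in the first post. This is a minimal sketch, not part of the script above; the sample text in $sText is made up for illustration, and the character class [0-9.] is taken from the original question so decimals are kept.

```autoit
; Hypothetical OCR output: "Wi11" stands for a misread "Will", the other numbers are real.
Local $sText = "Wi11 pay +12.5 or -3 before 40 days"

; Flag 3 returns every match in an array.
; The lookbehind requires a space, + or - before the number without including it in the result.
Local $aNums = StringRegExp($sText, "(?<=[ +-])[0-9.]+", 3)

If @error Then
    ConsoleWrite("No numbers found" & @CRLF)
Else
    ; Print each extracted number on its own line
    For $i = 0 To UBound($aNums) - 1
        ConsoleWrite($aNums[$i] & @CRLF)
    Next
EndIf
```

With this sample the "11" inside "Wi11" is skipped because it follows a letter. Note that a number at the very start of the text has no preceding character, so this pattern skips it as well.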

Burgaud · July 17, 2016

(?<=[ +-])

Regex lookbehind has eluded me for many years.

Thanks for your simple script, I finally understand this awesome feature.

+1 for the education

iamtheky · July 18, 2016

What's the difference, and the benefit, between that regex and

[\s\+\-](\d+)

mikell · July 18, 2016

iamtheky,
Because the OP's particular sample is very simple, the answer is: none.
Furthermore, regex101 says that the lookbehind takes more steps than the simple expression.

Edit
But I strongly suspect that Melba chose the lookbehind for teaching purposes, to introduce the concept, which is extremely powerful and useful in more complex situations.

Edited July 18, 2016 by mikell
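One place where the two patterns do behave differently is the full match: the character class consumes the sign, the lookbehind does not. The sketch below uses flag 2, which returns the full match in element 0 and the capture group in element 1; the sample string "-11" comes from Melba23's list, the variable names are made up for illustration.

```autoit
Local $sText = "-11"

; Flag 2: element [0] is the full match, element [1] is the first capture group.
Local $aPlain = StringRegExp($sText, "[\s\+\-](\d+)", 2)
Local $aLook = StringRegExp($sText, "(?<=[ +-])(\d+)", 2)

; The character class makes the sign part of the full match
ConsoleWrite("plain class : full='" & $aPlain[0] & "' group='" & $aPlain[1] & "'" & @CRLF)
; The lookbehind only checks the sign, so the full match is the number alone
ConsoleWrite("lookbehind  : full='" & $aLook[0] & "' group='" & $aLook[1] & "'" & @CRLF)
```

The capture group is "11" in both cases, which is why the result is the same when only the group is used. Also note that \s accepts any whitespace (tabs, line breaks), while the space inside [ +-] accepts a plain space only.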

iamtheky · July 18, 2016

Sure. Because I'm so bad at regex, lookbehind never even occurs to me as a possibility. I can't figure out why and when to use it until I figure out exactly what it is.

jchd · July 18, 2016

See http://www.pcre.org/original/doc/html/pcrepattern.html for more information about features and gory details.

EDIT: forgot to mention that AutoIt currently uses the "legacy" version of PCRE, now nicknamed PCRE1. As the PCRE main webpage explains, PCRE has been substantively rewritten as PCRE2. While 99.9% of the regexp features are compatible, some corner cases have been fixed or changed. The main changes are in the library interface functions. So do not refer to PCRE2 documentation until a new version of AutoIt is made available with explicit support for it.

Edited July 18, 2016 by jchd

jguinch · July 18, 2016

@iamtheky :

To understand how lookaround assertions work, here is an example:

You have the string A123B456C789 and you want to capture each number enclosed by letters (123 and 456).
What comes to mind is probably a regex like [A-Z](\d+)[A-Z]. In this case you get only one result (123), because the regex consumes the specified characters (A123B) and then continues to search from the position after the first match. What remains of the string is "456C789": the B has been consumed, so 456 can no longer be considered enclosed by 2 letters.

To prevent the regex from consuming characters, you can use a lookaround assertion: [A-Z](\d+)(?=[A-Z])
With this regex, the letter after the number is not consumed: it means take one letter, then a number, and look ahead to see if there is a letter. A lookaround does not consume characters, so the first match for this regex consumes A123 and nothing more. The next search starts from B, so the capture works and 456 is captured. The last search starts from C, and so on.

I could have used a lookbehind for this job: (?<=[A-Z])(\d+)[A-Z] works as well.

I hope this helps you understand.
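The three patterns from this post can be compared side by side with a short script. This is only a sketch; _ShowCaptures is a hypothetical helper written for this illustration, not an AutoIt built-in.

```autoit
Local $sText = "A123B456C789"

_ShowCaptures($sText, "[A-Z](\d+)[A-Z]")      ; both letters consumed
_ShowCaptures($sText, "[A-Z](\d+)(?=[A-Z])")  ; trailing letter only looked at
_ShowCaptures($sText, "(?<=[A-Z])(\d+)[A-Z]") ; leading letter only looked at

; Hypothetical helper: prints all captures found by $sPattern in $sText and returns them as one string
Func _ShowCaptures($sText, $sPattern)
    Local $aHits = StringRegExp($sText, $sPattern, 3) ; flag 3 = array of all captures
    If @error Then
        ConsoleWrite($sPattern & " -> no match" & @CRLF)
        Return ""
    EndIf
    Local $sOut = ""
    For $i = 0 To UBound($aHits) - 1
        $sOut &= $aHits[$i] & " "
    Next
    ConsoleWrite($sPattern & " -> " & $sOut & @CRLF)
    Return $sOut
EndFunc
```

The first pattern should report 123 only, while both lookaround variants should report 123 and 456.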

iamtheky · July 18, 2016

But is there a larger problem you are solving where you would not just use:

\D?(\d+)\D

Or is that a lookaround as well?

Edited July 18, 2016 by iamtheky

jguinch · July 18, 2016

\D? matches any single non-digit. If it matches, the non-digit is consumed, but with "?" it is not mandatory, so it also works with 123 when there is no letter before it (which is not what I wanted to do in my example).
Look at the difference between the two regexes here (the consumed characters appear in blue and green):

https://regex101.com/r/uG2lZ7/1
https://regex101.com/r/uG2lZ7/2

The first link works, the second does not.
It's not easy to explain; I spent a lot of time doing tests before I understood it.
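The point about the optional \D? can also be checked locally. The sketch below is not taken from the regex101 links; the test string "123B456C" is made up so that the first number has no letter in front of it.

```autoit
; Made-up test string: 123 has no letter before it, 456 is enclosed by B and C
Local $sText = "123B456C"

; \D? is optional, so 123 is accepted even without a leading letter
Local $aLoose = StringRegExp($sText, "\D?(\d+)\D", 3)

; The leading [A-Z] is mandatory here, so only 456 qualifies
Local $aStrict = StringRegExp($sText, "[A-Z](\d+)(?=[A-Z])", 3)

ConsoleWrite("\D?(\d+)\D          : " & UBound($aLoose) & " capture(s), first = " & $aLoose[0] & @CRLF)
ConsoleWrite("[A-Z](\d+)(?=[A-Z]) : " & UBound($aStrict) & " capture(s), first = " & $aStrict[0] & @CRLF)
```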

Another example: check if a string contains some desired words.
You have to check if a string contains the words "iamtheky", "King" and "Regex"; the order does not matter.

^(?=.*iamtheky)(?=.*King)(?=.*Regex) : the lookahead assertion keeps the current position (the beginning) and searches to the right. First, it looks for .*iamtheky and comes back to the current position. If the string is found, the regex continues the job (looks to the right for .*King and comes back again). If the string is not found, the regex fails.

So it matches both "iamtheky is the future King of Regex" and "The future Regex's King is iamtheky".
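A quick way to try this is to run the pattern against a few strings. In this sketch the first two test strings are the ones quoted above; the third is an assumed counter-example that lacks the word "Regex".

```autoit
Local $sPattern = "^(?=.*iamtheky)(?=.*King)(?=.*Regex)"
Local $aTests[] = ["iamtheky is the future King of Regex", _
        "The future Regex's King is iamtheky", _
        "iamtheky is the future King"]

For $i = 0 To UBound($aTests) - 1
    ; With the default flag 0, StringRegExp returns 1 on a match and 0 otherwise
    ConsoleWrite(StringRegExp($aTests[$i], $sPattern) & " : " & $aTests[$i] & @CRLF)
Next
```

The first two lines should start with 1 and the third with 0. The pattern is case-sensitive as written, so "king" would not count as "King".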

iamtheky · July 18, 2016

Ah, I didn't realize it made it unnecessary.
