Burgaud

need help with regex


I would like to OCR-scan JPG files and extract the numbers.

This seems to do the trick:

StringRegExp($temp, '[0-9\.]+', 3)

However, I realized that the OCR app often reads a capital I or a lowercase L (l) as 1, so words like "Will" are frequently recognized as "11". I noticed that real numbers are preceded by either +, - or a space. So I would like to match numbers only if they are preceded by one of these three characters, [ +-], without including that character in the regex result. How do I do that?

 

Sorry, my regex knowledge is too basic to work out how to do it :(

 

 




Burgaud,

Use a "lookbehind" to check whether the match is preceded by one of those 3 characters:

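; Test strings: only "+11", "-11" and " 11" should match
Global $aList[] = ["+11", "i11", "-11", "a11", " 11"]

For $i = 0 To UBound($aList) - 1

    $fMatch = False
    ; (?<=[ +-]) requires a space, "+" or "-" just before the digits without consuming it
    If StringRegExp($aList[$i], "(?<=[ +-])(\d+)") Then
        $fMatch = True
    EndIf
    ; Print the tested string and whether it matched
    ConsoleWrite($aList[$i] & " : " & $fMatch & @CRLF)
Next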
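
To tie this back to the original extraction call: a minimal sketch, assuming the same lookbehind is placed in front of the OP's own [0-9\.]+ pattern and flag 3 is kept so that all matches come back in an array. The sample string is made up for illustration and is not real OCR output.

```autoit
; Hypothetical OCR text: the leading "11" stands for a misread "Will"
Local $sText = "11 +12.5 -7 42"

; Flag 3 returns an array with every match; the lookbehind keeps the
; leading "+", "-" or space out of the returned values
Local $aNumbers = StringRegExp($sText, "(?<=[ +-])[0-9.]+", 3)

If @error Then
    ConsoleWrite("No numbers found" & @CRLF)
Else
    For $sNumber In $aNumbers
        ConsoleWrite($sNumber & @CRLF) ; 12.5, then 7, then 42
    Next
EndIf
```

The "11" at the start is skipped because nothing from the [ +-] set comes before it.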

M23


(?<=[ +-])

Regex lookbehind has eluded me for many years.

Thanks for your simple script, I finally understand this awesome feature.

+1 for the education


What is the difference, and what are the benefits, between that regex and

[\s\+\-](\d+)

 



#5 ·  Posted (edited)

iamtheky,
Because the OP's particular sample is very simple, the answer is: none :)
Furthermore, regex101 says that the lookbehind takes more steps than the simple expression.

Edit
But I strongly suspect that Melba chose the lookbehind for teaching purposes, to introduce the concept, which is extremely powerful and useful in more complex situations ;)
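
A quick way to check the "none" for yourself: a minimal sketch, assuming flag 3 so that only the captured group (\d+) is returned by both patterns. The sample string is invented for the test.

```autoit
; Hypothetical sample mixing valid and invalid prefixes
Local $sText = "i11 +12 -7 a3 42"

; Lookbehind version: the prefix is checked but never consumed
Local $aLook = StringRegExp($sText, "(?<=[ +-])(\d+)", 3)
; Plain version: the prefix is consumed, but only group 1 is returned
Local $aPlain = StringRegExp($sText, "[\s\+\-](\d+)", 3)

If IsArray($aLook) And IsArray($aPlain) And UBound($aLook) = UBound($aPlain) Then
    For $i = 0 To UBound($aLook) - 1
        ; Prints "12 | 12", "7 | 7", "42 | 42"
        ConsoleWrite($aLook[$i] & " | " & $aPlain[$i] & @CRLF)
    Next
Else
    ConsoleWrite("The two patterns returned different result sets" & @CRLF)
EndIf
```

One small difference that does exist: \s also accepts tabs and line breaks, while [ +-] accepts only a plain space.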

Edited by mikell

Sure. Because I am so weak with regex, lookbehind never even enters my mind as a possibility. I can't figure out why and when to use it until I figure out exactly what it is.



#7 ·  Posted (edited)

See http://www.pcre.org/original/doc/html/pcrepattern.html for more information about features and gory details.

EDIT: forgot to mention that AutoIt currently uses the "legacy" version of PCRE, now nicknamed PCRE1. As the PCRE main webpage explains, PCRE has been substantively rewritten as PCRE2. While 99.9% of the regexp features are compatible, some corner cases have been fixed or changed. The main changes are in the library interface functions. So do not refer to PCRE2 documentation until a new version of AutoIt is made available with explicit support for it.

Edited by jchd

@iamtheky :

To understand how lookaround assertions work, here is an example:

You have the string A123B456C789 and you want to capture each number enclosed by two letters (123 and 456).
What probably comes to mind is a regex like [A-Z](\d+)[A-Z]. In this case you get only one result (123), because the regex consumes the matched characters (A123B) and then continues searching from the position after the first match. What remains of the string is "456C789": the B has already been consumed, so 456 can no longer be seen as enclosed by 2 letters.

To stop the regex from consuming characters, you can use a lookaround assertion: [A-Z](\d+)(?=[A-Z])
With this regex, the letter after the number is not consumed. It means: take one letter, then a number, then look ahead to check that a letter follows. A lookaround does not consume characters, so the first match of this regex consumes only A123. The next search starts from B, so 456 is captured. The last search starts from C and fails, because 789 is not followed by a letter.

I could have used a lookbehind for this job instead: (?<=[A-Z])(\d+)[A-Z] works as well.
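
If you want to see the three patterns side by side in AutoIt, here is a minimal sketch, assuming flag 3 (return all matches) and _ArrayToString from Array.au3 to print each result on one line:

```autoit
#include <Array.au3>

Local $sText = "A123B456C789"
; Consuming version, lookahead version, lookbehind version
Local $aPatterns[] = ["[A-Z](\d+)[A-Z]", "[A-Z](\d+)(?=[A-Z])", "(?<=[A-Z])(\d+)[A-Z]"]

For $sPattern In $aPatterns
    ; Flag 3 returns every captured group as an array
    Local $aHits = StringRegExp($sText, $sPattern, 3)
    If @error Then
        ConsoleWrite($sPattern & " -> no match" & @CRLF)
    Else
        ConsoleWrite($sPattern & " -> " & _ArrayToString($aHits, ", ") & @CRLF)
    EndIf
Next
; [A-Z](\d+)[A-Z]       -> 123
; [A-Z](\d+)(?=[A-Z])   -> 123, 456
; (?<=[A-Z])(\d+)[A-Z]  -> 123, 456
```

The lookbehind version works even though it consumes the trailing letter, because a lookbehind may inspect characters that a previous match has already consumed.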

I hope this helps you understand.


#9 ·  Posted (edited)

But is there a larger problem you are solving where, in your case, you could not simply use:

\D?(\d+)\D

Or is that a lookaround as well?

Edited by iamtheky


\D? matches a single non-digit. If it matches, the non-digit is consumed, but because of the "?" it is optional, so the pattern also matches 123 when there is no letter before it (which is not what I wanted in my example).
Look at the difference between the two regexes here (the consumed characters appear in blue and green):

https://regex101.com/r/uG2lZ7/1
https://regex101.com/r/uG2lZ7/2

The first link works, the second does not.
It's not easy to explain; I spent a lot of time testing before I understood it.
 

 

Another example: check whether a string contains some desired words.
You have to check whether a string contains the words "iamtheky", "King" and "Regex"; the order does not matter.

^(?=.*iamtheky)(?=.*King)(?=.*Regex) : each lookahead assertion keeps the current position (the beginning of the string) and searches to the right. First it looks for .*iamtheky, then returns to the current position. If the word was found, the regex carries on (it looks to the right for .*King and returns again). If a word is not found, the regex fails.

So it matches both "iamtheky is the future King of Regex" and "The future Regex's King is iamtheky".
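
The same check as a minimal AutoIt sketch, assuming the default StringRegExp flag (returns 1 for a match, 0 otherwise). The third test string is added here only to show a failing case.

```autoit
; All three lookaheads start from the beginning of the string
Local $sPattern = "^(?=.*iamtheky)(?=.*King)(?=.*Regex)"

Local $aTests[] = ["iamtheky is the future King of Regex", _
        "The future Regex's King is iamtheky", _
        "iamtheky is the future King"] ; hypothetical failing case: no "Regex"

For $sTest In $aTests
    ; Prints 1 for the first two strings and 0 for the last one
    ConsoleWrite(StringRegExp($sTest, $sPattern) & " : " & $sTest & @CRLF)
Next
```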
 

 


Ah, I didn't realize it made it unnecessary.

