Jump to content
Sign in to follow this  
Mun

[Solved] RegEx Pattern Algorithm Assist.

Recommended Posts

Mun

I need a bit of help parsing few large excel files about 8mb each file filled with these variation.

I want to capture what in bold only. These are all narrow down variation. Most of the time wording are misspelled or abbreviated.

^(?:.*?)(\d+|\$\d+|\d+.\d+|\$\d+.\d+|\$\s\d+.\d+|\d+)

invoice 928.00 paid 880.00 pricing.

Invoice $ 35.20 Paid $ 31.12 Paid invoice per system pricing

inv 1681.00 pd 1575.00 no pay

Invoice $80.00 Paid $79.50 paid per g

(2012-10-08:61516 ) Invoice $ 218.50 Paid $ 164.30 Paid invoice per system pricing

inv 220.89 pd 212.10 paid per pricing less.

Invoiced Amt $76, Paid $64.48 - paid as per flat fee

Invice64.00 Paid 63.50 Paid per admin pricing

Invoiced: $32.00 Paid: $30.00

Inv. $136 Pd. $126 per flat rate of $50 for

Can anyone guide me down or point me to the right direction? I am willing to learn.

(Sorry didn't mean to hijack other thread.)

Edited by Mun

Share this post


Link to post
Share on other sites
Robjong

As ugly as the sample text, but it should work (for the sample at least):

$s = 'invoice 928.00 paid 880.00 pricing.' & @CRLF
$s &= 'Invoice $ 35.20 Paid $ 31.12 Paid invoice per system pricing' & @CRLF
$s &= 'inv 1681.00 pd 1575.00 no pay' & @CRLF
$s &= 'Invoice $80.00 Paid $79.50 paid per g' & @CRLF
$s &= '(2012-10-08:61516 ) Invoice $ 218.50 Paid $ 164.30 Paid invoice per system pricing' & @CRLF
$s &= 'inv 220.89 pd 212.10 paid per pricing less.' & @CRLF
$s &= 'Invoiced Amt $76, Paid $64.48 - paid as per flat fee' & @CRLF
$s &= 'Invice64.00 Paid 63.50 Paid per admin pricing' & @CRLF
$s &= 'Invoiced: $32.00 Paid: $30.00' & @CRLF
$s &= 'Inv. $136 Pd. $126 per flat rate of $50 for' & @CRLF

;~ $a = StringRegExp($s, '(?im)^.*\bin[voiced]{0,6}(?:\h+[a-z]+\b)?\W*?\K(?:\$\h*)?\d+(?:[,.]\d{1,2})?(?=.*\bp[aid]{1,3}\W*?(\$?\h*\d+(?:[,.]\d{1,2})?))', 4) ; strict as per the sample text
$a = StringRegExp($s, '(?im)^.*\bin[voiced]{0,6}(?:\h+[a-z]+\b)?\W*?\$?\h*\K\d+(?:[,.]\d{1,2})?(?=.*\bp[aid]{1,3}\W*?\$?\h*(\d+(?:[,.]\d{1,2})?))', 4) ; suggested pattern, does not capture $

For $i = 0 To UBound($a) - 1 Step 1
    $b = $a[$i]
    ConsoleWrite(StringFormat('Invoiced: %.2f Paid: %.2f\n', $b[0], $b[1]))
Next

To understand this pattern we break it down.

(?im)                     Options: i = case-insensetive matching,
                      m = ^ and $ also match the start of a line and the end of a line respectively.
                                    
^.*                   ^ matches the start of the line. .* matches any character except line feed
                      0 or more times. This anchors our match to a line.
                                    
\b                    \b matches a word boundary, which is a point between a non-word and word character,
                      this helps to avoid 'in' being part of another word.

in[voiced]{0,6}               'in' matches the start of the invoice-like word we want to precede the amount.
                      [voiced]{0,6} matches  v, o, i, c, e or d, 0 to 6 times.

(?:                   ( opens a group, ?: makes it non-capturing. This group allows one word after 'invoiced'.
    \h+[a-z]+             \h+ matches 1 or more horizontal space. [a-z]+ matches any letter, 1 or more times.
    \b                \b matches a word boundary, this makes sure that if this word is present one of the
                      following optional parts (\W*?, \$? or \h*) matches.
)?                    ) closes the group. ? makes it optional, match 0 or 1 times.

\W*?\$?\h*                \W* matches 0 or more non-word characters, ? makes the match lazy, as small as possible.
                      \$? optionally matches a $ dollar sign, \ escapes the dollar sign, $ is the dollar sign,
                      and again, the ? makes it optional. This allows the for spaces, dots and colons seen in
                      the sample text, as well as the dollar sign and space.
                                    
\K                    This is an interesting one, and it's not in the AutoIt help file (yet).
                      It resets our global match, everything matched so far in the global match is discarded.
                      Because we only match the first amount in the global match from here on, only the first  
                      amount will be in the global match (b[0]). We capture the second amount in a lookahead,
                      which does not end up in the global match. In short \K basically turns the part of the
                      pattern before it into a less restricted lookbehind.

\d+(?:[,.]\d{1,2})?              \d+ matches 1 or more digits. (?: open non-capturing group. [,.] matches comma or dot.
                      \d{1,2} matches 1 or 2 digits. ) closes the group. ? makes it optional.
                      This matches the first amount, e.g. 1, 23, 45.67 or 78,90.

(?=                   ( opens a group, ?= makes it a positive lookahead, meaning the subpattern must match the
                      subject ahead of this point.
    .*\b                  .* matches any character except line feed 0 or more times. \b matches a word boundary to
                      help avoid p (see below) being part of another word.
    p[aid]{1,3}           p[aid]{1,3} matches a, i or d 1 2 or 3 times.
    \W*?\$?\h*            Covered this already.
    (                 ( Opens a capturing group, which will capture the second amount (b[1]).
        \d+(?:[,.]\d{1,2})?      Covered this already.
    )                 Close capturing group.
)                     Close lookahead.
Edited by Robjong
  • Like 1

Share this post


Link to post
Share on other sites
Mun

Ok Robjong. How did you come up with that pattern?

I AM IMPRESSED!!!

Did you use any regex builder? That's some intertwine algorithm to extract info.

Share this post


Link to post
Share on other sites
PhoenixXL

How did you come up with that pattern?

wondering!! Very hard regex :)

Yet another attempt

$s = 'invoice 928.00 paid 880.00 pricing.' & @CRLF
$s &= 'Invoice $ 35.20 Paid $ 31.12 Paid invoice per system pricing' & @CRLF
$s &= 'inv 1681.00 pd 1575.00 no pay' & @CRLF
$s &= 'Invoice $80.00 Paid $79.50 paid per g' & @CRLF
$s &= '(2012-10-08:61516 ) Invoice $ 218.50 Paid $ 164.30 Paid invoice per system pricing' & @CRLF
$s &= 'inv 220.89 pd 212.10 paid per pricing less.' & @CRLF
$s &= 'Invoiced Amt $76, Paid $64.48 - paid as per flat fee' & @CRLF
$s &= 'Invice64.00 Paid 63.50 Paid per admin pricing' & @CRLF
$s &= 'Invoiced: $32.00 Paid: $30.00' & @CRLF
$s &= 'Inv. $136 Pd. $126 per flat rate of $50 for' & @CRLF

$s = StringRegExpReplace($s, '(?im)^[^i]+', '');Make the start from "I"
$a = StringRegExp($s, '(?m)^(?:[[:alpha:].,: $]+)([\d.]+)(?:[[:alpha:].,: $]+)([\d.]+)', 4)
If @error Then Exit @extended

For $i = 0 To UBound($a) - 1 Step 1
    $b = $a[$i]
    ConsoleWrite(StringFormat('Invoiced: %.2f     Paid: %.2f\n', $b[1], $b[2]))
Next
  • Like 1

My code:

PredictText: Predict Text of an Edit Control Like Scite. Remote Gmail: Execute your Scripts through Gmail. StringRegExp:Share and learn RegExp.

Run As System: A command line wrapper around PSEXEC.exe to execute your apps scripts as System (LSA). Database: An easier approach for _SQ_LITE beginners.

MathsEx: A UDF for Fractions and LCM, GCF/HCF. FloatingText: An UDF for make your text floating. Clipboard Extendor: A clipboard monitoring tool. 

Custom ScrollBar: Scroll Bar made with GDI+, user can use bitmaps instead. RestrictEdit_SRE: Restrict text in an Edit Control through a Regular Expression.

Share this post


Link to post
Share on other sites
Mun

I'm reading up and trying to learn from both of your expression. Both seems to use look ahead anchor starting point and group those number.

What does this expression does?

(?im)

[:alpha:] Look ahead for alpha?

$]+)([d.]+) Group digit but $ not include since it's being omit.

Am I on right path?

Lastly I tried to look at the array "$a" through _Arraydisplay, the array seem to be empty?

$b = $a[$i] How did $a[] pull out information?

*head scratched*

Share this post


Link to post
Share on other sites
Robjong

It's a pretty basic pattern, and I just wrote it, no special tools. I will explain the pattern later today when I have more time.

[                   opens a character class
    [:alpha:]    match uppercase and lowercase letters (POSIX character class)
    .,: $          matches dot, comma, semicolon, space or dollar sign.
]                   close the character class
(                   open capturing group
    [\d.]+       matches digits or dot, 1 or more times
)                   close capturing group

Look in the help file for StringRegExp, see flag 4 for return values, it returns an array with arrays, not a 2D array.

Edited by Robjong
  • Like 1

Share this post


Link to post
Share on other sites
PhoenixXL

Edit: Robjong explained it, :)

(?m) makes ^ to match the SOF as well as the the point after a Vertical Space, and $ to match the EOF as well as the point before the Vertical Space

(?i) makes case insensitive

both can be clubbed as (?im) or (?mi)

Edited by PhoenixXL

My code:

PredictText: Predict Text of an Edit Control Like Scite. Remote Gmail: Execute your Scripts through Gmail. StringRegExp:Share and learn RegExp.

Run As System: A command line wrapper around PSEXEC.exe to execute your apps scripts as System (LSA). Database: An easier approach for _SQ_LITE beginners.

MathsEx: A UDF for Fractions and LCM, GCF/HCF. FloatingText: An UDF for make your text floating. Clipboard Extendor: A clipboard monitoring tool. 

Custom ScrollBar: Scroll Bar made with GDI+, user can use bitmaps instead. RestrictEdit_SRE: Restrict text in an Edit Control through a Regular Expression.

Share this post


Link to post
Share on other sites
guinness

Robjong is back to claim their title. -_0


UDF List:

 
_AdapterConnections()_AlwaysRun()_AppMon()_AppMonEx()_ArrayFilter/_ArrayReduce_BinaryBin()_CheckMsgBox()_CmdLineRaw()_ContextMenu()_ConvertLHWebColor()/_ConvertSHWebColor()_DesktopDimensions()_DisplayPassword()_DotNet_Load()/_DotNet_Unload()_Fibonacci()_FileCompare()_FileCompareContents()_FileNameByHandle()_FilePrefix/SRE()_FindInFile()_GetBackgroundColor()/_SetBackgroundColor()_GetConrolID()_GetCtrlClass()_GetDirectoryFormat()_GetDriveMediaType()_GetFilename()/_GetFilenameExt()_GetHardwareID()_GetIP()_GetIP_Country()_GetOSLanguage()_GetSavedSource()_GetStringSize()_GetSystemPaths()_GetURLImage()_GIFImage()_GoogleWeather()_GUICtrlCreateGroup()_GUICtrlListBox_CreateArray()_GUICtrlListView_CreateArray()_GUICtrlListView_SaveCSV()_GUICtrlListView_SaveHTML()_GUICtrlListView_SaveTxt()_GUICtrlListView_SaveXML()_GUICtrlMenu_Recent()_GUICtrlMenu_SetItemImage()_GUICtrlTreeView_CreateArray()_GUIDisable()_GUIImageList_SetIconFromHandle()_GUIRegisterMsg()_GUISetIcon()_Icon_Clear()/_Icon_Set()_IdleTime()_InetGet()_InetGetGUI()_InetGetProgress()_IPDetails()_IsFileOlder()_IsGUID()_IsHex()_IsPalindrome()_IsRegKey()_IsStringRegExp()_IsSystemDrive()_IsUPX()_IsValidType()_IsWebColor()_Language()_Log()_MicrosoftInternetConnectivity()_MSDNDataType()_PathFull/GetRelative/Split()_PathSplitEx()_PrintFromArray()_ProgressSetMarquee()_ReDim()_RockPaperScissors()/_RockPaperScissorsLizardSpock()_ScrollingCredits_SelfDelete()_SelfRename()_SelfUpdate()_SendTo()_ShellAll()_ShellFile()_ShellFolder()_SingletonHWID()_SingletonPID()_Startup()_StringCompact()_StringIsValid()_StringRegExpMetaCharacters()_StringReplaceWholeWord()_StringStripChars()_Temperature()_TrialPeriod()_UKToUSDate()/_USToUKDate()_WinAPI_Create_CTL_CODE()_WinAPI_CreateGUID()_WMIDateStringToDate()/_DateToWMIDateString()Au3 script parsingAutoIt SearchAutoIt3 PortableAutoIt3WrapperToPragmaAutoItWinGetTitle()/AutoItWinSetTitle()CodingDirToHTML5FileInstallrFileReadLastChars()GeoIP databaseGUI - Only Close ButtonGUI ExamplesGUICtrlDeleteImage()GUICtrlGetBkColor()GUICtrlGetStyle()GUIEventsGUIGetBkColor()Int_Parse() & Int_TryParse()IsISBN()LockFile()Mapping CtrlIDsOOP in AutoItParseHeadersToSciTE()PasswordValidPasteBinPosts Per DayPreExpandProtect GlobalsQueue()Resource UpdateResourcesExSciTE JumpSettings INISHELLHOOKShunting-YardSignature CreatorStack()Stopwatch()StringAddLF()/StringStripLF()StringEOLToCRLF()VSCROLLWM_COPYDATAMore Examples...

Updated: 22/04/2018

Share this post


Link to post
Share on other sites
Mun

After I asked about flag 4 returns, I went to help file and read up on it.

That's smart to use flag 4 because you can keep the line index the same (and sub array for fetched info). Opened my mind* I always use flag 3 before.

As for expression,

I think I understood both expression now. I still think that is a cleverly impressive use of expression.

Share this post


Link to post
Share on other sites
Robjong

...use flag 4 because you can keep the line index the same (and sub array for fetched info)...

Exactly.

Added an explanation of the pattern to my as I said I would.

Edited by Robjong

Share this post


Link to post
Share on other sites
Mun

+1 Robjong for a detail explanation and thank you for your patience in teaching me.

Your explanation is so in plain English and make it seem too easy - that make me like wtf why couldn't I do that or thought of that. >_<

To be honest I learn a lot about regex within these 10 posts.

I have documented for save keeping.

Kudo to everyone here to spend your free time helping.

Share this post


Link to post
Share on other sites
Malkey

For this recently deceased but still warm topic.

Using Flag 3 and a tidy StringFormat layout.

$s = 'invoice 928.00 paid 880.00 pricing.' & @CRLF
$s &= 'Invoice $ 35.20 Paid $ 31.12 Paid invoice per system pricing' & @CRLF
$s &= 'inv 1681.00 pd 1575.00 no pay' & @CRLF
$s &= 'Invoice $80.00 Paid $79.50 paid per g' & @CRLF
$s &= '(2012-10-08:61516 ) Invoice $ 218.50 Paid $ 164.30 Paid invoice per system pricing' & @CRLF
$s &= 'inv 220.89 pd 212.10 paid per pricing less.' & @CRLF
$s &= 'Invoiced Amt $76, Paid $64.48 - paid as per flat fee' & @CRLF
$s &= 'Invice64.00 Paid 63.50 Paid per admin pricing' & @CRLF
$s &= '"Invoiced: $32.00 Paid: $30.00"' & @CRLF
$s &= 'Inv. $136 Pd. $126 per flat rate of $50 for' & @CRLF

;Positive leading and trailing characters that must be present to capture the required, in-between number.
;Leading characters are "[$a-z\h]", and, trailing characters are "[\ha-z'"",]"
;$a = StringRegExp($s, "(?:.+?[$a-z\h])(\d+\.?\d*)(?:[\ha-z'"",].+?[$a-z\h])(\d+\.?\d*)(?:[\ha-z'"",]?.*)", 3)

;or

;Negative leading and trailing characters that must not be present to capture the required, in-between number.
;Leading and trailing characters are "[^():\-]".
$a = StringRegExp($s, "(?:.+?[^():\-])(\d+\.?\d*)(?:[^():\-].+?[^():\-])(\d+\.?\d*)(?:[^():\-]?.*)", 3)

;Both regular expression pattern above capture the first two matching numbers in each line. A third matching number is ignored.
;The first matching number is the "Invoiced". The second matching number is the "Paid" value.

For $i = 0 To UBound($a) - 1 Step 2
    ConsoleWrite(StringFormat('Invoiced:%9s\tPaid: %8.2f\r\n', StringFormat("$%.2f", $a[$i]), $a[$i + 1]))
Next
  • Like 1

Share this post


Link to post
Share on other sites
PhoenixXL

With the spirit of flag 3

a replace can further shorten the code

;Original Code by Malkey
$s = 'invoice 928.00 paid 880.00 pricing.' & @CRLF
$s &= 'Invoice $ 35.20 Paid $ 31.12 Paid invoice per system pricing' & @CRLF
$s &= 'inv 1681.00 pd 1575.00 no pay' & @CRLF
$s &= 'Invoice $80.00 Paid $79.50 paid per g' & @CRLF
$s &= '(2012-10-08:61516 ) Invoice $ 218.50 Paid $ 164.30 Paid invoice per system pricing' & @CRLF
$s &= 'inv 220.89 pd 212.10 paid per pricing less.' & @CRLF
$s &= 'Invoiced Amt $76, Paid $64.48 - paid as per flat fee' & @CRLF
$s &= 'Invice64.00 Paid 63.50 Paid per admin pricing' & @CRLF
$s &= 'Invoiced: $32.00 Paid: $30.00' & @CRLF
$s &= 'Inv. $136 Pd. $126 per flat rate of $50 for' & @CRLF

$s = StringRegExpReplace($s, '(?im)^[^i]+', '');Make the start from "I"
$a = StringRegExp($s, '(?m)^(\D+)([\d.]+)(\D+)([\d.]+)', 3)

For $i = 0 To UBound($a) - 1 Step 4
ConsoleWrite(StringFormat('Invoiced: %.2f     Paid: %.2f\n', $a[$i + 1], $a[$i + 3]))
Next
  • Like 1

My code:

PredictText: Predict Text of an Edit Control Like Scite. Remote Gmail: Execute your Scripts through Gmail. StringRegExp:Share and learn RegExp.

Run As System: A command line wrapper around PSEXEC.exe to execute your apps scripts as System (LSA). Database: An easier approach for _SQ_LITE beginners.

MathsEx: A UDF for Fractions and LCM, GCF/HCF. FloatingText: An UDF for make your text floating. Clipboard Extendor: A clipboard monitoring tool. 

Custom ScrollBar: Scroll Bar made with GDI+, user can use bitmaps instead. RestrictEdit_SRE: Restrict text in an Edit Control through a Regular Expression.

Share this post


Link to post
Share on other sites
Mun

Thank you Malkey/PhoenixXL for the alternate pattern.

I think in this scenario I would rather use flag 4 to keep the index correct and I can look up reference invoices to match the pricing if I need to without wondering where the paid fall in that array.

This was a great lesson for me with the alternate flag 3 default and flag 4 and I have learn to appreciate the use of it.

You guys are tremendous teacher/helper.

Share this post


Link to post
Share on other sites

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now
Sign in to follow this  

×