JLogan3o13

Regex making my eyes bleed as usual

14 posts in this topic

It is no secret I suck hard when it comes to regex. Usually I can get by with String functions just fine, but I am struggling at the moment. Hoping someone out there can provide some regex guidance for a simple solution.

I am extracting all the text from a PDF file for manipulation. The text, when extracted, comes out like this:

Agency Delegated Admin Request BDC Name: John Doe
Request Date: 12/02/2014 Agency Name: FUG Insurance Inc. dba Claribell &
Taylor Ins. Agency Number: 123456
Agency State: TN Administrator Name: Lu Ann Smith Administrator Phone: 900-111-2222 Administrator Extension: 111 Administrator Email: myemail@ins.com Back-up Administrator Jim Jones

I am basically looking to pull each field (BDC Name, Request Date, Agency Name, etc.), but as the formatting is off I am not finding an easy way of capturing this. The length of each field is not going to be consistent, so String functions with StringLen are getting me nowhere. Is there a simple method of pulling the fields? I thought about pulling everything between the colons and then just removing the excess, so I would get this:

: John Doe Request Date
: 12/02/2014 Agency Name
: FUG Insurance Inc. dba Claribell & Taylor Ins. Agency Number
: 123456 Agency State
: TN Administrator Name
: Lu Ann Smith Administrator Phone
: 900-111-2222 Administrator Extension
: 111 Administrator Email
: myemail@ins.com Back-up Administrator Jim Jones

..and then would have to remove the next field's name from the string. But perhaps there is a better way that I am missing.


√-1 2^3 ∑ π, and it was delicious!


JLogan3o13,

Are the various heading texts (i.e. the words before the colon but after the previous data) always the same? :huh:

M23


Any of my own code posted anywhere on the forum is available for use by others without any restriction of any kind.

My UDFs:

Spoiler

ArrayMultiColSort ---- Sort arrays on multiple columns
ChooseFileFolder ---- Single and multiple selections from specified path treeview listing
Date_Time_Convert -- Easily convert date/time formats, including the language used
ExtMsgBox --------- A highly customisable replacement for MsgBox
GUIExtender -------- Extend and retract multiple sections within a GUI
GUIFrame ---------- Subdivide GUIs into many adjustable frames
GUIListViewEx ------- Insert, delete, move, drag, sort, edit and colour ListView items
GUITreeViewEx ------ Check/clear parent and child checkboxes in a TreeView
Marquee ----------- Scrolling tickertape GUIs
NoFocusLines ------- Remove the dotted focus lines from buttons, sliders, radios and checkboxes
Notify ------------- Small notifications on the edge of the display
Scrollbars ----------Automatically sized scrollbars with a single command
StringSize ---------- Automatically size controls to fit text
Toast -------------- Small GUIs which pop out of the notification area

 


Yes, they are.



JLogan3o13,

Then this seems to work:

$sString = "Agency Delegated Admin Request BDC Name: John Doe" & @CRLF & _
    "Request Date: 12/02/2014 Agency Name: FUG Insurance Inc. dba Claribell &" & @CRLF & _
    "Taylor Ins. Agency Number: 123456" & @CRLF & _
    "Agency State: TN Administrator Name: Lu Ann Smith Administrator Phone: 900-111-2222 Administrator Extension: 111 Administrator Email: myemail@ins.com Back-up Administrator: Jim Jones"

; Remove all @CRLF and then replace the headers with @CRLF
$sExtract = StringRegExpReplace(StringReplace($sString, @CRLF, " "), "Agency Delegated Admin Request BDC Name: |Request Date: |Agency Name: |Agency Number: |Agency State: |Administrator Name: |Administrator Phone: |Administrator Extension: |Administrator Email: |Back-up Administrator: ", @CRLF)

ConsoleWrite($sExtract & @CRLF)

No doubt a real guru will give you a better solution, but that should get you going. :)

M23
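
If you want the values in an array rather than as a block of text, a minimal sketch along these lines might help. It assumes $sExtract looks like the result of the replace above (one value per line, with a leading @CRLF and some trailing spaces). The name $aValues and the shortened sample string are made up for illustration only.

```autoit
#include <Array.au3>

; Stand-in for the $sExtract produced above (shortened sample, not real output)
Local $sExtract = @CRLF & "John Doe " & @CRLF & "12/02/2014 " & @CRLF & "123456 " & @CRLF & "TN"

; Trim the outer whitespace (including the leading @CRLF), then split on the
; whole @CRLF string (flag 1) without a count in element 0 (flag 2)
Local $aValues = StringSplit(StringStripWS($sExtract, 3), @CRLF, 3)

; Each value may still carry the space that preceded the next header
For $i = 0 To UBound($aValues) - 1
    $aValues[$i] = StringStripWS($aValues[$i], 3)
Next

_ArrayDisplay($aValues)
```

The values come out in the order they appear in the text, so you still need to know which header each position belongs to.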



(.+?)(?::\s*|\z)


Common sense plays a role in the basics of understanding AutoIt... If you're lacking in that, do us all a favor, and step away from the computer.


That is awesome, thanks Melba



Thanks, Smoke_N, I will try that out as well



#8 ·  Posted (edited)

Sorry, the vertical spaces were screwing it up, try this:  

"(?s)(?:(.+?)\:\s*|\z)" 

Edit:

You are going to have to replace the vertical spaces regardless to get a proper layout.

Local $aPatt = "(?s)(?:(.+?)\:\s*|\z)"
Local $aRegex = StringRegExp(StringRegExpReplace(ClipGet(), "\v+", " "), $aPatt, 3)
Edited by SmOke_N


Thanks. You are right, it will take some massaging, but I can definitely work with it.
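
For what it's worth, the massaging could look something like the sketch below. It assumes the header names are known and always appear in the same order, and it uses the pattern (?s)(.+?)(?::\s*|\z) so that the last value is captured as well. The names $aHeaders, $aParts and $aResult and the shortened sample text are only for illustration, and a value that itself contains a colon would throw the split off.

```autoit
#include <Array.au3>

; Shortened sample of the extracted text (assumption: real input has the same shape)
Local $sText = "Agency Delegated Admin Request BDC Name: John Doe" & @CRLF & _
    "Request Date: 12/02/2014 Agency Name: FUG Insurance Inc. dba Claribell &" & @CRLF & _
    "Taylor Ins. Agency Number: 123456"
Local $aHeaders[4] = ["Agency Delegated Admin Request BDC Name", "Request Date", "Agency Name", "Agency Number"]

; Flatten the line breaks, then cut the text at every colon (or at the end of the string)
Local $aParts = StringRegExp(StringRegExpReplace($sText, "\v+", " "), "(?s)(.+?)(?::\s*|\z)", 3)
If @error Or UBound($aParts) <> UBound($aHeaders) + 1 Then
    ConsoleWrite("Unexpected number of fields" & @CRLF)
    Exit
EndIf

; $aParts[0] is the first header; every later part is "value + next header",
; except the last one, which is only a value
Local $aResult[UBound($aHeaders)][2]
Local $sValue, $iLen
For $i = 0 To UBound($aHeaders) - 1
    $sValue = $aParts[$i + 1]
    If $i < UBound($aHeaders) - 1 Then
        $iLen = StringLen($aHeaders[$i + 1])
        ; Only cut when the part really ends with the next header
        If StringRight($sValue, $iLen) = $aHeaders[$i + 1] Then $sValue = StringTrimRight($sValue, $iLen)
    EndIf
    $aResult[$i][0] = $aHeaders[$i]
    $aResult[$i][1] = StringStripWS($sValue, 3)
Next

_ArrayDisplay($aResult)
```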



#10 ·  Posted (edited)

#Include <Array.au3>

Local $str = "Agency Delegated Admin Request BDC Name: John Doe" & @CRLF & _
    "Request Date: 12/02/2014 Agency Name: FUG Insurance Inc. dba Claribell &" & @CRLF & _
    "Taylor Ins. Agency Number: 123456" & @CRLF & _
    "Agency State: TN Administrator Name: Lu Ann Smith Administrator Phone: 900-111-2222 Administrator Extension: 111 Administrator Email: myemail@ins.com Back-up Administrator: Jim Jones"

Local $res[10][2], $txt = $str  ;FileRead("1.txt")
Local $items[11] = ["Agency Delegated Admin Request BDC Name", "Request Date", "Agency Name", "Agency Number", "Agency State", "Administrator Name", "Administrator Phone", "Administrator Extension", "Administrator Email", "Back-up Administrator", "\z"]
For $i = 0 to 9
     $res[$i][0] = $items[$i]
     $res[$i][1] = StringRegExpReplace($txt, '(?s).*' & $items[$i] & ':\s*([^\r\n]+)\R?([^\r\n]+)?\s*' & $items[$i+1] & '.*', "$1$2")
Next

 _ArrayDisplay($res)

:)

Edit

This one will work even in case of missing info(s)

Edited by mikell


BTW this can be done with String* funcs and without regex  :)

#Include <Array.au3>

Local $str = "Agency Delegated Admin Request BDC Name: John Doe" & @CRLF & _
    "Request Date: 12/02/2014 Agency Name: FUG Insurance Inc. dba Claribell &" & @CRLF & _
    "Taylor Ins. Agency Number: 123456" & @CRLF & _
    "Agency State: TN Administrator Name: Lu Ann Smith Administrator Phone: 900-111-2222 Administrator Extension: 111 Administrator Email: myemail@ins.com Back-up Administrator: Jim Jones"

Local $res[10][2], $txt = $str  ;FileRead("1.txt")
Local $items[10] = ["Agency Delegated Admin Request BDC Name", "Request Date", "Agency Name", "Agency Number", "Agency State", "Administrator Name", "Administrator Phone", "Administrator Extension", "Administrator Email", "Back-up Administrator"]

$txt = StringReplace($txt, @crlf, " ")
For $i = 1 to 9
   $txt = StringReplace($txt, $items[$i], @crlf & $items[$i])
Next
Msgbox(0,"", $txt)
$lines = StringSplit($txt, @crlf, 1)
 _ArrayDisplay($lines)
For $i = 1 to $lines[0]
    $tmp = StringSplit($lines[$i], ": ", 1) 
    $res[$i-1][0] = $tmp[1]
    $res[$i-1][1] = $tmp[2]
Next
 _ArrayDisplay($res)


Yes, I did much the same, mikell: StringSplit each line and then capture my content from there. But the PDFs are large and it was getting unwieldy.

Thanks, all, for the direction. I believe I've found what will work best.



#14 ·  Posted (edited)

Another way :

#Include <Array.au3>

Local $str = "Agency Delegated Admin Request BDC Name: John Doe" & @CRLF & _
    "Request Date: 12/02/2014 Agency Name: FUG Insurance Inc. dba Claribell &" & @CRLF & _
    "Taylor Ins. Agency Number: 123456" & @CRLF & _
    "Agency State: TN Administrator Name: Lu Ann Smith Administrator Phone: 900-111-2222 Administrator Extension: 111 Administrator Email: myemail@ins.com Back-up Administrator: Jim Jones"

Local $ret = StringRegExp(StringRegExpReplace($str, "\R", " "), "(?s)(Agency Delegated Admin Request BDC Name|Request Date|Agency Name|Agency Number|Agency State|Administrator Name|Administrator Phone|Administrator Extension|Administrator Email|Back-up Administrator): (.+?)\h*(?=(?1)|$)", 3)
_ArrayDisplay($ret)


; $ret2D = _Array1DTo2D($ret, 2) ; http://www.autoitscript.com/forum/topic/165600-array1dto2d/
; _ArrayDisplay($ret2D)



; Convert the flat [header, value, header, value, ...] array into a 2D array
Local $aResult[UBound($ret) / 2][2]
For $i = 0 To UBound($ret) - 1 Step 2
    $aResult[$i / 2][0] = $ret[$i]
    $aResult[$i / 2][1] = $ret[$i + 1]
Next
_ArrayDisplay($aResult)
Edited by jguinch
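
If the header list changes often, the alternation in the pattern could be built from an array instead of being typed out by hand. This is a rough sketch, assuming the header names contain no regex metacharacters (true for the ones in this thread). $aHeaders and $sPattern are names made up for the example, and the sample text is shortened.

```autoit
#include <Array.au3>

; Shortened sample (assumption: real input has the same "Header: value" shape)
Local $str = "Agency State: TN Administrator Name: Lu Ann Smith" & @CRLF & _
    "Administrator Phone: 900-111-2222"
Local $aHeaders[3] = ["Agency State", "Administrator Name", "Administrator Phone"]

; Join the headers with "|" to form capture group 1;
; (?1) inside the lookahead reuses that same group as a subroutine
Local $sPattern = "(?s)(" & _ArrayToString($aHeaders, "|") & "): (.+?)\h*(?=(?1)|$)"

Local $ret = StringRegExp(StringRegExpReplace($str, "\R", " "), $sPattern, 3)
If @error Then
    ConsoleWrite("No fields found" & @CRLF)
    Exit
EndIf

; $ret should be a flat array alternating header and value
_ArrayDisplay($ret)
```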
