Sign in to follow this  
Followers 0
Danp2

Regex question

14 posts in this topic

I could use some help with a Regex pattern. I am attempting to extract some names and addresses (US format) from some free form text. Examples:

Jane Doe
PO Box 1234
Pensacola, FL 32524


Jane Doe
123 Sunset Blvd
Apt 456
Pensacola, FL 32524


Jane Doe
555 Any Street
Pensacola, FL 32524-1234

Second address line is optional. There are instances where the city or state may be missing. I came up with the following pattern through research / trial & error, but it isn't working exactly as I desire:

(.+)(?>\r\n)(.+)(?>\r\n)(.+)(?>\r\n)?(.+)?,\h+(\w+)? ([-\d]+)

When the second address line is missing, the value for city ends up in the wrong place. Suggestions on how I can fix / improve the pattern?

TIA, Dan

 

Share this post


Link to post
Share on other sites



Danp2,

First rule when asking for SRE help - give an example of what you want as the result. ;)

Each of those addresses has a second line, so the "when second line is missing" comment is not a great deal of help. What exactly do you want to see returned from the RegEx? :huh:

M23


Any of my own code posted anywhere on the forum is available for use by others without any restriction of any kind._______My UDFs:

Spoiler

ArrayMultiColSort ---- Sort arrays on multiple columns
ChooseFileFolder ---- Single and multiple selections from specified path treeview listing
Date_Time_Convert -- Easily convert date/time formats, including the language used
ExtMsgBox --------- A highly customisable replacement for MsgBox
GUIExtender -------- Extend and retract multiple sections within a GUI
GUIFrame ---------- Subdivide GUIs into many adjustable frames
GUIListViewEx ------- Insert, delete, move, drag, sort, edit and colour ListView items
GUITreeViewEx ------ Check/clear parent and child checkboxes in a TreeView
Marquee ----------- Scrolling tickertape GUIs
NoFocusLines ------- Remove the dotted focus lines from buttons, sliders, radios and checkboxes
Notify ------------- Small notifications on the edge of the display
Scrollbars ----------Automatically sized scrollbars with a single command
StringSize ---------- Automatically size controls to fit text
Toast -------------- Small GUIs which pop out of the notification area

 

Share this post


Link to post
Share on other sites

Hey Melba,

Sorry for being vague. When I said second line, I'm actually meant the optional address line (Apt 456 in the second example). I'm trying to extract the following elements: name, address1, address2, city, state, and zip.

Dan

Share this post


Link to post
Share on other sites

This appears to mostly work, although the 3rd group is capturing the CRLF.

(.+)(?>\R)(.+)(?>\R)((?:.+)(?>\R))?(.+)?,\h+(\w+)?\h([-\d]+)

Share this post


Link to post
Share on other sites

Danp2,

I would not bother with an SRE for this - just split the address and read the required lines: :)

#include <StringConstants.au3>

Global $aAddress[3] = ["Jane Doe" & @CRLF & "PO Box 1234" & @CRLF & "Pensacola, FL 32524", _
                "Jane Doe" & @CRLF & "123 Sunset Blvd" & @CRLF & "Apt 456" & @CRLF & "Pensacola, FL 32524", _
                "Jane Doe" & @CRLF & "555 Any Street" & @CRLF & "Pensacola, FL 32524-1234"]

For $i = 0 To 2
    $aSplit = StringSplit($aAddress[$i], @CRLF, $STR_ENTIRESPLIT)
    ; Combine the first 2 and last lines
    $sExtract = $aSplit[1] & @CRLF & $aSplit[2] & @CRLF & $aSplit[$aSplit[0]]
    ConsoleWrite($sExtract & @CRLF & @CRLF)
Next
Now wait for a real SRE guru to show you how it should be done! ;)

M23


Any of my own code posted anywhere on the forum is available for use by others without any restriction of any kind._______My UDFs:

Spoiler

ArrayMultiColSort ---- Sort arrays on multiple columns
ChooseFileFolder ---- Single and multiple selections from specified path treeview listing
Date_Time_Convert -- Easily convert date/time formats, including the language used
ExtMsgBox --------- A highly customisable replacement for MsgBox
GUIExtender -------- Extend and retract multiple sections within a GUI
GUIFrame ---------- Subdivide GUIs into many adjustable frames
GUIListViewEx ------- Insert, delete, move, drag, sort, edit and colour ListView items
GUITreeViewEx ------ Check/clear parent and child checkboxes in a TreeView
Marquee ----------- Scrolling tickertape GUIs
NoFocusLines ------- Remove the dotted focus lines from buttons, sliders, radios and checkboxes
Notify ------------- Small notifications on the edge of the display
Scrollbars ----------Automatically sized scrollbars with a single command
StringSize ---------- Automatically size controls to fit text
Toast -------------- Small GUIs which pop out of the notification area

 

Share this post


Link to post
Share on other sites

I mentioned the free form text, but failed to add that the text contains other details unrelated to the address I am attempting to capture. It's essentially a series of form letters where the data has already been merged, and I'm trying to capture the separate portions of the mailing address.

Share this post


Link to post
Share on other sites

Having fought with international addresses for a long time, I wouldn't be surprised if even addresses in "US format" would prove very hard to deal with in general. Sure one can come up with a fairly good pattern but I strongly suspect that you'll get exceptions that would ruin your efforts. Here a more pedestrian approach is best, AFAICT.


This wonderful site allows debugging and testing regular expressions (many flavors available). An absolute must have in your bookmarks.
Another excellent RegExp tutorial. Don't forget downloading your copy of up-to-date pcretest.exe and pcregrep.exe here
RegExp tutorial: enough to get started
PCRE v8.33 regexp documentation latest available release and currently implemented in AutoIt beta.

SQLitespeed is another feature-rich premier SQLite manager (includes import/export). Well worth a try.
SQLite Expert (freeware Personal Edition or payware Pro version) is a very useful SQLite database manager.
An excellent eBook covering almost every aspect of SQLite3: a must-read for anyone doing serious work.
SQL tutorial (covers "generic" SQL, but most of it applies to SQLite as well)
A work-in-progress SQLite3 tutorial. Don't miss other LxyzTHW pages!
SQLite official website with full documentation (may be newer than the SQLite library that comes standard with AutoIt)

Share this post


Link to post
Share on other sites

Danp2,

Second rule (applicable to all requests for help) - give all the necessary data when posing the question so people do not waste their time coding solutions which are obviously not going to work. :mad2:

M23


Any of my own code posted anywhere on the forum is available for use by others without any restriction of any kind._______My UDFs:

Spoiler

ArrayMultiColSort ---- Sort arrays on multiple columns
ChooseFileFolder ---- Single and multiple selections from specified path treeview listing
Date_Time_Convert -- Easily convert date/time formats, including the language used
ExtMsgBox --------- A highly customisable replacement for MsgBox
GUIExtender -------- Extend and retract multiple sections within a GUI
GUIFrame ---------- Subdivide GUIs into many adjustable frames
GUIListViewEx ------- Insert, delete, move, drag, sort, edit and colour ListView items
GUITreeViewEx ------ Check/clear parent and child checkboxes in a TreeView
Marquee ----------- Scrolling tickertape GUIs
NoFocusLines ------- Remove the dotted focus lines from buttons, sliders, radios and checkboxes
Notify ------------- Small notifications on the edge of the display
Scrollbars ----------Automatically sized scrollbars with a single command
StringSize ---------- Automatically size controls to fit text
Toast -------------- Small GUIs which pop out of the notification area

 

Share this post


Link to post
Share on other sites

I tried to be detailed in my post, but obviously I left out some critical details. My bad. :geek:

Share this post


Link to post
Share on other sites

A possible way could be to get rid of all the 'unwanted details', it depends a lot on the shape of the original text

Otherwise considered jchd's comment this looks like an impossible task

Share this post


Link to post
Share on other sites

Danp2,

Following on from mikell's comment: is there any way you can distinguish the address lines from the rest of the text? Are they the only lines surrounded by blank lines? Are they in a specific place in the document? Do they have an indentation? Anything of that sort would help extract those particular lines for further analysis. :huh:

M23


Any of my own code posted anywhere on the forum is available for use by others without any restriction of any kind._______My UDFs:

Spoiler

ArrayMultiColSort ---- Sort arrays on multiple columns
ChooseFileFolder ---- Single and multiple selections from specified path treeview listing
Date_Time_Convert -- Easily convert date/time formats, including the language used
ExtMsgBox --------- A highly customisable replacement for MsgBox
GUIExtender -------- Extend and retract multiple sections within a GUI
GUIFrame ---------- Subdivide GUIs into many adjustable frames
GUIListViewEx ------- Insert, delete, move, drag, sort, edit and colour ListView items
GUITreeViewEx ------ Check/clear parent and child checkboxes in a TreeView
Marquee ----------- Scrolling tickertape GUIs
NoFocusLines ------- Remove the dotted focus lines from buttons, sliders, radios and checkboxes
Notify ------------- Small notifications on the edge of the display
Scrollbars ----------Automatically sized scrollbars with a single command
StringSize ---------- Automatically size controls to fit text
Toast -------------- Small GUIs which pop out of the notification area

 

Share this post


Link to post
Share on other sites

If you use option 4, in stringregexp, you can then fix the ubound of the array of arrays as the city, state, zip, and depending on the size of the array of arrays, you can infer if an address2 exists.


IEbyXPATH-Grab IE DOM objects by XPATH IEscriptRecord-Makings of an IE script recorder ExcelFromXML-Create Excel docs without excel installed GetAllWindowControls-Output all control data on a given window.

Share this post


Link to post
Share on other sites

Yes, they are surrounded by blank lines. They are in the same general vicinity for each letter, but the positioning may vary slightly at times.

I believe that I can make it work using the SRE I posted earlier. Thanks for the help!

Share this post


Link to post
Share on other sites

The best thing I ever learned about RegEx was knowing when not to use it. This is one of those places where a RegEx isn't going to help all that much, and may in fact take far longer to develop than just a FileReadToArray and parsing each line.


If I posted any code, assume that code was written using the latest release version unless stated otherwise. Also, if it doesn't work on XP I can't help with that because I don't have access to XP, and I'm not going to.
Give a programmer the correct code and he can do his work for a day. Teach a programmer to debug and he can do his work for a lifetime - by Chirag Gude
How to ask questions the smart way!

I hereby grant any person the right to use any code I post, that I am the original author of, on the autoitscript.com forums, unless I've specifically stated otherwise in the code or the thread post. If you do use my code all I ask, as a courtesy, is to make note of where you got it from.

Back up and restore Windows user files _Array.au3 - Modified array functions that include support for 2D arrays.  -  ColorChooser - An add-on for SciTE that pops up a color dialog so you can select and paste a color code into a script.  -  Customizable Splashscreen GUI w/Progress Bar - Create a custom "splash screen" GUI with a progress bar and custom label.  -  _FileGetProperty - Retrieve the properties of a file  -  SciTE Toolbar - A toolbar demo for use with the SciTE editor  -  GUIRegisterMsg demo - Demo script to show how to use the Windows messages to interact with controls and your GUI.  -   Latin Square password generator

Share this post


Link to post
Share on other sites

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!


Register a new account

Sign in

Already have an account? Sign in here.


Sign In Now
Sign in to follow this  
Followers 0

  • Similar Content

    • ISI360
      By ISI360
      Hi!

      I need a little bit help from some RegEx experts please:
      I would make my ISN AutoIt Studio faster when generating the scripttree. And what would be better to do this via regex?
      Problem is i am not really good at this regex stuff. So maybe someone could help me here.
       
      The challange is to get all Global Variables from a script via RegEx in a Array.
      Here is a example script with some tests:
      Global $Var1 = 1234 Local $Local_Var = 1234 $Ignore_me_too = 1234 Global $Var2 = 1234, $var3 = 1242 Global $ahIcons[30], $ahLabels[30] Global Const $Var4 = iniread($inivar1,"jj","jj","") , $var5= iniread($inivar2,"jj","jj","") Global $Var_String = "was" Global $Array_Test[16] = [1,15,16,0,31,15,25,15,25,30,8,30,8,15,1,15] Global Enum $MARGIN_SCRIPT_NUMBER = 0, $MARGIN_SCRIPT_ICON, $MARGIN_SCRIPT_FOLD Global Const $Delim = '\', $Delim1 = '|' Global $hard1 = "a", _ $hard2 = "b", _ $hard3 = "c"  
      The returning array should look like this:
      $Var1 $Var2 $var3 $Var4 $var5 $Var_String $Array_Test $MARGIN_SCRIPT_NUMBER $MARGIN_SCRIPT_ICON $MARGIN_SCRIPT_FOLD $Delim $Delim1 $hard1 $hard2 $hard3  
      I already made some success with a expression i found in the SciTE Jump Tool:  (\$\w+)(?:[\h\[.=+*/^,)\-])?
      This nearly returns the perfect results. But it does not check if it´s a global variable (with the const and enum options) and also returns variables in commands (for example $inivar1)
      I also found this regex: (?im:^(?=Global|Const|Enum|Static)(?:Global)?\h*(?:Const|Enum|Static)?(?:(?<=Enum)\h+Step\h+[+*-]\d+)?\h*)([^\r\n .\=]+)
      This returns also usefull results...but trying to understand this explodes my head

      Maybe someone can help me here?
      Thanks in advance!
    • TheAutomator
      By TheAutomator
      Can anyone tell me why this isn't working?..
      #include <array.au3> $regexp = StringRegExp("test 'a b c'", "'([^']|'')*'|\S+", 3) _ArrayDisplay($regexp) trying to split this "test 'a b c'  'some other '' test'' ...'" into:
      0: test
      1: 'a b c'
      2: ...
      but it gives me:
      0: test
      1: c
    • anthonyjr2
      By anthonyjr2
      Hi guys,
      I am pretty bad with regex, and am having some trouble trying to come up with an expression for a certain type of string. Basically I want to be able to tell if a string is of the format:
      AA#####A
      Where the A's are any letter from A-Z and the #'s are any digit from 0-9.
      I've been playing around with a regex tester online for a while but I can't really seem to grasp the concept very well. Could anyone give me any tips?
      This isn't exactly an AutoIt specific question which is why I didn't post it in General Help & Support.
    • tezhihi
      By tezhihi
      I have a file (see attached file) with a string all line and this problem on here is I want to separate all $00:, $03:, $10:, $20:, $25:, $30:, $40:, $45:, $110:, $115:, $120: and $T. It's mean that each $ with value start a new line ( a new paragraph). I tried with Regular Expression in notepad++ ex:
      Find ($00:, $01:, $03: and so on) with regex (\$)([0-9]+): and replace is \r\n\1\2 (I think \r\n is @CRLF (not sure :() ) Find $T with regex (\$T)(.*?)(\$T) and replace is \1\2\r\n\3 When I try these regex to replace in notepad on StringRegexReplace the results is incorrect . I have read some example simple about regex. Please advise me how to do that with some example on autoit . The result will be in attached photo. Thanks 
      ahihi.txt

    • MyEarth
      By MyEarth
      Hello, i need to validate a string can be different things. I just need a True - False return value, no groups or things like that. It will be always one line at time to be processed by StringRegEx
      Valid:
      13:52|String
      02:52 XX|String
      13:52~SUN, MON, TUE, WED, THU, FRI, SAT|String
      02:52 XX~SUN, MON, FRI|String
      22/04/2017 13:52|String
      22/04/2017 02:52 YY|String
      Not Valid
      22/04/2017 13:52~Dom|String
      I need to validate until and inclusively the | after that i don't care
      The XX and YY value are two $sVariable from my script
      SUN, MON, TUE, WED, THU, FRI, SAT are fixed value, the can be mixed but always in the same order like
      SUN
      SUN, TUE, WED
      SUN, SAT 
      The time can be 12 or 24 hours, the date is always in the same format DD/MM/YYYY. If there is a date can't be a day after that ( see not valid )
      Well i think is all
      Sorry if i don't provide a working code, regex is too way complex.
      Thanks