tobject

Parsing Mailing Address?

11 posts in this topic

#1 ·  Posted (edited)

I have a bunch of text lines that contain addresses.

I want to parse the addresses out somehow.

I thought about pushing the whole string to Google Maps, but it does not understand it! Yikes!

Any other ideas?

held October 4, 2010 To the Shareholders of Universal Security Instruments, Inc.: The Annual Meeting of Shareholders of Universal Security Instruments, Inc., a Maryland corporation (the “Company”) will be held at the Hilton Pikesville, 1726 Reisterstown Road, Pikesville, Maryland, on Monday, October 4, 2010 at 8:30 a.m., local time, for the following purposes: 1.To elect two dir

held on Wednesday, September 8, 2010 at 2:30 p.m. at the Palais De Beaulieu, Rome Room, in Lausanne, Switzerland. Enclosed is the Invitation and Proxy Statement for the meeting, which includes an agenda and discussion of the items to be voted on at the meeting, information on how you can exercise your voting rights, information concerning Logitech’s compensation of its Board members and e

HELD ON MONDAY, SEPTEMBER 13, 2010 NOTICE IS HEREBY GIVEN that the Annual Meeting of Stockholders of OPNET Technologies, Inc. will be held at our principal executive offices, 7255 Woodmont Avenue, Bethesda, Maryland 20814, on Monday, September 13, 2010 at 10:00 a.m., local time (the “Annual Meeting”), for the purpose of considering and voting upon the following matters: 1.To elect one Clas

held on Monday, September 13, 2010 To the Shareholders of ePlus inc.: The Annual Meeting of Shareholders of ePlus inc., a Delaware corporation, will be held on September 13, 2010, at the Hyatt Regency, 1800 Presidents Street, Reston, Virginia, 20190 at 8:00 a.m. local time for the purposes stated below: 1.To elect directors named in the attached proxy statement, each to se

held at the offices of the Company, 470 East Paces Ferry Road, N.E., Atlanta, Georgia, on Monday, August 16, 2010 at 4:00 p.m. for the following purposes: 1.To elect seven directors of the Company, three of whom will be elected by the holders of Class A Common Shares and four of whom will be elected by the holders of Class B Common Shares. 2.To approve the adoption of the Company’s 2

Edited by tobject




#2 ·  Posted (edited)

Gee, this is a toughie. The only thing I can think of is starting at the end and working your way back to the start. Armed with a database of countries, cities, towns, and/or zip codes, try to match each word until you find what might be the last line of an address. Then try to identify the rest of the address, using the commas as markers for each line. I'm not sure exactly how you would do this, but there are some typical markers such as street (St), road (Rd), avenue (Ave), boulevard, place, etc.

I don't think this is at all easy, but you may be able to combine this approach with your original idea. Interesting project. :blink:

One more comment: Look how many times the word 'at' appears in the examples you gave =>

at the Hilton Pikesville, 1726 Reisterstown Road, Pikesville, Maryland, on Monday

at the Palais De Beaulieu, Rome Room, in Lausanne, Switzerland.

at our principal executive offices, 7255 Woodmont Avenue, Bethesda, Maryland 20814, on Monday, September 13, 2010 at 10:00 a.m., local time

at the Hyatt Regency, 1800 Presidents Street, Reston, Virginia, 20190 at 8:00 a.m.

at the offices of the Company, 470 East Paces Ferry Road, N.E., Atlanta, Georgia, on Monday, August 16, 2010 at 4:00 p.m.
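The "start at the end and work back" idea could be sketched roughly as follows. This is only an illustration under assumptions: `$sPlaces` is a tiny made-up stand-in for a real database of states and countries, and the rule "an address starts at the part that begins with a street number" is a guess that will not hold for every address (a part such as "Maryland 20814" would not match the list as written).

```autoit
; Stand-in for a real database of states and countries (hypothetical, tiny on purpose).
Local $sPlaces = "|Maryland|Virginia|Georgia|Switzerland|"
Local $sSegment = "at the Hilton Pikesville, 1726 Reisterstown Road, Pikesville, Maryland, on Monday"
Local $aParts = StringSplit($sSegment, ",") ; $aParts[0] holds the number of parts

; Walk backwards to find the part that is a known state or country.
Local $iEnd = 0
For $i = $aParts[0] To 1 Step -1
    If StringInStr($sPlaces, "|" & StringStripWS($aParts[$i], 3) & "|") Then
        $iEnd = $i
        ExitLoop
    EndIf
Next

If $iEnd = 0 Then
    ConsoleWrite("No known place name found" & @CRLF)
Else
    ; Keep walking back until a part starts with a street number.
    Local $iStart = $iEnd
    For $i = $iEnd - 1 To 1 Step -1
        If StringRegExp(StringStripWS($aParts[$i], 3), "^\d+\s") Then
            $iStart = $i
            ExitLoop
        EndIf
    Next
    ; Join the parts from the street line to the place name.
    Local $sAddress = ""
    For $i = $iStart To $iEnd
        If $sAddress <> "" Then $sAddress &= ", "
        $sAddress &= StringStripWS($aParts[$i], 3)
    Next
    ConsoleWrite($sAddress & @CRLF)
EndIf
```

For the sample segment this should print "1726 Reisterstown Road, Pikesville, Maryland"; the venue name in front of the street line is dropped by design.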

Edited by czardas


#3 ·  Posted (edited)

I'm sure this would fail somewhere ... but you should get the gist of how to fix it if it does.

#include <Array.au3>; Just for _ArrayDisplay

#region the important data
Global $s_sre_address = "[\w ]+\s*(?:Road|Drive|Avenue|Street)?"

; Dots are escaped so that "N." matches a literal period rather than any character
Global $s_sre_direction = "(?:,\s*(?:N\.E\.|N\.W\.|S\.E\.|S\.W\.|N\.|S\.|W\.|E\.))?"

Global $s_sre_city = ",\s*(?:[A-Z ]+)"

Global $s_sre_states = ",\s*(?:"
$s_sre_states &= "Alabama|Alaska|Arizona|Arkansas|California|"
$s_sre_states &= "Colorado|Connecticut|Delaware|District of Columbia|Florida|"
$s_sre_states &= "Georgia|Hawaii|Idaho|Illinois|Indiana|Iowa|Kansas|Kentucky|"
$s_sre_states &= "Louisiana|Maine|Maryland|Massachusetts|Michigan|Minnesota|"
$s_sre_states &= "Mississippi|Missouri|Montana|Nebraska|Nevada|New Hampshire|"
$s_sre_states &= "New Jersey|New Mexico|New York|North Carolina|North Dakota|"
$s_sre_states &= "Ohio|Oklahoma|Oregon|Pennsylvania|Rhode Island|South Carolina|"
$s_sre_states &= "South Dakota|Tennessee|Texas|Utah|Vermont|Virginia|Washington|"
$s_sre_states &= "West Virginia|Wisconsin|Wyoming)"

Global $s_sre_zip = "(?:,\s*\d{5}(?:\s*-\s*\d{4})?)?"

Global $s_sre_pattern = "(?i)(?s),\s*(" & $s_sre_address & $s_sre_direction & $s_sre_city & $s_sre_states & $s_sre_zip & ")"
#endregion the important data

Global $s_test_str = ""
$s_test_str &= "held October 4, 2010 To the Shareholders of Universal Security Instruments, Inc.: The Annual Meeting"
$s_test_str &= " of Shareholders of Universal Security Instruments, Inc., a Maryland corporation (the “Company”) wil"
$s_test_str &= "l be held at the Hilton Pikesville, 1726 Reisterstown Road, Pikesville, Maryland, on Monday, October"
$s_test_str &= " 4, 2010 at 8:30 a.m., local time, for the following purposes: 1.To elect two dirheld on Wednesday, "
$s_test_str &= "September 8, 2010 at 2:30 p.m. at the Palais De Beaulieu, Rome Room, in Lausanne, Switzerland. Enclo"
$s_test_str &= "sed is the Invitation and Proxy Statement for the meeting, which includes an agenda and discussion o"
$s_test_str &= "f the items to be voted on at the meeting, information on how you can exercise your voting rights, i"
$s_test_str &= "nformation concerning Logitech’s compensation of its Board members and eHELD ON MONDAY, SEPTEMBER 13"
$s_test_str &= ", 2010 NOTICE IS HEREBY GIVEN that the Annual Meeting of Stockholders of OPNET Technologies, Inc. wi"
$s_test_str &= "ll be held at our principal executive offices, 7255 Woodmont Avenue, Bethesda, Maryland 20814, on Mo"
$s_test_str &= "nday, September 13, 2010 at 10:00 a.m., local time (the “Annual Meeting”), for the purpose of consid"
$s_test_str &= "ering and voting upon the following matters: 1.To elect one Clasheld on Monday, September 13, 2010 T"
$s_test_str &= "o the Shareholders of ePlus inc.: The Annual Meeting of Shareholders of ePlus inc., a Delaware corpo"
$s_test_str &= "ration, will be held on September 13, 2010, at the Hyatt Regency, 1800 Presidents Street, Reston, Vi"
$s_test_str &= "rginia, 20190 at 8:00 a.m. local time for the purposes stated below: 1.To elect directors named in t"
$s_test_str &= "he attached proxy statement, each to seheld at the offices of the Company, 470 East Paces Ferry Road"
$s_test_str &= ", N.E., Atlanta, Georgia, on Monday, August 16, 2010 at 4:00 p.m. for the following purposes: 1.To e"
$s_test_str &= "lect seven directors of the Company, three of whom will be elected by the holders of Class A Common "
$s_test_str &= "Shares and four of whom will be elected by the holders of Class B Common Shares. 2.To approve the ad"
$s_test_str &= "option of the Company’s 2 "

Global $a_sre = StringRegExp($s_test_str, $s_sre_pattern, 3)
_ArrayDisplay($a_sre)
Edited by SmOke_N

[center]Common sense plays a role in the basics of understanding AutoIt... If you're lacking in that, do us all a favor, and step away from the computer.[/center]


#4 ·  Posted (edited)

Thanks, SmOke_N! Why does it miss some ZIP codes? Yikes, I wish I knew how to construct regular expressions. Is there a tool that does it for you?

The good thing is that I have the corporate address, so if the meeting is held there I can match strings from that address,

but I also encounter problems like this: the address on file is

"7400 49TH AVE NORTH, NEW HOPE MN 55428"
and in the letter it is spelled like
"7400 49th Avenue North New Hope, Minnesota 55428"
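One way to compare two spellings like these is to normalize both before matching. A rough sketch, where `_NormalizeAddress` and its short abbreviation list are made up for illustration (a real list would need every street suffix and state):

```autoit
Local $sOnFile = "7400 49TH AVE NORTH, NEW HOPE MN 55428"
Local $sInLetter = "7400 49th Avenue North New Hope, Minnesota 55428"

; == is a case-sensitive string comparison; both sides are already upper case here.
If _NormalizeAddress($sOnFile) == _NormalizeAddress($sInLetter) Then
    ConsoleWrite("Same address after normalizing" & @CRLF)
Else
    ConsoleWrite("Different addresses" & @CRLF)
EndIf

; Hypothetical helper: upper-case, drop punctuation, abbreviate, collapse spaces.
Func _NormalizeAddress($sAddr)
    Local $s = StringUpper($sAddr)
    $s = StringRegExpReplace($s, "[.,]", " ")          ; punctuation becomes a space
    $s = StringRegExpReplace($s, "\bAVENUE\b", "AVE")  ; sample abbreviations only
    $s = StringRegExpReplace($s, "\bSTREET\b", "ST")
    $s = StringRegExpReplace($s, "\bROAD\b", "RD")
    $s = StringRegExpReplace($s, "\bMINNESOTA\b", "MN")
    $s = StringRegExpReplace($s, "\s+", " ")           ; collapse runs of whitespace
    Return StringStripWS($s, 3)                        ; 3 = strip leading and trailing
EndFunc
```

Both sample strings should reduce to "7400 49TH AVE NORTH NEW HOPE MN 55428", so the comparison reports a match.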

2nd problem: not everything is in the USA, e.g. "Palais De Beaulieu, Rome Room, in Lausanne, Switzerland".

I'm looking for a quick solution. Maybe a web service: I pass in a string and it gives me back an address or, even better, a geolocation.

Resume parsing, maybe? Is there something like a no-registration resume upload site that parses the address?

Also, if I can just find where the address starts and where it ends,

then I can pass it to Google Maps to get the geolocation without parsing the address any further.

Edited by tobject


String manipulation needs anchors: things that are constant, so there is something to pull the rest out by.

More than likely, you would have to build a very elaborate AI system to pull off what you want, unless you had all the output address rules.

It's obviously not as simple as an "I want, so give me" type of thing.

I gave you a base to work with; I'd suggest (if you know RegEx) working from that.

If you don't know regex, then as far as websites that do this type of thing go, I'd imagine your time on Google would be just as efficient as mine.

The only other option if you don't know all the rules, is to give someone step by step how you get this data, give them access to be able to pull the data out and examine it, and more than likely, be willing to pay for the countless hours it would take for them to be able to distinguish all the string manipulation rules it would take to accomplish what you want.
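On the ZIP codes that get missed: reading the pattern from the earlier post, its ZIP piece `(?:,\s*\d{5}(?:\s*-\s*\d{4})?)?` only matches when a comma comes before the digits, so "Maryland 20814" loses its ZIP while "Virginia, 20190" keeps it. A small sketch of a variant with the comma made optional follows. The two-state alternation is a shortened stand-in for the full list, and `$sNewZip` is a suggested tweak, not part of the posted script.

```autoit
Local $sOldZip = "(?:,\s*\d{5}(?:\s*-\s*\d{4})?)?"    ; as posted: comma required
Local $sNewZip = "(?:,?\s*\d{5}(?:\s*-\s*\d{4})?\b)?" ; sketch: comma optional
Local $aSamples[2] = ["Bethesda, Maryland 20814, on Monday", _
        "Reston, Virginia, 20190 at 8:00 a.m."]

For $i = 0 To UBound($aSamples) - 1
    ; One capture group around state + ZIP; flag 1 returns the first match as an array.
    Local $aOld = StringRegExp($aSamples[$i], "(?i)((?:Maryland|Virginia)" & $sOldZip & ")", 1)
    Local $aNew = StringRegExp($aSamples[$i], "(?i)((?:Maryland|Virginia)" & $sNewZip & ")", 1)
    ConsoleWrite("old: " & $aOld[0] & "   new: " & $aNew[0] & @CRLF)
Next
```

With the first sample the old piece should stop at "Maryland" while the new one returns "Maryland 20814"; for the second sample both return "Virginia, 20190".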



SmOke_N - that's a nice example of SRE to learn from. Indeed this is something of an AI type project. Each country will have different postal code formats. Some houses in England have names instead of numbers. In this case, it might be easier to concentrate on the language surrounding the address, such as:

The meeting will be held at ...

Meet at ...

The address is ...

The address is as follows ...

the following address ...

Write to ...

Reply to ...
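Anchoring on the surrounding language could look something like this. It is a minimal sketch, not a finished parser: the cue list is taken from the phrases above, and the stop conditions (", on <weekday>", " at <h:mm>", or the end of a sentence) are assumptions based on the sample texts in this thread.

```autoit
; Cue phrases that tend to precede an address (extend as needed).
Local $sCues = "held at|meet at|the address is|write to|reply to"
; Assumed stop conditions: ", on <weekday>", " at <h:mm>", end of sentence, end of text.
Local $sStop = "(?:,?\s+on\s+\w+day\b|\s+at\s+\d{1,2}:\d{2}|\.(?:\s|$)|$)"
Local $sPattern = "(?i)\b(?:" & $sCues & ")\s+(.+?)" & $sStop

Local $sText = "will be held at the Hilton Pikesville, 1726 Reisterstown Road, " & _
        "Pikesville, Maryland, on Monday, October 4, 2010 at 8:30 a.m., local time"

Local $aHits = StringRegExp($sText, $sPattern, 3) ; 3 = return all matches as an array
If @error Then
    ConsoleWrite("No cue phrase found" & @CRLF)
Else
    For $i = 0 To UBound($aHits) - 1
        ConsoleWrite($aHits[$i] & @CRLF)
    Next
EndIf
```

For this sample the capture should be "the Hilton Pikesville, 1726 Reisterstown Road, Pikesville, Maryland". Texts that say "held on <date> at the ..." would need a looser cue, such as a bare "at the", at the cost of more false hits.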


#7 ·  Posted (edited)

More than likely, you would have to build a very elaborate AI system to pull off what you want, unless you had all the output address rules.

We're like hackers. Someone somewhere has already done this. No need to reinvent the wheel!

We just need to find where it is done and use it. Some web service,

or maybe I'm just wasting my time and there's a site like YourNextShareholderMeetingDotCom with all the addresses.

Here's what I've got so far, with SmOke_N's help and my guess that the meeting is probably held at the company's office:

all raw data in Line1,Line2,Line3

parsed address in MeetAddr1,MeetAddr2, MeetAddr3 using SmOke_N's example

Date and time almost perfect

Edited by tobject


#8 ·  Posted (edited)

Checking Resume parsers

Looks like this one does a somewhat good job. It requires selecting a country.

Edited by tobject


#9 ·  Posted (edited)

I see some occasional e-mails in there.

What's the RegExp to get an e-mail address?

Edited by tobject


http://www.regular-expressions.info/tutorial.html

Again, Google is your friend.

To help, the RegEx engine we use is PCRE.
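As a starting point for the e-mail question, a commonly used simplified pattern works with `StringRegExp`. This is a sketch only: it does not cover every address the e-mail standards allow, and the sample text and addresses are invented for the example.

```autoit
; Invented sample text with two made-up addresses.
Local $sText = "Questions? Write to ir@example.com or proxy.desk@example.org."
; Simplified pattern: local part, "@", one or more domain labels, then a TLD of 2+ letters.
Local $sEmail = "(?i)\b[a-z0-9._%+-]+@[a-z0-9-]+(?:\.[a-z0-9-]+)*\.[a-z]{2,}\b"

Local $aMails = StringRegExp($sText, $sEmail, 3) ; 3 = return all matches as an array
If @error Then
    ConsoleWrite("No e-mail address found" & @CRLF)
Else
    For $i = 0 To UBound($aMails) - 1
        ConsoleWrite($aMails[$i] & @CRLF)
    Next
EndIf
```

On the sample text this should list ir@example.com and proxy.desk@example.org, leaving off the sentence-ending period.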



thanks!

