Comparing 2 addresses formatted differently

Reply
Frequent Contributor
Posts: 95

Comparing 2 addresses formatted differently

hello,

 

I need to compare two addresses from same customer, formatted differently, in order to identify whether address is mismatch to the other one.

 

In dataset A, the address is properly formatted, and contains variables for: 'Apartment_Number' (if applicable); 'Street_Number'; 'Street'; 'Province'; 'Postal Code'.

 

I plan to compare datasets accross variable to identify address mismatches (street to street, postal to postal...)

 

In dataset B, the first 3 variables from above, are compiled in opposite order into one variable, 'Number_Street_Apt', where address is generally written as name suggests, but not always.  (The other two variables are the same). Here are some examples of variable 'Number_Street_Apt':

 

1039     VIEW ST APT 613
88       HAMPTONS CIR NW
1         RUE DE LHOTEL DE VIL
10038     150 ST #58
1760     CENTRAL ST
20865     CH DE LA COTE N #200
791       7E RUE O
116       ZACHARY PL
46       IRENE CRES
1385     RIVERSIDE DR W #905
542       RUE FILIATREAULT
11A       MINTO ST
3001     4E RANG O
281       WILLOW ST
1416     6 AV S
85       12E AV
330       VILLAGE GREEN SQ TH
3735     CASORSO RD APT 215

 

Using the Scan function, i have isolated the 'Street_Number' from dataset B, as that is fairly easy, as each observations generally begins with a series of numbers.

 

I now need to isolate the Apt number at end of string, where listed (not always), and sometimes may be preceded by various prefixes (#, No, Apt...).

 

Any suggestions? Is there is a function to instruct "Search from end of string, If numeric variable (i.e., the apt no.) found within last 8 characters, then transpose/copy into new variable; if not found, do nothing"? Or alternative ?

 

Thanks in advance,

Super User
Super User
Posts: 7,970

Re: Comparing 2 addresses formatted differently

Looks like you have incredibly messy data there, I mean what does: 150 ST #58, reference?  I suppose you could try building a set of rules, however looking at that data, every element seems to be different.  I would imagine the rules would be heirachical, something like:

if <rule1> then do;

  set value to result of <rule 1>;

  remove result from string;

end;

...

 

 

 

Frequent Contributor
Posts: 95

Re: Comparing 2 addresses formatted differently

Agreed.. it is messy.. 150 ST #58,  means Street OneFifty, unit 58. Thank you for the tip

Super User
Posts: 5,431

Re: Comparing 2 addresses formatted differently

Perhaps you should not try to reinvent the wheel. This is a common problem, and the variation are almost endless.

SAS Data Management Studio with it's Quality Knowledge Database (QKB) facilitates this kind of scenarios, perhaps you could take a look at that...

Data never sleeps
Frequent Contributor
Posts: 95

Re: Comparing 2 addresses formatted differently

thanks for the tip!
Super User
Posts: 5,511

Re: Comparing 2 addresses formatted differently

At least in the U.S., the problem can get even messier.  We have street names that are numbers.  You might need to parse:

 

150 West 29th Street, #58

 

At least you can locate the separate numeric words:

 

delimiters = compress(string, '0123456789');

numbers1 = scan(string, 1, delimiters);

numbers2 = scan(string, 2, delimiters);

numbers3 = scan(string, 3, delimiters);

 

There may be simpler ways to express this, but here is the idea.  The COMPRESS function locates all non-digits in the original string, so that the SCAN function can use those as delimiters.

 

This doesn't tell you whether the final set of numbers is an apartment or not.  After all, you could be looking at:

 

150 West 29th St.

 

But it gives you separate words holding each set of numbers as a starting point.

Frequent Contributor
Posts: 95

Re: Comparing 2 addresses formatted differently

Posted in reply to Astounding
thanks for your tip. The compress function is useful toward solving my issue.
David
Ask a Question
Discussion stats
  • 6 replies
  • 368 views
  • 2 likes
  • 4 in conversation