09-13-2016 07:53 AM
I need to compare two addresses from same customer, formatted differently, in order to identify whether address is mismatch to the other one.
In dataset A, the address is properly formatted, and contains variables for: 'Apartment_Number' (if applicable); 'Street_Number'; 'Street'; 'Province'; 'Postal Code'.
I plan to compare datasets accross variable to identify address mismatches (street to street, postal to postal...)
In dataset B, the first 3 variables from above, are compiled in opposite order into one variable, 'Number_Street_Apt', where address is generally written as name suggests, but not always. (The other two variables are the same). Here are some examples of variable 'Number_Street_Apt':
|1039 VIEW ST APT 613|
|88 HAMPTONS CIR NW|
|1 RUE DE LHOTEL DE VIL|
|10038 150 ST #58|
|1760 CENTRAL ST|
|20865 CH DE LA COTE N #200|
|791 7E RUE O|
|116 ZACHARY PL|
|46 IRENE CRES|
|1385 RIVERSIDE DR W #905|
|542 RUE FILIATREAULT|
|11A MINTO ST|
|3001 4E RANG O|
|281 WILLOW ST|
|1416 6 AV S|
|85 12E AV|
|330 VILLAGE GREEN SQ TH|
|3735 CASORSO RD APT 215|
Using the Scan function, i have isolated the 'Street_Number' from dataset B, as that is fairly easy, as each observations generally begins with a series of numbers.
I now need to isolate the Apt number at end of string, where listed (not always), and sometimes may be preceded by various prefixes (#, No, Apt...).
Any suggestions? Is there is a function to instruct "Search from end of string, If numeric variable (i.e., the apt no.) found within last 8 characters, then transpose/copy into new variable; if not found, do nothing"? Or alternative ?
Thanks in advance,
09-13-2016 08:16 AM
Looks like you have incredibly messy data there, I mean what does: 150 ST #58, reference? I suppose you could try building a set of rules, however looking at that data, every element seems to be different. I would imagine the rules would be heirachical, something like:
if <rule1> then do;
set value to result of <rule 1>;
remove result from string;
09-13-2016 09:25 AM
Perhaps you should not try to reinvent the wheel. This is a common problem, and the variation are almost endless.
SAS Data Management Studio with it's Quality Knowledge Database (QKB) facilitates this kind of scenarios, perhaps you could take a look at that...
09-13-2016 12:09 PM
At least in the U.S., the problem can get even messier. We have street names that are numbers. You might need to parse:
150 West 29th Street, #58
At least you can locate the separate numeric words:
delimiters = compress(string, '0123456789');
numbers1 = scan(string, 1, delimiters);
numbers2 = scan(string, 2, delimiters);
numbers3 = scan(string, 3, delimiters);
There may be simpler ways to express this, but here is the idea. The COMPRESS function locates all non-digits in the original string, so that the SCAN function can use those as delimiters.
This doesn't tell you whether the final set of numbers is an apartment or not. After all, you could be looking at:
150 West 29th St.
But it gives you separate words holding each set of numbers as a starting point.