Suppose you have a single address string
HAVE
ENROLLMENT GROUP 39 E COCONUT ST 3RD FL
WANT
[
(u'ENROLLMENT', 'Recipient'),
(u'GROUP', 'Recipient'),
(u'39', 'AddressNumber'),
(u'E', 'StreetNamePreDirectional'),
(u'COCONUT', 'StreetName'),
(u'ST', 'StreetNamePostType'),
(u'3RD', 'OccupancyIdentifier'),
(u'FL\n', 'OccupancyType')
]
* the Python USADDRESS package can do much more than
this. But lets see if it can parse the 6 permutations
of the same address below. I have already used PERL
to do some translations , ie East to E, Street to ST, THIRD to 3rd
floor to FL.
* create input for Python;
data adr_badadr;
file "d:/txt/KeyBadAdr.txt";
length keybadadr $100.;
input;
key=put(10000000000+_n_,z11.);
KeyBadAdr=cats(key,_infile_);
putlog KeyBadAdr;
put KeyBadAdr;
cards4;
ENROLLMENT GROUP 39 E COCONUT ST 3RD FL
ENROLLMENT GROUP 3RD FL 39 E COCONUT ST
39 E COCONUT ST ENROLLMENT GROUP 3RD FL
39 E COCONUT ST 3RD FL ENROLLMENT GROUP
3RD FL ENROLLMENT GROUP 39 E COCONUT ST
3RD FL 39 E COCONUT ST ENROLLMENT GROUP
;;;;
;run;quit;
/* in the log
10000000001ENROLLMENT GROUP 39 E COCONUT ST 3RD FL
10000000002ENROLLMENT GROUP 3RD FL 39 E COCONUT ST
1000000000339 E COCONUT ST ENROLLMENT GROUP 3RD FL
1000000000439 E COCONUT ST 3RD FL ENROLLMENT GROUP
100000000053RD FL ENROLLMENT GROUP 39 E COCONUT ST
100000000063RD FL 39 E COCONUT ST ENROLLMENT GROUP
*/
* It looks like you will have to do a lot of manual
cleaning before you can use open source packages
to clean stadardize your addresses.
* the most critical address parts for a strong matchcode are
street number
street name
Note it is not really possible to know if
'Enrollment' is a recipient or placename
Group could be a state or locality like UTAH,
depending on its position.
Also it is not always possible to get the
occupancy parts. It depends on the location of the occupancy parts.
You can really mess up a mailing with assumptions
on non street parts.
I am going to build a matchcode based on
First character of first name, last name, street number, street name
and zipcode.
Python USADDRESS does an excellent job of parsing the
the key elements for a match code, ie Street Number and Street Name. It also handles components like PO BOX.
* here is the simple python code that takes your input
and creates the output above.
%utl_submit_py64(%nrbquote(
fo = open('d:/txt/KeyBadAdrFix.txt', 'w');
import usaddress;
for line in open('d:/txt/KeyBadAdr.txt'):;
. addr=line[11:];
. key=line[0:11];
. res=key + str(usaddress.parse(addr)) + '\n';
. lyn=str(addr);
. fo.write(str(res));
));
* good way to look the python output, you need full SAS for this.
proc fslist file="d:/txt/KeyBadAdrFix.txt";
run;quit;
* here is the parsing of the first address by Python
10000000001[
(u'ENROLLMENT', 'Recipient'),
(u'GROUP', 'Recipient'),
(u'39', 'AddressNumber'),
(u'E', 'StreetNamePreDirectional'),
(u'COCONUT', 'StreetName'),
(u'ST', 'StreetNamePostType'),
(u'3RD', 'OccupancyIdentifier'),
(u'FL\n', 'OccupancyType')]
Here is the last record
10000000006[
(u'3RD', 'OccupancyIdentifier'),
(u'FL', 'OccupancyType'),
(u'39', 'AddressNumber'),
(u'E', 'StreetNamePreDirectional'),
(u'COCONUT', 'StreetName'),
(u'ST', 'StreetNamePostType'),
(u'ENROLLMENT', 'PlaceName'),
(u'GROUP\n', 'StateName')
* lets see if all of these addressses have the same
street number
street name
* create name value pairs for further processing;
data KeyBadAdrNrm;
retain seq 0 adrrec name value;
keep seq adrrec name value;
length txt $100 key $11 str $700;
infile "d:/txt/KeyBadAdrFix.txt" lrecl=1034;
input;
adrrec=substr(_infile_,1,11);
parens=countc(_infile_,')');
str=substr(_infile_,12);
do i=1 to parens;
txt=substr(scan(str,i,')'),3);
txt=compress(txt,"'[()'");
txt=substr(prxchange('s/\\n//',2,txt),2);
seq=seq+1;
value=scan(txt,1,',');
name=scan(txt,2,',');
output;
end;
run;quit;
p to 40 obs WORK.KEYBADADRNRM total obs=48
ENROLLMENT GROUP 39 E COCONUT ST 3RD FL
bs SEQ ADRREC NAME VALUE
1 1 10000000001 Recipient ENROLLMENT
2 2 10000000001 Recipient GROUP
3 3 10000000001 AddressNumber 39
4 4 10000000001 StreetNamePreDirectional E
5 5 10000000001 StreetName COCONUT
6 6 10000000001 StreetNamePostType ST
7 7 10000000001 OccupancyIdentifier 3RD
8 8 10000000001 OccupancyType FL
9 9 10000000002 Recipient ENROLLMENT
10 10 10000000002 Recipient GROUP
11 11 10000000002 Recipient 3RD
12 12 10000000002 Recipient FL
13 13 10000000002 AddressNumber 39
14 14 10000000002 StreetNamePreDirectional E
15 15 10000000002 StreetName COCONUT
16 16 10000000002 StreetNamePostType ST If you check you will see all addresses had street number and street name parsed correctly. Other elements depend on position in string, no way around this.
* you need to upcase, remove punctuation and
convert to standard USPS abreviations.
Perl has some free translations.
CAUSEWAY CSWY
CAUSWAY CSWY
CASWAY CSWY
CEN CTR
CENT CTR
CENTER CTR
CENTERS CTRS
CENTR CTR
CENTRE CTR
CIRC CIR
CIRCL CIR
CIRCLE CIR
CIRCLES CIRS
;
* invoking python from SAS;
%macro utl_submit_py64(pgm)/des="Semi colon separated set of py commands";
* write the program to a temporary file;
filename py_pgm "%sysfunc(pathname(work))/py_pgm.py" lrecl=32766 recfm=v;
data _null_;
length pgm $32755 cmd $255;
file py_pgm ;
pgm="&pgm";
semi=countc(pgm,';');
do idx=1 to semi;
cmd=cats(scan(pgm,idx,';'));
if cmd=:'.' then cmd=substr(cmd,2);
put cmd $char96.;
putlog cmd $char96.;
end;
run;
%let _loc=%sysfunc(pathname(py_pgm));
%put &_loc;
filename rut pipe "C:\Python_27_64bit/python.exe &_loc";
data _null_;
file print;
infile rut;
input;
put _infile_;
run;
filename rut clear;
filename py_pgm clear;
%mend utl_submit_py64;
... View more