Working with a large dataset of geocode results and I would like clean the zip codes by applying the zip of the more (or most) accurate values (indicated by SCORE) within each group*. Thanks as always.
*by id, street, city, and st.
HAVE:
ID SCORE STREET CITY ST ZIP YEAR
1001 75 123 MAIN ST MYTOWN MI 48009 2002
1001 50 123 MAIN ST MYTOWN MI 48999 2004
1001 75 123 MAIN ST MYTOWN MI 48009 2006
1001 50 123 MAIN ST MYTOWN MI 48999 2008
1001 75 123 MAIN ST MYTOWN MI 48009 2010
1001 60 999 WE MOVED XTOWN CA 94903 2012
1001 60 999 WE MOVED XTOWN CA 94903 2014
WANT:
ID SCORE STREET CITY ST ZIP YEAR
1001 75 123 MAIN ST MYTOWN MI 48009 2002
1001 50 123 MAIN ST MYTOWN MI 48009 2004
1001 75 123 MAIN ST MYTOWN MI 48009 2006
1001 50 123 MAIN ST MYTOWN MI 48009 2008
1001 75 123 MAIN ST MYTOWN MI 48009 2010
1001 60 999 WE MOVED XTOWN CA 94903 2012
1001 60 999 WE MOVED XTOWN CA 94903 2014
data have;
input id $ score street $9-20 city $ state $ zip $ year;
datalines;
1001 75 123 MAIN ST MYTOWN MI 48009 2002
1001 50 123 MAIN ST MYTOWN MI 48999 2004
1001 75 123 MAIN ST MYTOWN MI 48009 2006
1001 50 123 MAIN ST MYTOWN MI 48999 2008
1001 75 123 MAIN ST MYTOWN MI 48009 2010
1001 60 999 WE MOVED XTOWN CA 94903 2012
1001 60 999 WE MOVED XTOWN CA 94903 2014
;
run;
* step 1: assign number to keep original order;
data int;
set have;
n = _N_;
run;
* step 2: sort;
proc sort data=int;
by id state city street descending score;
run;
* step 3 replace zips;
data int2;
set int;
by id state city street;
retain int_zip;
if first.street
then int_zip = zip;
else zip = int_zip;
run;
* step 4: recreate original order in final step;
proc sort
data=int2
out=want (drop=int_zip n)
;
by n;
run;
* step 1: assign number to keep original order;
data int;
set have;
n = _N_;
run;
* step 2: sort;
proc sort data=int;
by id state city street descending score;
run;
* step 3 replace zips;
data int2;
set int;
by id state city street;
retain int_zip;
if first.street
then int_zip = zip;
else zip = int_zip;
run;
* step 4: recreate original order in final step;
proc sort
data=int2
out=want (drop=int_zip n)
;
by n;
run;
Its not clear to me why the records as number 50 in the below should be changed or this:
"the more (or most) accurate values (indicated by SCORE)"
Do you meant the max number within that group is the one to replace, if so and there is only one combination of ID+max(SCORE)+ZIP then you could just merge is back on with:
proc sql; create table WANT as select A.ID, A.SCORE, A.STREET, A.CITY, A.STATE, B.ZIP, A.YEAR from HAVE A left join (select distinct ID,ZIP from HAVE group by ID having SCORE=max(SCORE)) B on A.ID=B.ID; quit;
However, what if you have:
1 75 A8001
1 75 A8002
So 75 is the max, but has two differing postcode, which one to use? I would probably do a sort, then take first record, something like:
proc sort data=have out=inter; by id descending score; run; data inter; set inter; by id; if first.id then output; run;
And merge that on, you can add the other variables into the sort, say zipcode for first one etc.
If you are running at least 9.4M1, then Proc GEOCODE will return the ZIP Code used in the results in M_ZIP. So, while ZIP is 48999, M_ZIP will be 48009. You can look at the variable _NOTES_. If it contains NOZCM (No Zip Code Match) then the ZIP Code given by "you" does not match the ZIP Code of the address (city and state).
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.