Solved: by group zip code cleaning

UMAnalyst · Posted 03-17-2016 08:59 AM

Working with a large dataset of geocode results and I would like clean the zip codes by applying the zip of the more (or most) accurate values (indicated by SCORE) within each group*. Thanks as always.
*by id, street, city, and st.

HAVE:
ID SCORE STREET CITY ST ZIP YEAR
1001 75 123 MAIN ST MYTOWN MI 48009 2002
1001 50 123 MAIN ST MYTOWN MI 48999 2004
1001 75 123 MAIN ST MYTOWN MI 48009 2006
1001 50 123 MAIN ST MYTOWN MI 48999 2008
1001 75 123 MAIN ST MYTOWN MI 48009 2010
1001 60 999 WE MOVED XTOWN CA 94903 2012
1001 60 999 WE MOVED XTOWN CA 94903 2014

WANT:

ID SCORE STREET CITY ST ZIP YEAR
1001 75 123 MAIN ST MYTOWN MI 48009 2002
1001 50 123 MAIN ST MYTOWN MI 48009 2004
1001 75 123 MAIN ST MYTOWN MI 48009 2006
1001 50 123 MAIN ST MYTOWN MI 48009 2008
1001 75 123 MAIN ST MYTOWN MI 48009 2010
1001 60 999 WE MOVED XTOWN CA 94903 2012
1001 60 999 WE MOVED XTOWN CA 94903 2014

data have;
input id $ score street $9-20 city $ state $ zip $ year;
datalines;
1001 75 123 MAIN ST MYTOWN MI 48009 2002
1001 50 123 MAIN ST MYTOWN MI 48999 2004
1001 75 123 MAIN ST MYTOWN MI 48009 2006
1001 50 123 MAIN ST MYTOWN MI 48999 2008
1001 75 123 MAIN ST MYTOWN MI 48009 2010
1001 60 999 WE MOVED XTOWN CA 94903 2012
1001 60 999 WE MOVED XTOWN CA 94903 2014
;
run;

Kurt_Bremser · Posted 03-17-2016 09:21 AM

* step 1: assign number to keep original order;

data int;
set have;
n = _N_;
run;

* step 2: sort;

proc sort data=int;
by id state city street descending score;
run;

* step 3 replace zips;

data int2;
set int;
by id state city street;
retain int_zip;
if first.street
then int_zip = zip;
else zip = int_zip;
run;

* step 4: recreate original order in final step;

proc sort
  data=int2
  out=want (drop=int_zip n)
;
by n;
run;

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

View solution in original post

Kurt_Bremser · Posted 03-17-2016 09:21 AM

* step 1: assign number to keep original order;

data int;
set have;
n = _N_;
run;

* step 2: sort;

proc sort data=int;
by id state city street descending score;
run;

* step 3 replace zips;

data int2;
set int;
by id state city street;
retain int_zip;
if first.street
then int_zip = zip;
else zip = int_zip;
run;

* step 4: recreate original order in final step;

proc sort
  data=int2
  out=want (drop=int_zip n)
;
by n;
run;

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

RW9 · Posted 03-17-2016 09:28 AM

Its not clear to me why the records as number 50 in the below should be changed or this:

"the more (or most) accurate values (indicated by SCORE)"

Do you meant the max number within that group is the one to replace, if so and there is only one combination of ID+max(SCORE)+ZIP then you could just merge is back on with:

proc sql;
  create table WANT as
  select  A.ID,
          A.SCORE,
          A.STREET,
          A.CITY,
          A.STATE,
          B.ZIP,
          A.YEAR
  from    HAVE A
  left join (select distinct ID,ZIP from HAVE group by ID having SCORE=max(SCORE)) B
  on      A.ID=B.ID;
quit;

However, what if you have:

1 75 A8001

1 75 A8002

So 75 is the max, but has two differing postcode, which one to use? I would probably do a sort, then take first record, something like:

proc sort data=have out=inter;
  by id descending score;
run;

data inter;
  set inter;
  by id;
  if first.id then output;
run;

And merge that on, you can add the other variables into the sort, say zipcode for first one etc.

Darrell_sas · Posted 03-21-2016 01:01 PM

If you are running at least 9.4M1, then Proc GEOCODE will return the ZIP Code used in the results in M_ZIP. So, while ZIP is 48999, M_ZIP will be 48009. You can look at the variable _NOTES_. If it contains NOZCM (No Zip Code Match) then the ZIP Code given by "you" does not match the ZIP Code of the address (city and state).

by group zip code cleaning

Re: by group zip code cleaning

Re: by group zip code cleaning

Re: by group zip code cleaning

Re: by group zip code cleaning

by group zip code cleaning

Re: by group zip code cleaning

Re: by group zip code cleaning

Re: by group zip code cleaning

Re: by group zip code cleaning

Register Today!

SAS Training: Just a Click Away