DATA Step, Macro, Functions and more

by group zip code cleaning

Accepted Solution Solved
Reply
Occasional Contributor
Posts: 14
Accepted Solution

by group zip code cleaning

Working with a large dataset of geocode results and I would like clean the zip codes by applying the zip of the more (or most) accurate values (indicated by SCORE) within each group*. Thanks as always.
*by id, street, city, and st.

 

HAVE:
ID      SCORE  STREET               CITY            ST  ZIP     YEAR
1001  75         123 MAIN ST        MYTOWN    MI  48009  2002
1001  50         123 MAIN ST        MYTOWN    MI  48999  2004
1001  75         123 MAIN ST        MYTOWN    MI  48009  2006
1001  50         123 MAIN ST        MYTOWN    MI  48999  2008
1001  75         123 MAIN ST        MYTOWN    MI  48009  2010
1001  60         999 WE MOVED   XTOWN      CA  94903  2012
1001  60         999 WE MOVED   XTOWN      CA  94903  2014

 

WANT:

ID      SCORE  STREET               CITY            ST  ZIP     YEAR
1001  75         123 MAIN ST        MYTOWN    MI  48009  2002
1001  50         123 MAIN ST        MYTOWN    MI  48009  2004
1001  75         123 MAIN ST        MYTOWN    MI  48009  2006
1001  50         123 MAIN ST        MYTOWN    MI  48009  2008
1001  75         123 MAIN ST        MYTOWN    MI  48009  2010
1001  60         999 WE MOVED   XTOWN      CA  94903  2012
1001  60         999 WE MOVED   XTOWN      CA  94903  2014

 

data have;
input id $ score street $9-20 city $ state $ zip $ year;
datalines;
1001 75 123 MAIN ST MYTOWN MI 48009 2002
1001 50 123 MAIN ST MYTOWN MI 48999 2004
1001 75 123 MAIN ST MYTOWN MI 48009 2006
1001 50 123 MAIN ST MYTOWN MI 48999 2008
1001 75 123 MAIN ST MYTOWN MI 48009 2010
1001 60 999 WE MOVED XTOWN CA 94903 2012
1001 60 999 WE MOVED XTOWN CA 94903 2014
;
run;

 


Accepted Solutions
Solution
‎03-17-2016 09:30 AM
Super User
Posts: 6,936

Re: by group zip code cleaning

* step 1: assign number to keep original order;

data int;
set have;
n = _N_;
run;

* step 2: sort;

proc sort data=int;
by id state city street descending score;
run;

* step 3 replace zips;

data int2;
set int;
by id state city street;
retain int_zip;
if first.street
then int_zip = zip;
else zip = int_zip;
run;

* step 4: recreate original order in final step;

proc sort
  data=int2
  out=want (drop=int_zip n)
;
by n;
run;
---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers

View solution in original post


All Replies
Solution
‎03-17-2016 09:30 AM
Super User
Posts: 6,936

Re: by group zip code cleaning

* step 1: assign number to keep original order;

data int;
set have;
n = _N_;
run;

* step 2: sort;

proc sort data=int;
by id state city street descending score;
run;

* step 3 replace zips;

data int2;
set int;
by id state city street;
retain int_zip;
if first.street
then int_zip = zip;
else zip = int_zip;
run;

* step 4: recreate original order in final step;

proc sort
  data=int2
  out=want (drop=int_zip n)
;
by n;
run;
---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
Super User
Super User
Posts: 7,401

Re: by group zip code cleaning

Its not clear to me why the records as number 50 in the below should be changed or this:

"the more (or most) accurate values (indicated by SCORE)"

Do you meant the max number within that group is the one to replace, if so and there is only one combination of ID+max(SCORE)+ZIP then you could just merge is back on with:

proc sql;
  create table WANT as
  select  A.ID,
          A.SCORE,
          A.STREET,
          A.CITY,
          A.STATE,
          B.ZIP,
          A.YEAR
  from    HAVE A
  left join (select distinct ID,ZIP from HAVE group by ID having SCORE=max(SCORE)) B
  on      A.ID=B.ID;
quit;

However, what if you have:

1  75  A8001

1  75  A8002

So 75 is the max, but has two differing postcode, which one to use?  I would probably do a sort, then take first record, something like:

proc sort data=have out=inter;
  by id descending score;
run;

data inter;
  set inter;
  by id;
  if first.id then output;
run;

And merge that on, you can add the other variables into the sort, say zipcode for first one etc.

 

SAS Employee
Posts: 170

Re: by group zip code cleaning

If you are running at least 9.4M1, then Proc GEOCODE will return the ZIP Code used in the results in M_ZIP.  So, while ZIP is 48999, M_ZIP will be 48009.  You can look at the variable _NOTES_.  If it contains NOZCM (No Zip Code Match) then the ZIP Code given by "you" does not match the ZIP Code of the address (city and state).

☑ This topic is SOLVED.

Need further help from the community? Please ask a new question.

Discussion stats
  • 3 replies
  • 252 views
  • 2 likes
  • 4 in conversation