- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hello SAS community!
I'm kind of stuck here so would appreciate any and all advice. I've got a dataset with a list of building locations including ID, address, city, state, and zip. Is there a way to pull out only rows where it's the same address but different city, or rows with the same zip but different state? These are data quality errors that I'd like to identify.
Thanks!
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
You COULD try a cartesian join in SQL with an appropriate WHERE clause.
data have;
infile datalines dsd;
ID+1;
informat address $25. city $10. state $2. zip;
input address city state zip;
datalines;
550 Jack Horner, Rhymes, NJ, 06430
123 My Road, Omaha, NE, 23143
123 My Road, Auburn, NE, 23143
123 My Road, Omaha, NE, 23144
123 My Road, Auburn, NE, 23144
123 My Road, Auburn, ME, 04210
311 Main St, Big Sur, CA, 90201
;
proc sql;
title "Complete table";
select *
from have
;
title "Problem Rows";
select *
from have as a
, have as b
where a.zip=b.zip
and a.address=b.address
and a.State=b.State
and a.City <> b.City
;
quit;
title;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
If the idea is to check for data quality you might consider some of the ZIP related functions.
Such as ZIPSTATE, ZIPNAME or ZIPNAMEL (depending on the way you have state in the data).
If you have 2 letter state codes you could do something like:
Data zipstateprob; set have; if zipname(zipcode) ne Statecode then output; end;
to get a list where the zip appears to disagree with the state.
There is also a ZIPCITY that returns the name of a city and the 2 letter state code.
These do find "mismatches" for some areas because of local spelliing, like NY, NY or ABQ, NM or CdA, ID
but may be helpful.
I'm not sure I would start with "same address, different city" as there are just too many streets with common numbers. Quick, how far are you from a street with a name like Washington, Jefferson, Franklin, Lincoln, or other president or state names? Not to mention "Main St" or "Frontage Rd".
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
proc sql;
create table want as
select * from have
group by address
having count(distinct city) > 1;
quit;