Hi All,
I feel like this is a pretty straightforward one, but I am still pretty new at data steps, so I thought I would call for backup.
Here's the situation.
I have a rather large dataset that has multiple records for about 80,000 individuals. Each idividual has anywhere from 1-15 different records associated with their ID. What I'm trying to determine is how often postal code changes among people with multiple records
Ideally the output would be a table with one record per ID with an indicator variable telling me if all the postal codes for a person are the same, or if there is one that is different.
for example, I would like this:
PATIENT ID POSTAL CODE
1 M4R
1 M4R
1 M4R
2 M4T
2 M4X
2 M4T
to turn into this:
PATIENT ID POSTAL CODE CHANGE
1 'NO' or 0
2 'YES' or 1
I assume the first step would be to sort by patient ID and by Postal Code, followed by so sort of "if"/ "then" statement, but I'm having trouble sorting out exactly how that statement should look.
Any thougths would be much appreciated!
Thanks so much.
Mike
I'd go about it slightly different:
(make sure that dataset is sorted by patient_id)
data have;
input patient_id post_code $;
cards;
1 M4R
1 M4R
1 M4R
2 M4T
2 M4X
2 M4T
;
run;
data want;
set have;
by patient_id;
retain code_change;
if post_code ne lag(post_code) then code_change = 1;
if first.patient_id then code_change = 0;
if last.patient_id then output;
keep patient_id code_change;
run;
proc print data=want noobs;
run;
Result:
patient_ code_ id change 1 0 2 1
Note how I presented your example data in a simple data step. Please do so in the future.
Here's a data step version of such an analysis, I could also imagine that you could write this in PROC SQL
/* UNTESTED CODE */ data want; set have; by patient_id; /* Assumes data set have is sorted by patient_id */ if first.patient_id then count=0; prev_postal_code=lag(postal_code); if prev_postal_code^=postal_code and not first.patient_id then count+1; run;
I'd go about it slightly different:
(make sure that dataset is sorted by patient_id)
data have;
input patient_id post_code $;
cards;
1 M4R
1 M4R
1 M4R
2 M4T
2 M4X
2 M4T
;
run;
data want;
set have;
by patient_id;
retain code_change;
if post_code ne lag(post_code) then code_change = 1;
if first.patient_id then code_change = 0;
if last.patient_id then output;
keep patient_id code_change;
run;
proc print data=want noobs;
run;
Result:
patient_ code_ id change 1 0 2 1
Note how I presented your example data in a simple data step. Please do so in the future.
You could squeeze just a bit more information out of the data in this way:
proc sort data=have;
by patient_id postal_code;
run;
data want;
set have;
by patient_id postal_code;
if first.patient_id then count=0;
if last.postal_code then count + 1;
if last.patient_id;
keep patient_id count;
run;
That way, you get more than a yes/no, you get a count of how many postal codes exist for each patient.
data have; input patient_id post_code $; cards; 1 M4R 1 M4R 1 M4R 2 M4T 2 M4X 2 M4T ; run; proc sql; select patient_id,case when count(distinct post_code)=1 then 0 else 1 end as flag from have group by patient_id; quit;
Thanks so much everyone. It's crazy to see how many different ways there are to get to the same output.
Also, thanks for the heads up about how to present data in the future. I can see how that would make life much easier for people offering solutions.
The help of this community is always much appreciated!
Mike
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.