I have a dataset where some IDs have multiple observations. I want to create an indicator variable that tells me if certain variable values for all observations for an ID are the same (=1), or if any of the variable values for any of the ID's observations are different (=0). For an ID with a singular observation, I would want that to be same=1.
In the example below, I only care about a,b,c being the same. So, in the want dataset the "same" variable should be 1 for any ID where a,b,c were the same (regardless of what the other variables were), or if there was only one observation for the ID (id 2 and 4). For any IDs where any of a,b,c, were different in any observations I would want the "same" variable to be a 0 for each observation.
How could I best do this? Thank you.
data have;
input id date :yymmdd10. a b c d e f;
format date yymmdd10.;
datalines;
1 2020-01-01 0 1 1 0 0 0
1 2018-07-06 0 1 1 0 0 1
2 2015-02-15 1 2 0 1 2 3
3 2020-02-01 1 1 1 3 3 3
3 2021-07-05 1 1 0 3 3 3
3 2021-08-09 1 1 1 3 3 3
4 2020-01-30 0 2 2 1 0 0
5 2018-10-10 1 2 0 0 1 2
5 2019-10-11 1 2 0 1 2 2
;
RUN;
data want;
input id date :yymmdd10. a b c d e f same;
format date yymmdd10.;
datalines;
1 2020-01-01 0 1 1 0 0 0 1
1 2018-07-06 0 1 1 0 0 1 1
2 2015-02-15 1 2 0 1 2 3 1
3 2020-02-01 1 1 1 3 3 3 0
3 2021-07-05 1 1 0 3 3 3 0
3 2021-08-09 1 1 1 3 3 3 0
4 2020-01-30 0 2 2 1 0 0 1
5 2018-10-10 1 2 0 0 1 2 1
5 2019-10-11 1 2 0 1 2 2 1
;
RUN;
data have;
input id date :yymmdd10. a b c d e f;
format date yymmdd10.;
datalines;
1 2020-01-01 0 1 1 0 0 0
1 2018-07-06 0 1 1 0 0 1
2 2015-02-15 1 2 0 1 2 3
3 2020-02-01 1 1 1 3 3 3
3 2021-07-05 1 1 0 3 3 3
3 2021-08-09 1 1 1 3 3 3
4 2020-01-30 0 2 2 1 0 0
5 2018-10-10 1 2 0 0 1 2
5 2019-10-11 1 2 0 1 2 2
;
RUN;
proc sql;
create table want as
select *,(range(a)+range(b)+range(c))=0 as same
from have
group by id;
quit;
data have;
input id date :yymmdd10. a b c d e f;
format date yymmdd10.;
datalines;
1 2020-01-01 0 1 1 0 0 0
1 2018-07-06 0 1 1 0 0 1
2 2015-02-15 1 2 0 1 2 3
3 2020-02-01 1 1 1 3 3 3
3 2021-07-05 1 1 0 3 3 3
3 2021-08-09 1 1 1 3 3 3
4 2020-01-30 0 2 2 1 0 0
5 2018-10-10 1 2 0 0 1 2
5 2019-10-11 1 2 0 1 2 2
;
RUN;
proc sql;
create table want as
select *,(range(a)+range(b)+range(c))=0 as same
from have
group by id;
quit;
Thanks, ksharp. This worked perfectly!
Ah, so I found one problem in this dataset. If an id has any missing values for any of the variables of interest it appears this results in same=0 as well.
How can I modify the proc sql code so even same missing values are same=1? In other words, as long as the same missingness exists across multiple observations for that id, and all other variables are the same, I would want same=1.
data have;
input id date :yymmdd10. a b c d e f;
format date yymmdd10.;
datalines;
1 2020-01-01 0 1 . 0 0 0
1 2018-07-06 0 1 . 0 0 1
2 2015-02-15 1 2 0 1 2 3
3 2020-02-01 1 1 1 3 3 3
3 2021-07-05 1 1 0 3 3 3
3 2021-08-09 1 1 1 3 3 3
4 2020-01-30 0 2 . 1 0 0
5 2018-10-10 1 2 0 0 1 2
5 2019-10-11 1 2 0 1 2 2
;
RUN;
data want;
input id date :yymmdd10. a b c d e f same;
format date yymmdd10.;
datalines;
1 2020-01-01 0 1 . 0 0 0 1
1 2018-07-06 0 1 . 0 0 1 1
2 2015-02-15 1 2 0 1 2 3 1
3 2020-02-01 1 1 1 3 3 3 0
3 2021-07-05 1 1 0 3 3 3 0
3 2021-08-09 1 1 1 3 3 3 0
4 2020-01-30 0 2 . 1 0 0 1
5 2018-10-10 1 2 0 0 1 2 1
5 2019-10-11 1 2 0 1 2 2 1
;
RUN;
In that case it might be better to use the NLEVEL option of PROC FREQ instead. The only drawback would be if the variable could have so many distinct levels that PROC FREQ runs out of memory. But that type of variable should probably not be tested in this way anyway.
ods select none;
ods output nlevels=nlevels;
proc freq nlevels data=have ;
by id ;
tables a b c / noprint;
run;
ods select all;
proc transpose data=nlevels out=multilevel;
by id;
where nlevels ne 1;
id tablevar;
var nlevels;
run;
data want;
merge have multilevel(in=in1 keep=id);
by id;
same = not in1;
run;
[EDIT]
OK. Using SUM() function to replace with + operator, SUM() would take care of missing value.
data have;
input id date :yymmdd10. a b c d e f;
format date yymmdd10.;
datalines;
1 2020-01-01 0 1 . 0 0 0
1 2018-07-06 0 1 . 0 0 1
2 2015-02-15 1 2 0 1 2 3
3 2020-02-01 1 1 1 3 3 3
3 2021-07-05 1 1 0 3 3 3
3 2021-08-09 1 1 1 3 3 3
4 2020-01-30 0 2 . 1 0 0
5 2018-10-10 1 2 0 0 1 2
5 2019-10-11 1 2 0 1 2 2
;
RUN;
proc sql;
create table want as
select *,sum(range(a),range(b),range(c))=0 as same
from have
group by id;
quit;
And if you also need to consider about missing and zero or other non-mssing values (id=0), check this:
data have;
input id date :yymmdd10. a b c d e f;
format date yymmdd10.;
datalines;
0 2020-01-01 0 1 . 0 0 0
0 2018-07-06 0 1 0 0 0 1
0 2018-07-06 0 1 0 0 0 1
1 2020-01-01 0 1 . 0 0 0
1 2018-07-06 0 1 . 0 0 1
2 2015-02-15 1 2 0 1 2 3
3 2020-02-01 1 1 1 3 3 3
3 2021-07-05 1 1 0 3 3 3
3 2021-08-09 1 1 1 3 3 3
4 2020-01-30 0 2 . 1 0 0
5 2018-10-10 1 2 0 0 1 2
5 2019-10-11 1 2 0 1 2 2
;
RUN;
proc sql;
create table want as
select *,
sum(
count(distinct a)+(nmiss(a) ne 0),
count(distinct b)+(nmiss(b) ne 0),
count(distinct c)+(nmiss(c) ne 0)
)=3 as same
from have
group by id;
quit;
Use PROC SUMMARY to find the RANGE of the variables per ID group.
Then you can test it all of the ranges are zero.
If you want you an remerge the results back by ID.
data have;
input id date :yymmdd. a b c d e f;
format date yymmdd10.;
datalines;
1 2020-01-01 0 1 1 0 0 0
1 2018-07-06 0 1 1 0 0 1
2 2015-02-15 1 2 0 1 2 3
3 2020-02-01 1 1 1 3 3 3
3 2021-07-05 1 1 0 3 3 3
3 2021-08-09 1 1 1 3 3 3
4 2020-01-30 0 2 2 1 0 0
5 2018-10-10 1 2 0 0 1 2
5 2019-10-11 1 2 0 1 2 2
;
proc summary data=have;
by id;
var a b c ;
output out=range(drop=_freq_ _type_) range=;
run;
data range;
set range;
same = 0=min(of a b c)=max(of a b c);
run;
data want;
merge have range(keep=id same);
by id;
run;
Thanks, Tom. I'm realizing proc summary can be used to do a lot of things!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.