Solved: Flagging all observations where certain variables differ/are same by i...

sasgorilla · Posted 12-17-2024 12:33 AM

I have a dataset where some IDs have multiple observations. I want to create an indicator variable that tells me if certain variable values for all observations for an ID are the same (=1), or if any of the variable values for any of the ID's observations are different (=0). For an ID with a singular observation, I would want that to be same=1.

In the example below, I only care about a,b,c being the same. So, in the want dataset the "same" variable should be 1 for any ID where a,b,c were the same (regardless of what the other variables were), or if there was only one observation for the ID (id 2 and 4). For any IDs where any of a,b,c, were different in any observations I would want the "same" variable to be a 0 for each observation.

How could I best do this? Thank you.

data have;
input id date :yymmdd10.  a b c d e f;
format date yymmdd10.;
datalines;
1 2020-01-01 0 1 1 0 0 0
1 2018-07-06 0 1 1 0 0 1
2 2015-02-15 1 2 0 1 2 3
3 2020-02-01 1 1 1 3 3 3
3 2021-07-05 1 1 0 3 3 3
3 2021-08-09 1 1 1 3 3 3 
4 2020-01-30 0 2 2 1 0 0
5 2018-10-10 1 2 0 0 1 2
5 2019-10-11 1 2 0 1 2 2 
;
RUN;

data want;
input id date :yymmdd10.  a b c d e f same;
format date yymmdd10.;
datalines;
1 2020-01-01 0 1 1 0 0 0 1
1 2018-07-06 0 1 1 0 0 1 1
2 2015-02-15 1 2 0 1 2 3 1
3 2020-02-01 1 1 1 3 3 3 0
3 2021-07-05 1 1 0 3 3 3 0
3 2021-08-09 1 1 1 3 3 3 0  
4 2020-01-30 0 2 2 1 0 0 1
5 2018-10-10 1 2 0 0 1 2 1
5 2019-10-11 1 2 0 1 2 2 1
;
RUN;

Ksharp · Posted 12-17-2024 02:22 AM

data have;
input id date :yymmdd10.  a b c d e f;
format date yymmdd10.;
datalines;
1 2020-01-01 0 1 1 0 0 0
1 2018-07-06 0 1 1 0 0 1
2 2015-02-15 1 2 0 1 2 3
3 2020-02-01 1 1 1 3 3 3
3 2021-07-05 1 1 0 3 3 3
3 2021-08-09 1 1 1 3 3 3 
4 2020-01-30 0 2 2 1 0 0
5 2018-10-10 1 2 0 0 1 2
5 2019-10-11 1 2 0 1 2 2 
;
RUN;

proc sql;
create table want as
select *,(range(a)+range(b)+range(c))=0 as same
 from have
  group by id;
quit;

View solution in original post

Ksharp · Posted 12-17-2024 02:22 AM

data have;
input id date :yymmdd10.  a b c d e f;
format date yymmdd10.;
datalines;
1 2020-01-01 0 1 1 0 0 0
1 2018-07-06 0 1 1 0 0 1
2 2015-02-15 1 2 0 1 2 3
3 2020-02-01 1 1 1 3 3 3
3 2021-07-05 1 1 0 3 3 3
3 2021-08-09 1 1 1 3 3 3 
4 2020-01-30 0 2 2 1 0 0
5 2018-10-10 1 2 0 0 1 2
5 2019-10-11 1 2 0 1 2 2 
;
RUN;

proc sql;
create table want as
select *,(range(a)+range(b)+range(c))=0 as same
 from have
  group by id;
quit;

sasgorilla · Posted 12-17-2024 09:35 AM

Thanks, ksharp. This worked perfectly!

sasgorilla · Posted 12-17-2024 09:52 AM

Ah, so I found one problem in this dataset. If an id has any missing values for any of the variables of interest it appears this results in same=0 as well.

How can I modify the proc sql code so even same missing values are same=1? In other words, as long as the same missingness exists across multiple observations for that id, and all other variables are the same, I would want same=1.

data have;
input id date :yymmdd10.  a b c d e f;
format date yymmdd10.;
datalines;
1 2020-01-01 0 1 . 0 0 0
1 2018-07-06 0 1 . 0 0 1
2 2015-02-15 1 2 0 1 2 3
3 2020-02-01 1 1 1 3 3 3
3 2021-07-05 1 1 0 3 3 3
3 2021-08-09 1 1 1 3 3 3 
4 2020-01-30 0 2 . 1 0 0
5 2018-10-10 1 2 0 0 1 2
5 2019-10-11 1 2 0 1 2 2 
;
RUN;

data want;
input id date :yymmdd10.  a b c d e f same;
format date yymmdd10.;
datalines;
1 2020-01-01 0 1 . 0 0 0 1
1 2018-07-06 0 1 . 0 0 1 1
2 2015-02-15 1 2 0 1 2 3 1
3 2020-02-01 1 1 1 3 3 3 0
3 2021-07-05 1 1 0 3 3 3 0
3 2021-08-09 1 1 1 3 3 3 0  
4 2020-01-30 0 2 . 1 0 0 1
5 2018-10-10 1 2 0 0 1 2 1
5 2019-10-11 1 2 0 1 2 2 1
;
RUN;

Tom · Posted 12-17-2024 02:05 PM

In that case it might be better to use the NLEVEL option of PROC FREQ instead. The only drawback would be if the variable could have so many distinct levels that PROC FREQ runs out of memory. But that type of variable should probably not be tested in this way anyway.

ods select none;
ods output nlevels=nlevels;
proc freq nlevels data=have ;
  by id ;
  tables a b c / noprint;
run;
ods select all;

proc transpose data=nlevels out=multilevel;
  by id;
  where nlevels ne 1;
  id tablevar;
  var nlevels;
run;

data want;
  merge have multilevel(in=in1 keep=id);
  by id;
  same = not in1;
run;

Ksharp · Posted 12-17-2024 08:25 PM

[EDIT]

OK. Using SUM() function to replace with + operator, SUM() would take care of missing value.

data have;
input id date :yymmdd10.  a b c d e f;
format date yymmdd10.;
datalines;
1 2020-01-01 0 1 . 0 0 0
1 2018-07-06 0 1 . 0 0 1
2 2015-02-15 1 2 0 1 2 3
3 2020-02-01 1 1 1 3 3 3
3 2021-07-05 1 1 0 3 3 3
3 2021-08-09 1 1 1 3 3 3 
4 2020-01-30 0 2 . 1 0 0
5 2018-10-10 1 2 0 0 1 2
5 2019-10-11 1 2 0 1 2 2 
;
RUN;
proc sql;
create table want as
select *,sum(range(a),range(b),range(c))=0 as same
 from have
  group by id;
quit;

And if you also need to consider about missing and zero or other non-mssing values (id=0), check this:

data have;
input id date :yymmdd10.  a b c d e f;
format date yymmdd10.;
datalines;
0 2020-01-01 0 1 . 0 0 0
0 2018-07-06 0 1 0 0 0 1
0 2018-07-06 0 1 0 0 0 1
1 2020-01-01 0 1 . 0 0 0
1 2018-07-06 0 1 . 0 0 1
2 2015-02-15 1 2 0 1 2 3
3 2020-02-01 1 1 1 3 3 3
3 2021-07-05 1 1 0 3 3 3
3 2021-08-09 1 1 1 3 3 3 
4 2020-01-30 0 2 . 1 0 0
5 2018-10-10 1 2 0 0 1 2
5 2019-10-11 1 2 0 1 2 2 
;
RUN;
proc sql;
create table want as
select *,
sum(
count(distinct a)+(nmiss(a) ne 0),
count(distinct b)+(nmiss(b) ne 0),
count(distinct c)+(nmiss(c) ne 0)
)=3 as same
 from have
  group by id;
quit;

Tom · Posted 12-17-2024 09:32 AM

Use PROC SUMMARY to find the RANGE of the variables per ID group.

Then you can test it all of the ranges are zero.

If you want you an remerge the results back by ID.

data have;
  input id date :yymmdd. a b c d e f;
  format date yymmdd10.;
datalines;
1 2020-01-01 0 1 1 0 0 0
1 2018-07-06 0 1 1 0 0 1
2 2015-02-15 1 2 0 1 2 3
3 2020-02-01 1 1 1 3 3 3
3 2021-07-05 1 1 0 3 3 3
3 2021-08-09 1 1 1 3 3 3 
4 2020-01-30 0 2 2 1 0 0
5 2018-10-10 1 2 0 0 1 2
5 2019-10-11 1 2 0 1 2 2 
;

proc summary data=have;
  by id;
  var a b c ;
  output out=range(drop=_freq_ _type_) range=;
run;

data range;
  set range;
  same = 0=min(of a b c)=max(of a b c);
run;

data want;
  merge have range(keep=id same);
  by id;
run;

sasgorilla · Posted 12-17-2024 09:35 AM

Thanks, Tom. I'm realizing proc summary can be used to do a lot of things!

Flagging all observations where certain variables differ/are same by id

Re: Flagging all observations where certain variables differ/are same by id

Re: Flagging all observations where certain variables differ/are same by id

Re: Flagging all observations where certain variables differ/are same by id

Re: Flagging all observations where certain variables differ/are same by id

Re: Flagging all observations where certain variables differ/are same by id

Re: Flagging all observations where certain variables differ/are same by id

Re: Flagging all observations where certain variables differ/are same by id

Re: Flagging all observations where certain variables differ/are same by id

Registration is open

SAS Training: Just a Click Away