I have a dataset like below, and want to collapse a subject so that I can see if they were diagnosed with an of the 3 diseases at all within the past 3 years using Disease1-3 are binary yes/no flags.
My original dataset
subject year disease1 disease2 disease 3
a 2020 1 1 1
a 2021 0 0 0
a 2022 0 0 0
b 2020 0 1 0
b 2021 1 0 0
b 2022 0 0 1
My desired output is below. For example - for subject a in 2021, since they had all 3 diseases in the prior year of 2020, they should also have flags for all those diseases in 2021 and 2022.
My desired output
subject year disease1 disease2 disease 3
a 2020 1 1 1
a 2021 1 1 1
a 2022 1 1 1
b 2020 0 1 0
b 2021 1 1 0
b 2022 1 1 1
What would be the best way about going to do this? I've tried using a do loop and the retain statement, but get stuck due to the fact that there are multiple columns to consider (disease1-disease3).
One way for your example:
data have;
  input subject $  year   disease1   disease2   disease3;
datalines;
a 2020       1               1                 1
a 2021       0               0                 0
a 2022       0               0                 0
b 2020       0               1                 0
b 2021       1               0                 0
b 2022       0               0                 1
;
data want;
   set have;
   by subject;
   retain ld1 ld2 ld3;
   if first.subject then do;
      ld1=disease1;
      ld2=disease2;
      ld3=disease3;
   end;
   else do;
      disease1= max(disease1,ld1);
      disease2= max(disease2,ld2);
      disease3= max(disease3,ld3);
      ld1=disease1;
      ld2=disease2;
      ld3=disease3;
   end;
   drop ld1- ld3;
run;
If you had more variables it would likely use a couple of Arrays to reduce code but at 3 values the saving in code lines isn't much if any.
Note: if your data is not actually sorted by the Subject either sort the data by subject and year, or if the values are grouped by subject use the NOTSORTED option on the By statement.
The time of the assignments such as the FIRST value of the Subject to not compare to previous values is one thing. The BY statement creates automatic variables of First. and Last. for each variable on the By statement. These variables are 1/0 numeric values that SAS will treat as True/False for use in logical comparisons.
You need to retain 3 separate values as you want 3 different "streams".
Thank you, this makes a lot of sense! I have another follow-up question - what if we were to have 4 years of data per individual, but we only wanted to see if they had that disease within the last 2 years?
My original dataset
subject year disease1 disease2 disease 3
a 2019 1 1 1
a 2020 0 0 0
a 2021 0 0 0
a 2022 0 0 0
b 2019 0 1 1
b 2020 1 0 0
b 2021 1 0 0
b 2022 0 0 1
My desired output is below. For example - for subject a in 2021, since they had all 3 diseases in the prior year of 2019, they should also have flags for all those diseases in 2020 and 2021. But since 2022 is after the 2 year mark, and they did not have the disease in 2020 and 2021, they will not have flags for the year 2022.
My desired output
subject year disease1 disease2 disease 3
a 2019 1 1 1
a 2020 1 1 1
a 2021 1 1 1
a 2022 0 0 0
b 2019 0 1 1
b 2020 1 1 1
b 2021 1 1 1
b 2022 1 0 1
data have;
input subject $ year disease1 disease2 disease3;
datalines;
a 2020 1 1 1
a 2021 0 0 0
a 2022 0 0 0
b 2020 0 1 0
b 2021 1 0 0
b 2022 0 0 1
;
data temp;
   set have;
   array d disease:;
   do over d; 
      if d = 0 then d = .;
   end;
run;
data want;
   update temp(obs=0) temp;
   by subject;
   array d disease:;
   do over d; 
      if d = . then d = 0;
   end;
   output;
run;data have;
  input subject $  year   disease1   disease2   disease3;
datalines;
a              2019       1               1                 1
a              2020       0               0                 0
a              2021       0               0                 0
a              2022       0               0                 0
b              2019       0               1                 1
b              2020       1               0                 0
b              2021       1               0                 0
b              2022       0               0                 1
;
data d1;
 set have(where=(disease1=1));
 year=year+1;output;
 year=year+1;output;
 keep subject year;
run;
data d2;
 set have(where=(disease2=1));
 year=year+1;output;
 year=year+1;output;
 keep subject year;
run;
data d3;
 set have(where=(disease3=1));
 year=year+1;output;
 year=year+1;output;
 keep subject year;
run;
data want;
 if _n_=1 then do;
  declare hash d1(dataset:'d1');
  d1.definekey('subject','year');
  d1.definedone();
  declare hash d2(dataset:'d2');
  d2.definekey('subject','year');
  d2.definedone();
  declare hash d3(dataset:'d3');
  d3.definekey('subject','year');
  d3.definedone();
 end;
set have;
if d1.check()=0 then disease1=1;
if d2.check()=0 then disease2=1;
if d3.check()=0 then disease3=1;
run;
It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.
