BookmarkSubscribeRSS Feed
gabagotati
Calcite | Level 5

I have a dataset like below, and want to collapse a subject so that I can see if they were diagnosed with an of the 3 diseases at all within the past 3 years using Disease1-3 are binary yes/no flags.

 

My original dataset

subject   year   disease1   disease2   disease 3

a              2020       1               1                 1

a              2021       0               0                 0

a              2022       0               0                 0

b              2020       0               1                 0

b              2021       1               0                 0

b              2022       0               0                 1

 

My desired output is below. For example - for subject a in 2021, since they had all 3 diseases in the prior year of 2020, they should also have flags for all those diseases in 2021 and 2022.

 

My desired output

subject   year   disease1   disease2   disease 3

a              2020       1               1                 1

a              2021       1               1                 1

a              2022       1               1                 1

b              2020       0               1                 0

b              2021       1               1                 0

b              2022       1               1                 1

 

What would be the best way about going to do this? I've tried using a do loop and the retain statement, but get stuck due to the fact that there are multiple columns to consider (disease1-disease3).

5 REPLIES 5
ballardw
Super User

One way for your example:

 

data have;
  input subject $  year   disease1   disease2   disease3;
datalines;
a 2020       1               1                 1
a 2021       0               0                 0
a 2022       0               0                 0
b 2020       0               1                 0
b 2021       1               0                 0
b 2022       0               0                 1
;

data want;
   set have;
   by subject;
   retain ld1 ld2 ld3;
   if first.subject then do;
      ld1=disease1;
      ld2=disease2;
      ld3=disease3;
   end;
   else do;
      disease1= max(disease1,ld1);
      disease2= max(disease2,ld2);
      disease3= max(disease3,ld3);
      ld1=disease1;
      ld2=disease2;
      ld3=disease3;
   end;
   drop ld1- ld3;
run;

If you had more variables it would likely use a couple of Arrays to reduce code but at 3 values the saving in code lines isn't much if any.

Note: if your data is not actually sorted by the Subject either sort the data by subject and year, or if the values are grouped by subject use the NOTSORTED option on the By statement.

The time of the assignments such as the FIRST value of the Subject to not compare to previous values is one thing. The BY statement creates automatic variables of First. and Last. for each variable on the By statement. These variables are 1/0 numeric values that SAS will treat as True/False for use in logical comparisons.

You need to retain 3 separate values as you want 3 different "streams".

 

 

gabagotati
Calcite | Level 5

Thank you, this makes a lot of sense! I have another follow-up question - what if we were to have 4 years of data per individual, but we only wanted to see if they had that disease within the last 2 years?

 

My original dataset

subject   year   disease1   disease2   disease 3

a              2019       1               1                 1

a              2020       0               0                 0

a              2021       0               0                 0

a              2022       0               0                 0

b              2019       0               1                 1

b              2020       1               0                 0

b              2021       1               0                 0

b              2022       0               0                 1

 

My desired output is below. For example - for subject a in 2021, since they had all 3 diseases in the prior year of 2019, they should also have flags for all those diseases in 2020 and 2021. But since 2022 is after the 2 year mark, and they did not have the disease in 2020 and 2021, they will not have flags for the year 2022. 

 

My desired output

subject   year   disease1   disease2   disease 3

a              2019       1               1                 1

a              2020       1               1                 1

a              2021       1               1                  1

a              2022       0               0                 0

b              2019       0               1                 1

b              2020       1               1                 1

b              2021       1               1                 1

b              2022       1               0                 1

PeterClemmensen
Tourmaline | Level 20
data have;
input subject $ year disease1 disease2 disease3;
datalines;
a 2020 1 1 1
a 2021 0 0 0
a 2022 0 0 0
b 2020 0 1 0
b 2021 1 0 0
b 2022 0 0 1
;

data temp;
   set have;
   array d disease:;
   do over d; 
      if d = 0 then d = .;
   end;
run;

data want;
   update temp(obs=0) temp;
   by subject;
   array d disease:;
   do over d; 
      if d = . then d = 0;
   end;
   output;
run;
gabagotati
Calcite | Level 5
Thank you, this works great as well! Do you have any guidance as to how to solve my follow-up question posted above?
Ksharp
Super User
data have;
  input subject $  year   disease1   disease2   disease3;
datalines;
a              2019       1               1                 1
a              2020       0               0                 0
a              2021       0               0                 0
a              2022       0               0                 0
b              2019       0               1                 1
b              2020       1               0                 0
b              2021       1               0                 0
b              2022       0               0                 1
;
data d1;
 set have(where=(disease1=1));
 year=year+1;output;
 year=year+1;output;
 keep subject year;
run;
data d2;
 set have(where=(disease2=1));
 year=year+1;output;
 year=year+1;output;
 keep subject year;
run;
data d3;
 set have(where=(disease3=1));
 year=year+1;output;
 year=year+1;output;
 keep subject year;
run;


data want;
 if _n_=1 then do;
  declare hash d1(dataset:'d1');
  d1.definekey('subject','year');
  d1.definedone();

  declare hash d2(dataset:'d2');
  d2.definekey('subject','year');
  d2.definedone();

  declare hash d3(dataset:'d3');
  d3.definekey('subject','year');
  d3.definedone();
 end;
set have;
if d1.check()=0 then disease1=1;
if d2.check()=0 then disease2=1;
if d3.check()=0 then disease3=1;
run;

Ksharp_0-1661428059149.png

 

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 5 replies
  • 1210 views
  • 0 likes
  • 4 in conversation