Dear All,
I am working on a project for which I have to get mean levels of a disease marker (let's call it DM). The issue is that the mean must only include measures before the individual developed disease, and the time each person may have developed disease differs. So, for example, if ID=1 has 7 time points where DM was assessed and this person developed disease at time 5, I would need the mean of DM from time 1 to time 4. If ID=2 has DM measured from time 1 to 6 and the person developed disease at time 3, I'd need the mean DM from time 1 to time 2. If another person did not develop disease, I'd get the mean of all the available DM measures. I am really stuck on how to do this. Any help would be greatly appreciated!
Many thanks!
It all depends on your data structure. Please post an example.
Dear PGStats,
I have a wide data set (an example is given below).
data test;
input id dm1-dm7 disease time;
cards;
1 35 38 40 43 53 55 55 1 7
2 25 29 20 . 25 30 28 0 .
3 38 40 40 50 48 30 35 1 5
4 44 43 41 39 38 . 37 1 3
5 28 25 30 28 30 27 31 0 .
6 38 40 45 . 50 29 28 1 5
;
run;
Above, ID 1 developed disease at the 7th time point, so I'd need to get the means of DM1-DM6.
ID 2 did not develop disease, so I'd need the means of all available DM values.
ID 3 developed disease at time 5 so I'd get the mean of DM from times 1 to 4, etc.
Thank you for any help!
TDC.
run;
In case there is only one record per ID (and MEAN means AVERAGE)
than you can do it by:
data test;
input id dm1-dm7 disease time;
array dm {*} dm1-dm7;
sum_dm = 0;
if time > 1 then do;
do i=1 to (time -1);
sum_dm = sum(of sum_dm, dm(i));
end;
MEAN = sum_dm / (time -1);
end;
else MEAN = .;
cards;
1 35 38 40 43 53 55 55 1 7
2 25 29 20 . 25 30 28 0 .
3 38 40 40 50 48 30 35 1 5
4 44 43 41 39 38 . 37 1 3
5 28 25 30 28 30 27 31 0 .
6 38 40 45 . 50 29 28 1 5
;
run;
@Shmuel, your denominator will be wrong if missing dm values occur before time. If you want to calculate your own average, you must keep track of the count of non-missing dm values as well as the sum of dm values.
Something like this perhaps:
data want;
set test;
array dm dm1-dm7;
if disease and not missing(time) then do;
do i=1 to time;
dmsum=sum(dmsum,dm[i]);
end;
dmmean= dmsum/time;
end;
drop i;
run;
Transpose and use proc means.
proc sort data=test;
by id disease time;
run;
proc transpose data=test out=long;
by id disease time;
var dm1-dm7;
run;
data long;
set long;
*calculate measurement time from variable name, generally bad practice to store data in column headers;
dm_time=input(compress(_name_, , 'kd'), 8.);
run;
proc means data=long nway mean n std;
class id disease ;
var col1;
where dm_time < time or time=.;
run;
If you want to keep it wide:
data want;
set test;
array dm dm1-dm7;
array x{7};
do i = 1 to coalesce(time, dim(dm)+1) - 1;
x{i} = dm{i};
end;
meanDm = mean(of x{*});
drop i x:;
run;
Thank you All so much!! I really appreciate your help!!
TDC.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.