- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hi everyone,
I believe I have a relatively simple question, but I'm new to SAS so its been a bit daunting to try and figure it out.
I have about 100 datasets that each contain a single column with step counts collected by a study subject for over one year (one observation/day, i.e. about 365 observations per person). I would like to take the average of the step counts every 30 days, and label each of these 30-day averages as T1, T2, T3 etc.
I would then like to have a dataset that contains the T1, T2, T3 etc.. step average for each participant.
Not all participants were good about recording their steps every day so there are some missing values. For each 30 day block, I would like to calculate compliance (#completed/30x100).
Therefore I am hoping to go from many datasets that have approx 365 step count recordings in a single column to something like this:
ID T1_avg T2_avg T3_avg ..... T1_compliance T2_compliance T3_compliance ......
1
2
3
4
5
6
Some participants went beyond the 1 yr mark and others finished earlier so not everyone has exactly 365 days of steps. Therefore, I may have some participants with T14, T15, T16, and even some with T17.
Would anyone be able to help with this?
I really appreciate any tips you might have!
Accepted Solutions
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Using standard SAS data manipulation methods:
data subject_1;
infile "&sasforum.\datasets\subject_1.txt" truncover firstobs=2;
input steps;
run;
data subject_2;
infile "&sasforum.\datasets\subject_2.txt" truncover firstobs=2;
input steps;
run;
data steps;
set work.subject_: indsname=dsn curobs=d;
subject = scan(dsn,2);
day = d;
run;
proc sql;
create table steps30 as
select
subject,
ceil(day/30) as period,
mean(steps) as stepAverage,
100 * count(steps)/ count(*) as compliance
from steps
group by subject, calculated period;
quit;
/* Transform to wide data format (not recommended) */
proc transpose data=steps30 out=stepsAve(drop=_name_) prefix=ave_;
by subject;
id period;
var stepAverage;
run;
proc transpose data=steps30 out=stepsComp(drop=_name_) prefix=comp_;
by subject;
id period;
var compliance;
run;
data want;
merge stepsAve stepsComp;
by subject;
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Please provide data (in a text or CSV format) for us to play with.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Here is 2 example cases with some fake data, but it looks just like what I am dealing with!
I'd like to know the average steps per 30 day block, and then the compliance for that block (#complete/30x100)
Let me know if anything is unclear! Thanks so much
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Using standard SAS data manipulation methods:
data subject_1;
infile "&sasforum.\datasets\subject_1.txt" truncover firstobs=2;
input steps;
run;
data subject_2;
infile "&sasforum.\datasets\subject_2.txt" truncover firstobs=2;
input steps;
run;
data steps;
set work.subject_: indsname=dsn curobs=d;
subject = scan(dsn,2);
day = d;
run;
proc sql;
create table steps30 as
select
subject,
ceil(day/30) as period,
mean(steps) as stepAverage,
100 * count(steps)/ count(*) as compliance
from steps
group by subject, calculated period;
quit;
/* Transform to wide data format (not recommended) */
proc transpose data=steps30 out=stepsAve(drop=_name_) prefix=ave_;
by subject;
id period;
var stepAverage;
run;
proc transpose data=steps30 out=stepsComp(drop=_name_) prefix=comp_;
by subject;
id period;
var compliance;
run;
data want;
merge stepsAve stepsComp;
by subject;
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hi again,
I really appreciated your help with the code. It worked perfectly, but one thing I realized is that I need to exclude all the blank cells before the first observation, before I take the average of 30 day blocks. The location of this first observation varies quite a bit between participants. If someone started the study on the 30th of the month their first observation would start in the 30th cell (subject 1), whereas someone whose start date was on 3rd of the month will have their first observation in the 3rd cell (subject 2). I would like to only calculate the 30 day block from this first number. I can't simply remove all blank cells, since I don't want to exclude blank cells that occurred after the start date. Any idea how I can modify the code to calculate the average of the first 30 day block but only starting from the first observation?
Let me know if you have any ideas!
Thanks in advance!!
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
To remove the initial missing values, build the steps dataset with:
data steps;
retain day;
set work.subject_: indsname=dsn curobs=d;
subject = scan(dsn,2);
if d = 1 then call missing(day);
if missing(day) then do;
if not missing(steps) then day = 0;
end;
if not missing(day) then do;
day = day + 1;
output;
end;
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
1. Combine your data into a single file. If they have similar names you can do the following:
data combined;
set data1-data100 indsname=source;
dsn=source;
run;
2. Define your 30 day intervals
Not sure - you didn't indicate how day is captured in the data.
3. Run PROC MEANS to get the average based on 30 day intervals for each interval.
https://github.com/statgeek/SAS-Tutorials/blob/master/proc_means_basic.sas
That will have exactly what you want.
@abc44 wrote:
Hi everyone,
I believe I have a relatively simple question, but I'm new to SAS so its been a bit daunting to try and figure it out.
I have about 100 datasets that each contain a single column with step counts collected by a study subject for over one year (one observation/day, i.e. about 365 observations per person). I would like to take the average of the step counts every 30 days, and label each of these 30-day averages as T1, T2, T3 etc.
I would then like to have a dataset that contains the T1, T2, T3 etc.. step average for each participant.
Not all participants were good about recording their steps every day so there are some missing values. For each 30 day block, I would like to calculate compliance (#completed/30x100).
Therefore I am hoping to go from many datasets that have approx 365 step count recordings in a single column to something like this:
ID T1_avg T2_avg T3_avg ..... T1_compliance T2_compliance T3_compliance ......
1
2
3
4
5
6
Some participants went beyond the 1 yr mark and others finished earlier so not everyone has exactly 365 days of steps. Therefore, I may have some participants with T14, T15, T16, and even some with T17.
Would anyone be able to help with this?
I really appreciate any tips you might have!
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
You will find it much easier in the long run to create and use a long set such as
ID Period Average
1 T1 value
1 T2 Value
In fact I think you would like be much better off with the Period as a numeric instead of character value.
And are the blanks in your example data days without a measurement that you want counted for the compliance?
You need to make sure there is an actual ID value attached to each record. You do realize that your compliance will be wrong for the last 30 day period of the year with your definition. 365 or 366 yields 12 periods of 30 days and a 13th period of 5 or 6 days. Which even with 100 percent data will be 5/30 with your definition.
My take on this would look like:
data combined; set data1-data100 indsname=source curobs=cobs; dsn=source; day=cobs; period = int((day-1)/30) + 1; run; proc summary data=have nway; class dsn period; var t; output out=work.summary (drop=_:) mean= n= nmiss=/autoname; run; data work.compliance; set work.summary; compliance= (100 * t_n/(t_n +t_nmiss)); run;
This uses the count of values and missing values for the denominator in the compliance calculation.
If you want a report with the periods across then Proc Report will do that fine with the dsn (or id or what every you decide the source should be called) as a group variable and period an across variable with t_mean and compliance nested under.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Suppose that you've already converted your text files into the corresponding number of SAS data sets named sub_1, sub_2, ..., etc.; and each data set contains a single variable T indicating the number of steps. Then one way of getting the output you want is to proceed as follows:
%let nsub = 100 ; * number of subjects ;
%let span = 30 ; * calculation span, days ;
/* Step #1: ID the subjects */
data sub ;
set sub_1-sub_&nsub indsname = dsn ;
ID + dsn ne lag (dsn) ;
run ;
/* Step #2: Compute max number of intervals for transposition */
data _null_ ;
do _q = 1 by 1 until (last.id) ;
do _n_ = 1 to &span until (last.id) ;
set sub (keep = id) end = z ;
by id ;
end ;
end ;
retain q ;
q = q max _q ;
if z then call symputx ("q", q) ;
run ;
/* Step #3: Compute metrics and shape output */
data want (keep = ID T_:) ;
do _q = 1 by 1 until (last.id) ;
call missing (_N, _S) ;
do _n_ = 1 to &span until (last.id) ;
set sub ;
by id ;
_N = sum (_N, N (T)) ;
_S = sum (_S, T) ;
end ;
avg = divide (_S, _N) ;
cmp = 100 * divide (_N, &span) ;
array T_avg [&q] ;
array T_cmp [&q] ;
T_avg [_q] = avg ;
T_cmp [_q] = cmp ;
end ;
run ;
HTH
Paul D.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content