SAS Programming

abc44 · Posted 11-15-2018 05:03 PM

Hi everyone,

I believe I have a relatively simple question, but I'm new to SAS so its been a bit daunting to try and figure it out.

I have about 100 datasets that each contain a single column with step counts collected by a study subject for over one year (one observation/day, i.e. about 365 observations per person). I would like to take the average of the step counts every 30 days, and label each of these 30-day averages as T1, T2, T3 etc.

I would then like to have a dataset that contains the T1, T2, T3 etc.. step average for each participant.

Not all participants were good about recording their steps every day so there are some missing values. For each 30 day block, I would like to calculate compliance (#completed/30x100).

Therefore I am hoping to go from many datasets that have approx 365 step count recordings in a single column to something like this:

ID T1_avg T2_avg T3_avg ..... T1_compliance T2_compliance T3_compliance ......

1

2

3

4

5

6

Some participants went beyond the 1 yr mark and others finished earlier so not everyone has exactly 365 days of steps. Therefore, I may have some participants with T14, T15, T16, and even some with T17.

Would anyone be able to help with this?

I really appreciate any tips you might have!

PGStats · Posted 11-16-2018 12:06 AM

Using standard SAS data manipulation methods:

data subject_1;
infile "&sasforum.\datasets\subject_1.txt" truncover firstobs=2;
input steps;
run;

data subject_2;
infile "&sasforum.\datasets\subject_2.txt" truncover firstobs=2;
input steps;
run;

data steps;
set work.subject_: indsname=dsn curobs=d;
subject = scan(dsn,2);
day = d;
run;

proc sql;
create table steps30 as
select
    subject,
    ceil(day/30) as period,
    mean(steps) as stepAverage,
    100 * count(steps)/ count(*) as compliance
from steps
group by subject, calculated period;
quit;

/* Transform to wide data format (not recommended) */
proc transpose data=steps30 out=stepsAve(drop=_name_) prefix=ave_;
by subject;
id period;
var stepAverage;
run;

proc transpose data=steps30 out=stepsComp(drop=_name_) prefix=comp_;
by subject;
id period;
var compliance;
run;

data want;
merge stepsAve stepsComp;
by subject;
run;

PG

View solution in original post

PGStats · Posted 11-15-2018 05:10 PM

Please provide data (in a text or CSV format) for us to play with.

PG

abc44 · Posted 11-15-2018 05:28 PM

Here is 2 example cases with some fake data, but it looks just like what I am dealing with!

I'd like to know the average steps per 30 day block, and then the compliance for that block (#complete/30x100)

Let me know if anything is unclear! Thanks so much

PGStats · Posted 11-16-2018 12:06 AM

Using standard SAS data manipulation methods:

data subject_1;
infile "&sasforum.\datasets\subject_1.txt" truncover firstobs=2;
input steps;
run;

data subject_2;
infile "&sasforum.\datasets\subject_2.txt" truncover firstobs=2;
input steps;
run;

data steps;
set work.subject_: indsname=dsn curobs=d;
subject = scan(dsn,2);
day = d;
run;

proc sql;
create table steps30 as
select
    subject,
    ceil(day/30) as period,
    mean(steps) as stepAverage,
    100 * count(steps)/ count(*) as compliance
from steps
group by subject, calculated period;
quit;

/* Transform to wide data format (not recommended) */
proc transpose data=steps30 out=stepsAve(drop=_name_) prefix=ave_;
by subject;
id period;
var stepAverage;
run;

proc transpose data=steps30 out=stepsComp(drop=_name_) prefix=comp_;
by subject;
id period;
var compliance;
run;

data want;
merge stepsAve stepsComp;
by subject;
run;

PG

abc44 · Posted 11-16-2018 09:51 AM

This worked perfectly - thanks SO much!

abc44 · Posted 11-19-2018 02:17 PM

Hi again,

I really appreciated your help with the code. It worked perfectly, but one thing I realized is that I need to exclude all the blank cells before the first observation, before I take the average of 30 day blocks. The location of this first observation varies quite a bit between participants. If someone started the study on the 30th of the month their first observation would start in the 30th cell (subject 1), whereas someone whose start date was on 3rd of the month will have their first observation in the 3rd cell (subject 2). I would like to only calculate the 30 day block from this first number. I can't simply remove all blank cells, since I don't want to exclude blank cells that occurred after the start date. Any idea how I can modify the code to calculate the average of the first 30 day block but only starting from the first observation?

Let me know if you have any ideas!

Thanks in advance!!

PGStats · Posted 11-19-2018 04:04 PM

To remove the initial missing values, build the steps dataset with:

data steps;
retain day;
set work.subject_: indsname=dsn curobs=d;
subject = scan(dsn,2);
if d = 1 then call missing(day);
if missing(day) then do;
    if not missing(steps) then day = 0;
    end;
if not missing(day) then do;
    day = day + 1;
    output;
    end;
run;

PG

Reeza · Posted 11-15-2018 05:12 PM

1. Combine your data into a single file. If they have similar names you can do the following:

data combined;
set data1-data100 indsname=source;
dsn=source;
run;

2. Define your 30 day intervals
Not sure - you didn't indicate how day is captured in the data.

3. Run PROC MEANS to get the average based on 30 day intervals for each interval.
https://github.com/statgeek/SAS-Tutorials/blob/master/proc_means_basic.sas

That will have exactly what you want.

@abc44 wrote:

Hi everyone,

I believe I have a relatively simple question, but I'm new to SAS so its been a bit daunting to try and figure it out.

I have about 100 datasets that each contain a single column with step counts collected by a study subject for over one year (one observation/day, i.e. about 365 observations per person). I would like to take the average of the step counts every 30 days, and label each of these 30-day averages as T1, T2, T3 etc.

I would then like to have a dataset that contains the T1, T2, T3 etc.. step average for each participant.

Not all participants were good about recording their steps every day so there are some missing values. For each 30 day block, I would like to calculate compliance (#completed/30x100).

Therefore I am hoping to go from many datasets that have approx 365 step count recordings in a single column to something like this:

ID T1_avg T2_avg T3_avg ..... T1_compliance T2_compliance T3_compliance ......

1

2

3

4

5

6

Some participants went beyond the 1 yr mark and others finished earlier so not everyone has exactly 365 days of steps. Therefore, I may have some participants with T14, T15, T16, and even some with T17.

Would anyone be able to help with this?

I really appreciate any tips you might have!

ballardw · Posted 11-15-2018 07:01 PM

You will find it much easier in the long run to create and use a long set such as

ID Period Average

1 T1 value

1 T2 Value

In fact I think you would like be much better off with the Period as a numeric instead of character value.

And are the blanks in your example data days without a measurement that you want counted for the compliance?

You need to make sure there is an actual ID value attached to each record. You do realize that your compliance will be wrong for the last 30 day period of the year with your definition. 365 or 366 yields 12 periods of 30 days and a 13th period of 5 or 6 days. Which even with 100 percent data will be 5/30 with your definition.

My take on this would look like:

data combined;
   set data1-data100 indsname=source curobs=cobs;
   dsn=source;
   day=cobs;
   period = int((day-1)/30) + 1;
run;

proc summary data=have nway;
   class dsn period;
   var t;
   output out=work.summary (drop=_:) mean= n= nmiss=/autoname;
run;

data work.compliance;
   set work.summary;
   compliance= (100 * t_n/(t_n +t_nmiss));
run;

This uses the count of values and missing values for the denominator in the compliance calculation.

If you want a report with the periods across then Proc Report will do that fine with the dsn (or id or what every you decide the source should be called) as a group variable and period an across variable with t_mean and compliance nested under.

hashman · Posted 11-15-2018 11:00 PM

@abc44:

Suppose that you've already converted your text files into the corresponding number of SAS data sets named sub_1, sub_2, ..., etc.; and each data set contains a single variable T indicating the number of steps. Then one way of getting the output you want is to proceed as follows:

%let nsub = 100 ; * number of subjects     ;                      
%let span =  30 ; * calculation span, days ;                      
                                                                
/* Step #1: ID the subjects */                                  
data sub ;                                                      
  set sub_1-sub_&nsub indsname = dsn ;                          
  ID + dsn ne lag (dsn) ;                                       
run ;                                                           
                                                                
/* Step #2: Compute max number of intervals for transposition */
data _null_ ;                                                   
  do _q = 1 by 1 until (last.id) ;                              
    do _n_ = 1 to &span until (last.id) ;                       
      set sub (keep = id) end = z ;                             
      by id ;                                                   
    end ;                                                       
  end ;                                                         
  retain q ;                                                    
  q = q max _q ;                                                
  if z then call symputx ("q", q) ;                             
run ;                                                           
                                                                
/* Step #3: Compute metrics and shape output */             
data want (keep = ID T_:) ;                                     
  do _q = 1 by 1 until (last.id) ;                              
    call missing (_N, _S) ;                                     
    do _n_ = 1 to &span until (last.id) ;                       
      set sub ;                                                 
      by id ;                                                   
      _N = sum (_N, N (T)) ;                                    
      _S = sum (_S, T) ;                                        
    end ;                                                       
    avg = divide (_S, _N) ;                                     
    cmp = 100 * divide (_N, &span) ;                            
    array T_avg [&q] ;                                          
    array T_cmp [&q] ;                                          
    T_avg [_q] = avg ;                                          
    T_cmp [_q] = cmp ;                                          
  end ;                                                         
run ;

HTH

Paul D.

abc44 · Posted 11-16-2018 09:50 AM

This worked perfectly - thanks SO much!!!!

SAS Programming

How to average values over a range of observations

Re: How to average values over a range of observations

Re: How to average values over a range of observations

Re: How to average values over a range of observations

Re: How to average values over a range of observations

Re: How to average values over a range of observations

Re: How to average values over a range of observations

Re: How to average values over a range of observations

Re: How to average values over a range of observations

Re: How to average values over a range of observations

Re: How to average values over a range of observations

Re: How to average values over a range of observations

above average

How to average a range observation ?

How to replace missing values with values from another observation by ...

Write column values from one observation to another

VA Report Example: Month Average versus Total Average

Follow Us

What is...

SAS Programming

Register Today!

SAS Training: Just a Click Away

Follow Us

What is...