SAS Programming

DATA Step, Macro, Functions and more
BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
abc44
Obsidian | Level 7

Hi everyone, 

 

I believe I have a relatively simple question, but I'm new to SAS so its been a bit daunting to try and figure it out. 

 

I have about 100 datasets that each contain a single column with step counts collected by a study subject for over one year (one observation/day, i.e. about 365 observations per person). I would like to take the average of the step counts every 30 days, and label each of these 30-day averages as T1, T2, T3 etc. 

 

I would then like to have a dataset that contains the T1, T2, T3 etc.. step average for each participant. 

 

Not all participants were good about recording their steps every day so there are some missing values. For each 30 day block, I would like to calculate compliance (#completed/30x100). 

 

Therefore I am hoping to go from many datasets that have approx 365 step count recordings in a single column to something like this:

 

ID    T1_avg   T2_avg  T3_avg ..... T1_compliance   T2_compliance   T3_compliance ...... 

1

2

3

4

5

6

 

Some participants went beyond the 1 yr mark and others finished earlier so not everyone has exactly 365 days of steps. Therefore, I may have some participants with T14, T15, T16, and even some with T17. 


Would anyone be able to help with this?

 

I really appreciate any tips you might have!

 

 

 

 

 

 

 

 

1 ACCEPTED SOLUTION

Accepted Solutions
PGStats
Opal | Level 21

Using standard SAS data manipulation methods:

 

data subject_1;
infile "&sasforum.\datasets\subject_1.txt" truncover firstobs=2;
input steps;
run;

data subject_2;
infile "&sasforum.\datasets\subject_2.txt" truncover firstobs=2;
input steps;
run;

data steps;
set work.subject_: indsname=dsn curobs=d;
subject = scan(dsn,2);
day = d;
run;

proc sql;
create table steps30 as
select
    subject,
    ceil(day/30) as period,
    mean(steps) as stepAverage,
    100 * count(steps)/ count(*) as compliance
from steps
group by subject, calculated period;
quit;

/* Transform to wide data format (not recommended) */
proc transpose data=steps30 out=stepsAve(drop=_name_) prefix=ave_;
by subject;
id period;
var stepAverage;
run;

proc transpose data=steps30 out=stepsComp(drop=_name_) prefix=comp_;
by subject;
id period;
var compliance;
run;

data want;
merge stepsAve stepsComp;
by subject;
run;
PG

View solution in original post

10 REPLIES 10
PGStats
Opal | Level 21

Please provide data (in a text or CSV format) for us to play with.

PG
abc44
Obsidian | Level 7

Here is 2 example cases with some fake data, but it looks just like what I am dealing with!

 

I'd like to know the average steps per 30 day block, and then the compliance for that block (#complete/30x100)

 

Let me know if anything is unclear! Thanks so much 

PGStats
Opal | Level 21

Using standard SAS data manipulation methods:

 

data subject_1;
infile "&sasforum.\datasets\subject_1.txt" truncover firstobs=2;
input steps;
run;

data subject_2;
infile "&sasforum.\datasets\subject_2.txt" truncover firstobs=2;
input steps;
run;

data steps;
set work.subject_: indsname=dsn curobs=d;
subject = scan(dsn,2);
day = d;
run;

proc sql;
create table steps30 as
select
    subject,
    ceil(day/30) as period,
    mean(steps) as stepAverage,
    100 * count(steps)/ count(*) as compliance
from steps
group by subject, calculated period;
quit;

/* Transform to wide data format (not recommended) */
proc transpose data=steps30 out=stepsAve(drop=_name_) prefix=ave_;
by subject;
id period;
var stepAverage;
run;

proc transpose data=steps30 out=stepsComp(drop=_name_) prefix=comp_;
by subject;
id period;
var compliance;
run;

data want;
merge stepsAve stepsComp;
by subject;
run;
PG
abc44
Obsidian | Level 7
This worked perfectly - thanks SO much!
abc44
Obsidian | Level 7

Hi again, 

 

I really appreciated your help with the code. It worked perfectly, but one thing I realized is that I need to exclude all the blank cells before the first observation, before I take the average of 30 day blocks. The location of this first observation varies quite a bit between participants. If someone started the study on the 30th of the month their first observation would start in the 30th cell (subject 1), whereas someone whose start date was on 3rd of the month will have their first observation in the 3rd cell (subject 2). I would like to only calculate the 30 day block from this first number. I can't simply remove all blank cells, since I don't want to exclude blank cells that occurred after the start date. Any idea how I can modify the code to calculate the average of the first 30 day block but only starting from the first observation?

 

Let me know if you have any ideas!

 

Thanks in advance!!

PGStats
Opal | Level 21

To remove the initial missing values, build the steps dataset with:

 

data steps;
retain day;
set work.subject_: indsname=dsn curobs=d;
subject = scan(dsn,2);
if d = 1 then call missing(day);
if missing(day) then do;
    if not missing(steps) then day = 0;
    end;
if not missing(day) then do;
    day = day + 1;
    output;
    end;
run;
PG
Reeza
Super User

1. Combine your data into a single file. If they have similar names you can do the following:

data combined;
set data1-data100 indsname=source;
dsn=source;
run;

2. Define your 30 day intervals
Not sure - you didn't indicate how day is captured in the data.


3. Run PROC MEANS to get the average based on 30 day intervals for each interval.
https://github.com/statgeek/SAS-Tutorials/blob/master/proc_means_basic.sas

 

That will have exactly what you want.

 


@abc44 wrote:

Hi everyone, 

 

I believe I have a relatively simple question, but I'm new to SAS so its been a bit daunting to try and figure it out. 

 

I have about 100 datasets that each contain a single column with step counts collected by a study subject for over one year (one observation/day, i.e. about 365 observations per person). I would like to take the average of the step counts every 30 days, and label each of these 30-day averages as T1, T2, T3 etc. 

 

I would then like to have a dataset that contains the T1, T2, T3 etc.. step average for each participant. 

 

Not all participants were good about recording their steps every day so there are some missing values. For each 30 day block, I would like to calculate compliance (#completed/30x100). 

 

Therefore I am hoping to go from many datasets that have approx 365 step count recordings in a single column to something like this:

 

ID    T1_avg   T2_avg  T3_avg ..... T1_compliance   T2_compliance   T3_compliance ...... 

1

2

3

4

5

6

 

Some participants went beyond the 1 yr mark and others finished earlier so not everyone has exactly 365 days of steps. Therefore, I may have some participants with T14, T15, T16, and even some with T17. 


Would anyone be able to help with this?

 

I really appreciate any tips you might have!

 

 

 

 

 

 

 

 


 

ballardw
Super User

You will find it much easier in the long run to create and use a long set such as

ID Period Average

1  T1        value

1  T2        Value

 

In fact I think you would like be much better off with the Period as a numeric instead of character value.

And are the blanks in your example data days without a measurement that you want counted for the compliance?

You need to make sure there is an actual ID value attached to each record. You do realize that your compliance will be wrong for the last 30 day period of the year with your definition. 365 or 366 yields 12 periods of 30 days and a 13th period of 5 or 6 days. Which even with 100 percent data will be 5/30 with your definition.

 

My take on this would look like:

data combined;
   set data1-data100 indsname=source curobs=cobs;
   dsn=source;
   day=cobs;
   period = int((day-1)/30) + 1;
run;

proc summary data=have nway;
   class dsn period;
   var t;
   output out=work.summary (drop=_:) mean= n= nmiss=/autoname;
run;

data work.compliance;
   set work.summary;
   compliance= (100 * t_n/(t_n +t_nmiss));
run;

This uses the count of values and missing values for the denominator in the compliance calculation.

 

If you want a report with the periods across then Proc Report will do that fine with the dsn (or id or what every you decide the source should be called) as a group variable and period an across variable with t_mean and compliance nested under.

hashman
Ammonite | Level 13

@abc44

Suppose that you've already converted your text files into the corresponding number of SAS data sets named sub_1, sub_2, ..., etc.; and each data set contains a single variable T indicating the number of steps. Then one way of getting the output you want is to proceed as follows:

 

%let nsub = 100 ; * number of subjects     ;                      
%let span =  30 ; * calculation span, days ;                      
                                                                
/* Step #1: ID the subjects */                                  
data sub ;                                                      
  set sub_1-sub_&nsub indsname = dsn ;                          
  ID + dsn ne lag (dsn) ;                                       
run ;                                                           
                                                                
/* Step #2: Compute max number of intervals for transposition */
data _null_ ;                                                   
  do _q = 1 by 1 until (last.id) ;                              
    do _n_ = 1 to &span until (last.id) ;                       
      set sub (keep = id) end = z ;                             
      by id ;                                                   
    end ;                                                       
  end ;                                                         
  retain q ;                                                    
  q = q max _q ;                                                
  if z then call symputx ("q", q) ;                             
run ;                                                           
                                                                
/* Step #3: Compute metrics and shape output */             
data want (keep = ID T_:) ;                                     
  do _q = 1 by 1 until (last.id) ;                              
    call missing (_N, _S) ;                                     
    do _n_ = 1 to &span until (last.id) ;                       
      set sub ;                                                 
      by id ;                                                   
      _N = sum (_N, N (T)) ;                                    
      _S = sum (_S, T) ;                                        
    end ;                                                       
    avg = divide (_S, _N) ;                                     
    cmp = 100 * divide (_N, &span) ;                            
    array T_avg [&q] ;                                          
    array T_cmp [&q] ;                                          
    T_avg [_q] = avg ;                                          
    T_cmp [_q] = cmp ;                                          
  end ;                                                         
run ;                                                           

HTH

 

Paul D.

abc44
Obsidian | Level 7
This worked perfectly - thanks SO much!!!!

sas-innovate-wordmark-2025-midnight.png

Register Today!

Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9. Sign up by March 14 for just $795.


Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 10 replies
  • 6908 views
  • 3 likes
  • 5 in conversation