Averaging multiple measurements within time interval

luch25 · Posted 02-09-2022 06:27 PM

Hello,

I'm looking for some advice on how to go about coding this. I have a dataset with longitudinal measurements of heart rate (HR) and respirations (resp). Each individual (ID) has multiple measurements. The three main things I need to do are:

By unique ID, identify measurements that are taken within 5 minutes of each other
For measurements that are NOT taken within 5 minutes of another measurement, retain the original value of HR and resp.
For measurements that are taken within 5 minutes of each other, apply these rules to produce the final dataset
1. If there are 2 measurements within the 5-minute interval, take the average of the measurements, as well as the datetime value to produce a new value
2. If there are >2 measurements within the 5-minute interval, delete the first value, and take the average fo the remaining measurements.

I've been able to write code for steps 2 and 3a, but having issues where step 1 doesn't run completely correct when there are >2 measurements within a 5 minute interval. And that also affects step 3b. Code for an example dataset, as well as the code I wrote, are provided below. I've also provided a description of "what's going on" for each individual below the code (sorry I can't get the datetime22.3 variable to import into SAS correctly using this mock dataset):

data have;
      input id datetime hr resp;
      format datetime datetime22.3;
      cards;
1 19MAR2009:13:15:00.000 90 16
1 19MAR2009:13:17:00.000 70 18
1 19MAR2009:13:18:00.000 80 17
1 19MAR2009:13.22:00.000 100 22
1 23MAR2009:10:36:00.000 83 21
2 29OCT2009:10:36:00.000 121 13
2 29OCT2009:10:38:00.000 83 14
2 08JUL2011:13:04:00.000 131 18
3 07SEP2016:14:26:59.000 50 12
3 06APR2017:13:39:00.000 76 18
3 08JUL2017:13:04:00.000 98 17
3 08JUL2017:13:08:00.000 96 18
3 24OCT2017:15:11:00.000 80 19
4 18APR2012:08:42:00.000 80 17
4 31OCT2012:10:36:00.000 96 19
4 31OCT2012:10:37:00.000 103 21
4 31OCT2012:10:38:00.000 105 25
4 31OCT2012:10:39:00.000 110 28
5 06JAN2008:05:37:00.000 98 17
5 15FEB2010:10:24:00.000 110 18
;
 
/*Take avg of measurements taken within 5 minutes of each other*/
DATA have2;                
SET have;                       
  _n_ = lag (datetime);               
  IF nmiss (_n_) THEN result = _n_;
    ELSE diff_datetime = _n_ - datetime;
  hr2 = lag(hr);
  resp2 = lag(resp);
  IF -300 <= diff_datetime <=0 THEN hr = ((hr + hr2)/2);
    ELSE IF diff_datetime <= -300 THEN hr_final = hr;
  IF -300 <= diff_datetime <=0 THEN resp_final = ((resp + resp2)/2);
    ELSE IF diff_datetime <= -300 THEN resp_final = resp;
  DROP result hr2 resp2;
RUN;  
 
data have3; set have2(keep=id diff_datetime); run;
 
/*Delete the preceding value after taking the average*/
data have2;
set have3;
  by id;
  if not first.id and -300 <= diff_datetime <=0 then do;
  point = _n_-1;
  modify have2 point=point;
  remove;
end;
run;

ID=1 has two "5 minute intervals". The first 3 measurements make up one interval (delete the one at 13:15 and take the average of 13:17 and 13:18. Then the 2nd, 3rd, and 4th measurements make up a second interval (delete the one at 13:17, and take the average of 13:18 and 13:22).
ID=2 has one "5 minute interval" with 2 measurements, so take the average of these two
ID=3 has two measurements within a "5 minute interval"
ID=4 has 4 measurements within a "5 minute interval". Delete the one at 10:36 and take the average of the remaining ones
ID=5 does not have any measurements within a 5 minute interval
And as you can see, I want to retain measurements even if they are not in the 5 minute interval!

Applying these rules, this is what I envision the final dataset looking like:

ID	Datetime (format=datetime22.3)	HR_final	Resp_final
1	19MAR2009:13:17:30.000	75	17.5
1	19MAR2009:13.19:00.000	83.3	19
1	23MAR2009:10:36:00.000	83	21
2	29OCT2009:10:37:00.000	102	13.5
2	08JUL2011:13:04:00.000	131	18
3	07SEP2016:14:26:59.000	50	12
3	06APR2017:13:39:00.000	76	18
3	08JUL2017:13:06:00.000	97	17.5
3	24OCT2017:15:11:00.000	80	19
4	18APR2012:08:42:00.000	80	17
4	31OCT2012:10:37:30.000	103.5	23.25
5	06JAN2008:05:37:00.000	98	17
5	15FEB2010:10:24:00.000	110	18

Any advice on how to code this would be greatly appreciated! If that means taking my code in a different direction, that's totally okay. Thank you in advance!

ballardw · Posted 02-09-2022 08:01 PM

First thing I would be tempted to do if this were my data would be to remove showing the artificial precision of the time portion of that value. You are recording to 1000ths of a second. Since your data is heart rate and respiration neither of those measures is started and completed within an interval that short. Basically that would mean changing the format to a datetime with no decimal portion at all. Using INFORMATS with decimals indicated may be a poor idea as well.

Your data step doesn't run as posted as you did not provide and Informat to read the data and the 4th record has a . instead of : after the hour component.

I would suggest that since the data step wants to use _n_ as an automatic variable it can lead to odd things when you create a variable by that name.

See if this creates the basic 5 minute intervals. All I am doing is incrementing a counter at the start of each new interval when the time is greater than 5 minutes from the previous measure for the same ID.

data have;
      input id datetime : datetime23. hr resp;
      format datetime datetime22.;
      cards;
1 19MAR2009:13:15:00.000 90 16
1 19MAR2009:13:17:00.000 70 18
1 19MAR2009:13:18:00.000 80 17
1 19MAR2009:13:22:00.000 100 22
1 23MAR2009:10:36:00.000 83 21
2 29OCT2009:10:36:00.000 121 13
2 29OCT2009:10:38:00.000 83 14
2 08JUL2011:13:04:00.000 131 18
3 07SEP2016:14:26:59.000 50 12
3 06APR2017:13:39:00.000 76 18
3 08JUL2017:13:04:00.000 98 17
3 08JUL2017:13:08:00.000 96 18
3 24OCT2017:15:11:00.000 80 19
4 18APR2012:08:42:00.000 80 17
4 31OCT2012:10:36:00.000 96 19
4 31OCT2012:10:37:00.000 103 21
4 31OCT2012:10:38:00.000 105 25
4 31OCT2012:10:39:00.000 110 28
5 06JAN2008:05:37:00.000 98 17
5 15FEB2010:10:24:00.000 110 18
;

data want;
   set have;
   by id ;
   retain timegroup;
   interval = intck('minute',lag(datetime),datetime,'C');
   if first.id then timegroup=1;
   else if interval>5 then timegroup+1;
run;

If you agree this is marking the intervals correctly then we can get to the next bits.

Note the use of the Function INTCK to deal with calculating the interval between successive records.

A similar counter could be used to count number of values within a timegroup. You apparently want a "look ahead counter" but I am not sure that you have actually adequately defined your "5 minute interval" to implement such.

Consider this data example and then tell us where the "5 minute interval boundaries" actually are.

data have2;
      input id datetime : datetime23. hr resp;
      format datetime datetime22.;
      cards;
1 19MAR2009:13:15:00.000 90 16
1 19MAR2009:13:17:00.000 70 18
1 19MAR2009:13:18:00.000 80 17
1 19MAR2009:13:22:00.000 100 22
1 19MAR2009:13:23:00.000 100 22
1 19MAR2009:13:24:00.000 100 22
1 19MAR2009:13:25:00.000 100 22
1 19MAR2009:13:26:00.000 100 22
1 19MAR2009:13:27:00.000 100 22
1 19MAR2009:10:28:00.000 83 21
;

I suspect once you get the timegroup values identified it would likely be a tad easier to us Proc Means to the actual averages instead of mucking about with data step code. Use the ID and Timegroup as CLASS variables and get the means of the measures.

luch25 · Posted 02-09-2022 09:36 PM

Hi @ballardw , thank you so much for the help! Yes, the code works to identify the intervals, but I think your data example shows why it's a bit more complicated. In your data example, there are actually multiple 5-minute windows. Observations 1, 2, 3 make up a single interval (times 13:15, 13:17, 13:18) and could be thought of as a timegroup=1. And then after that, another 5-minute interval begins (since the first started with time 13:15). Observations 4-8 make up another time interval (timegroup=2). And then observations 9-10 make up a timegroup=3 interval. Totally agree that after the timegroup variable is created, I can count the number of observations within a unique timegroup variable and then take the mean.

(As a note: I initially had a bit more complex situation in my first post where the time intervals could overlap, but I think this method would help simplify things. So my initial example won't directly align with this.)

Thank you again!!

Averaging multiple measurements within time interval

Re: Averaging multiple measurements within time interval

Re: Averaging multiple measurements within time interval

Averaging multiple measurements within time interval

Re: Averaging multiple measurements within time interval

Re: Averaging multiple measurements within time interval

Click image to register for webinar

Classroom Training Available!