Solved: Re: Running average for range of observations

BerndSE · Posted 09-29-2017 05:20 AM

Hi everyone,

this is my first question. So I hope I chose the right section + the question is not too simple.

I do have a question regarding averages. Here I am going to use the sashelp.class dataset to show what I want to do.

In the SAS dataset I do have Height and Weight. For each observation x, I would like to do the following:

Chose all observations, where the height is within a certain range of the height of observation x and take the average of the weights. So lets assume a kid (Carol) has the height 62.8. I would like to calculate the average weight for all kids with a height between (62.8-2) and (82.8+3). I want to add this to the class data set, so that each kid has a kind of "average weight" for his "hight class"

Thanks for your help!

BerndSE · Posted 09-29-2017 11:35 PM

BerndSE

You said you are new to the Community. Your problem is not for the beginner. I try to give a data step solution which is not as good as the hash solution. There is an inefficiency to search for the RANGE all over the records. If we sort the data set by age, then we can control the range and will not look records that has an age greater than the greater of the range. I am leaving it to you.

The steps are:

[1] Take each observation. Get the range of age.

[2] Using POINT= option, examine each observation whether its AGE falls in the Range. If so, accumulate height and count.

[3] At the ending of reading all data, output mean height.

data sample;
input Name $ Age Height;
datalines;
Max 10 120
Paul 12 140
Rick 15 130
Morti 8 90
Carl 9 100
Maria 14 150
Claus 13 160
Mario 6 50
Tina 11 100
;
run;

%let range = 1;
data want;
   set sample nobs = num ;
   loage = age - &range;
   hiage = age + &range;
   sumheight = 0; count = 0;

   do p = 1 to num;
      set sample(keep = age height rename=(height = ht age = age1)) point = p;
      flag = (loage <= age1 <= hiage);
      if flag > 0 then do;
         sumheight + ht;
         count + 1;
      end;
   end;
   averageheight = sumheight / count;
drop age1 ht flag;
run;

Name     Age    Height    loage    hiage    sumheight    count    averageheight

Max       10      120        9       11        320         3         106.667
Paul      12      140       11       13        400         3         133.333
Rick      15      130       14       16        280         2         140.000
Morti      8       90        7        9        190         2          95.000
Carl       9      100        8       10        310         3         103.333
Maria     14      150       13       15        440         3         146.667
Claus     13      160       12       14        450         3         150.000
Mario      6       50        5        7         50         1          50.000
Tina      11      100       10       12        360         3         120.000

Another hash solution for you.

%let range = 1;
data want;
   if _n_ = 1 then do;
      if 0 then set sample;
      declare hash h(dataset:"sample", ordered: 'Y');
      h.definekey('Age');
      h.definedata('Height');
      h.definedone();
   end;
   set sample;
   lo_age = age - &range;
   hi_age = age + &range;
   sumheight = 0; count = 0;
   do nage = lo_age to hi_age;
      if h.find(key:nage) = 0 then do;
         sumheight + height;
         count + 1;
      end;
   end;
   averageHeight = sumheight / count;
run;

Hope these answers are acceptable to you. As you are new, please see there is a wheel on the top right corner. Click the Arrow next to it. Choose to tick what you like.

View solution in original post

ChrisBrooks · Posted 09-29-2017 05:24 AM

Can you give your expected output just to make sure we're 100% clear on what you want?

FredrikE · Posted 09-29-2017 06:15 AM

First create a new dataset with the ranges and averages

The result would look something like this:

range_from range_to avg

10 29 13.6

30 39 38.2

40 99 66.4

Then join the original table with this new one on

age between range_from and range_to.

//Fredrik

PaigeMiller · Posted 09-29-2017 08:07 AM

Chose all observations, where the height is within a certain range of the height of observation x and take the average of the weights. So lets assume a kid (Carol) has the height 62.8. I would like to calculate the average weight for all kids with a height between (62.8-2) and (82.8+3). I want to add this to the class data set, so that each kid has a kind of "average weight" for his "hight class"

Is that a typo, the range should be (62.8-2) to (62.8+3)?? Or did you really mean (82.8+3)? We would need to know how you are determining these ranges. As the other responder in this thread has said, we need more detailed information about what you are doing, and probably the output you want.

--
Paige Miller

BerndSE · Posted 09-29-2017 10:37 AM

/*Hey there, thanks for your quick responses.
And yes, it is a typo, sorry about that. I attached an example, which makes it more clear, I hope. I honestly have no idea on how to approach this */

data sample;
input Name $ Age Height;
datalines;
Max 10 120
Paul 12 140
Rick 15 130
Morti 8 90
Carl 9 100
Maria 14 150
Claus 13 160
Mario 6 50
Tina 11 100
;run;
/* For each Person I would like the average Height, but not only for each Person, but an average of the height for a given range around the age, such as Age-1 till Age+1.
The result should look like this:
Average for:
Max: (120 (Max) + 100 (Carl) + 100 (Tina)) / 3 =320/3
Paul: (140 (Paul) + 160 (Claus) + 100 (Tina)) / 3 =400/3
The range should be easy to change. It would be nice to simply define difference_to_lower_boundry (here 1) and difference_to_upper_boundry (here 1). To add these ranges I could add two variables/columns with the ranges:
*/

data sample;
set sample;
lower_Age_range=Age-1;
upper_Age_range=Age+1;
;run;

/*In my data the age could also be something like 10.2 or 9.43.

Thanks alot for your help!

Bernd*/

PaigeMiller · Posted 09-29-2017 11:54 AM

The only way I can think of to obtain dynamic ranges, which change for each person in the dataset, is to use a macro loop and brute force compute the range for each person in each iteration of the loop, and then compute the mean, and then append all the results together.

Perhaps someone can come up with a brilliant SQL way of doing this, but I cannot.

--
Paige Miller

BerndSE · Posted 09-29-2017 11:35 PM

BerndSE

You said you are new to the Community. Your problem is not for the beginner. I try to give a data step solution which is not as good as the hash solution. There is an inefficiency to search for the RANGE all over the records. If we sort the data set by age, then we can control the range and will not look records that has an age greater than the greater of the range. I am leaving it to you.

The steps are:

[1] Take each observation. Get the range of age.

[2] Using POINT= option, examine each observation whether its AGE falls in the Range. If so, accumulate height and count.

[3] At the ending of reading all data, output mean height.

data sample;
input Name $ Age Height;
datalines;
Max 10 120
Paul 12 140
Rick 15 130
Morti 8 90
Carl 9 100
Maria 14 150
Claus 13 160
Mario 6 50
Tina 11 100
;
run;

%let range = 1;
data want;
   set sample nobs = num ;
   loage = age - &range;
   hiage = age + &range;
   sumheight = 0; count = 0;

   do p = 1 to num;
      set sample(keep = age height rename=(height = ht age = age1)) point = p;
      flag = (loage <= age1 <= hiage);
      if flag > 0 then do;
         sumheight + ht;
         count + 1;
      end;
   end;
   averageheight = sumheight / count;
drop age1 ht flag;
run;

Name     Age    Height    loage    hiage    sumheight    count    averageheight

Max       10      120        9       11        320         3         106.667
Paul      12      140       11       13        400         3         133.333
Rick      15      130       14       16        280         2         140.000
Morti      8       90        7        9        190         2          95.000
Carl       9      100        8       10        310         3         103.333
Maria     14      150       13       15        440         3         146.667
Claus     13      160       12       14        450         3         150.000
Mario      6       50        5        7         50         1          50.000
Tina      11      100       10       12        360         3         120.000

Another hash solution for you.

%let range = 1;
data want;
   if _n_ = 1 then do;
      if 0 then set sample;
      declare hash h(dataset:"sample", ordered: 'Y');
      h.definekey('Age');
      h.definedata('Height');
      h.definedone();
   end;
   set sample;
   lo_age = age - &range;
   hi_age = age + &range;
   sumheight = 0; count = 0;
   do nage = lo_age to hi_age;
      if h.find(key:nage) = 0 then do;
         sumheight + height;
         count + 1;
      end;
   end;
   averageHeight = sumheight / count;
run;

Hope these answers are acceptable to you. As you are new, please see there is a wheel on the top right corner. Click the Arrow next to it. Choose to tick what you like.

mkeintz · Posted 09-29-2017 02:54 PM

I understand you to want, for each observation of variable X, with value X{i}, the average of all observations with X between X{i}-delta and X{i}+delta, right? The problem, of course, is that you don't know in advance how many obs are within the range.

Here's a hash object solution:

%let delta=1;
%let var=weight;

data want;
  set sashelp.class ;

  if _n_=1 then do;
    declare hash h (dataset:'sashelp.class',ordered:'A');
      h.definekey("&var",'name');
      h.definedata("&var");
      h.definedone();
    declare hiter ih ('h');
    rc=h.output(dataset:'hash');
  end;

  orig_&var=&var;

  sum=orig_&var;
  n=1;

  rc=ih.setcur(key:orig_&var,key:name);
  do rc=ih.next() by 0 while (rc=0 and &var<=orig_&var+&delta);
    sum=sum+&var;
    n=n+1;
    rc=ih.next();
  end;

  rc=ih.setcur(key:orig_&var,key:name);
  do rc=ih.prev() by 0 while (rc=0 and &var>=orig_&var-&delta);
    sum=sum+&var;
    n=n+1;
    rc=ih.prev();
  end;
  
  drop &var;
  mean=sum/n;
run;

Notes:

The definekey method adds enough variables to the variable of interest to make the key uniquely identify a single observation (if you have to deal with duplicates, you could add come complexity to the hash object methods. But there is no need for these other vars to be in the definedata method.
The setcur method sets the hash iterator to point to the data item corresponding to the current obs. This allows the ih.next() and ih.prev() methods to get the nearest "neighbors"

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

ScottBass · Posted 10-03-2017 03:29 AM

If I understand the problem statement correctly, you want to take a record, select all the other records within a certain range (+/- <some amount> from a given variable), and derive the mean across those columns.

The hash object approach already listed will certainly give those results, and may perform better than my code below.

I was going to suggest an approach such as:

data want;
   set have;
   do i=1 to nobs;
      set have (rename=(whatever=whatever2)) point=i nobs=nobs;
      * your logic goes here ;
   end;
run;

but it occurred to me that this is really just manually creating a Cartesian product, i.e. crossing each observation with all other observations.

Of course, SQL will create Cartesian products automatically for us.

(As an aside, a Cartesian product is a concept, not just something limited to PROC SQL. SQL makes it easy since it automatically does it for us, but a Cartesian product isn't limited to PROC SQL. For example, the hash object approach can yield the same results, you just have to write the code to do so. You could also use an index key lookup as well.)

You haven't commented on the size of your data; often a Cartesian product can be unwanted, or a mistake, and can result in a huge, potentially "runaway" table.

But, if that isn't an issue for you, then perhaps you can modify the below code to meet your needs.

Hope this helps...

* full Cartesian ;
proc sql;
   create table test1 as
   select
       a.Name     as A_Name
      ,b.Name     as B_Name
      ,a.Sex      as A_Sex
      ,b.Sex      as B_Sex      
      ,a.Age      as A_Age
      ,b.Age      as B_Age
      ,a.Height   as A_Height
      ,b.Height   as B_Height
      ,a.Weight   as A_Weight
      ,b.Weight   as B_Weight
   from
       sashelp.class a
      ,sashelp.class b
   ;
quit;

* Cartesian within sex ;
proc sql;
   create table test2 as
   select
       a.Name     as A_Name
      ,b.Name     as B_Name
      ,a.Sex      as A_Sex
      ,b.Sex      as B_Sex      
      ,a.Age      as A_Age
      ,b.Age      as B_Age
      ,a.Height   as A_Height
      ,b.Height   as B_Height
      ,a.Weight   as A_Weight
      ,b.Weight   as B_Weight
   from
       sashelp.class a
      ,sashelp.class b
   where
      a.sex=b.sex
   ;
quit;

* Cartesian within sex and height +/- 2 ;
proc sql;
   create table test3 as
   select
       a.Name     as A_Name
      ,b.Name     as B_Name
      ,a.Sex      as A_Sex
      ,b.Sex      as B_Sex      
      ,a.Age      as A_Age
      ,b.Age      as B_Age
      ,a.Height   as A_Height
      ,b.Height   as B_Height
      ,a.Weight   as A_Weight
      ,b.Weight   as B_Weight
   from
       sashelp.class a
      ,sashelp.class b
   where
      a.sex=b.sex
      and
      b.height between a.height-2 and a.height+2
   ;
quit;

* And remove the match against yourself ;
proc sql;
   create table test4 as
   select
       a.Name     as A_Name
      ,b.Name     as B_Name
      ,a.Sex      as A_Sex
      ,b.Sex      as B_Sex      
      ,a.Age      as A_Age
      ,b.Age      as B_Age
      ,a.Height   as A_Height
      ,b.Height   as B_Height
      ,a.Weight   as A_Weight
      ,b.Weight   as B_Weight
   from
       sashelp.class a
      ,sashelp.class b
   where
      a.sex=b.sex
      and
      b.height between a.height-2 and a.height+2
      and
      a.Name ne b.Name
   ;
quit;

* This is closer to your example... ;
* You could also drop the height columns, they are just for debugging ;
proc sql;
   create table test5 as
   select
       a.Name     as A_Name
      ,a.Height   as A_Height
      ,b.Height   as B_Height
      ,a.Weight   as A_Weight
      ,b.Weight   as B_Weight
   from
       sashelp.class a
      ,sashelp.class b
   where
      b.height between a.height-2 and a.height+2
      and
      a.Name ne b.Name
   ;
quit;

* Now get the mean ;
proc summary data=test5 nway;
   class A_Name;
   var B_Weight;
   output out=means mean= / autoname;
run;

Please post your question as a self-contained data step in the form of "have" (source) and "want" (desired results).
I won't contribute to your post if I can't cut-and-paste your syntactically correct code into SAS.

s_lassen · Posted 10-03-2017 04:51 AM

@BerndSE

Sounds like you want something like this:

proc sql;
   create table want as
   select
       name
      ,height
      ,(select mean(weight) from sashelp.class where
         height between a.height-2 and a.height+3) as expected_weight
      ,weight
   from
       sashelp.class a
   ;
quit;

- assuming that the range being H-2 to H+3 is not a typo.

Rick_SAS · Posted 10-03-2017 09:18 AM

My two cents: If you step back and think about what you are trying to accomplish, it sounds like you are looking for a local regression of degree 0. In the old days (1970s) there were various "kernel regression" algorithms that used a local kernel of a fixed bandwidth to determine a local regression function for bivariate data. The kernel regression methods suffered from a fundamental problem: when the bandwidth was too small the predicted curve would not reflect the data. Cleveland addressed this problem with his "LOcal Weighted EStimation" (or LOWESS) procedure (which eventually became LOESS), which uses a variable-width neighborhood of each point to determine the average weighting.

It turns out that degree=0 (piecewise constant) polynomials are not used much in practice, but PROC LOESS in SAS supports other degrees. I would suggest that you think about whether a nonparametric regression procedure such as PROC LOESS might accomplish your goals without requiring you to use the DATA step to write an algorithm that has fallen out of favor because it is known to have problems.

For more about PROC LOESS, including how to use DEGREE=0, if you want to, see:

BerndSE · Posted 10-05-2017 10:11 AM

Hi everyone,

finally back from the Oktoberfest! The responses are great!!! I am glad this question was not as simple as I thought. And there are quiet a few things I learned from this. To answer my initial question, you gave me at least three great solutions:

1. PROC SQL: Easy to understand from a database perspective
(please consider that in this solution an issue may arise with duplicates, where names might exist twice in the have-dataset. This should be easy to solve though)

2. DATA STEP: Good to understand and this prob is a good example to see how the data step goes through all observations step by step

3. Hash Object (within data step): New to me, but once understood, probably the most efficient approach

I will most likely use the hash solution, as it seems to be the most efficient (from what i read on the internet)

@KachiM: Thanks a lot! I learned a lot from this. I should have appr. 7,000-10,000 observations. The hash seems a good choice. Btw.: do i need the "if 0 then set sample" from your code?
@mkeintz: The code still is quite complex for me, but I have the feeling that with the setcur command, your code might be the most efficient code provided here. I need to look a bit more into the hash code to get it all.

@ScottBass @s_lassen: I really like your approaches and it helped me get more into the Cartesian product topic. I am curious on how it performs in terms of efficiency though.
@Rick_SAS: You pointed me towards a solution and topic, which addresses the very background of my question. I actually have around 10,000 data which can take values from 0 to 1 where most observations have a 0 or 1. I cannot explain all the variation and I cannot predict the 0 and 1 values, but predicting averages for groups would be great. I will have to look into this in more detail. Thanks!

Kind regards,
Bernd

BerndSE · Posted 10-05-2017 10:24 AM

One last post from my side:

If anyone is interested in the different approaches, attached you will find 4 solutions (two sql, one data set with loop, one with hash). These are all basically copied from the answers above (so not my solution). This post is only ment to provide a basis for otheres who want to go deeper into this topic (e.g. assess performance).

Kind regards and thanks again to everyone!

data sample;
input Name $ Age Height;
datalines;
Max 10 120
Paul 12 140
Rick 15 130
Morti 8 90
Carl 9 100
Maria 14 150
Claus 13 160
Mario 6 50
Tina 11 100
;
run;
%let range = 1;


/**********************START: SOLUTION USING PROC SQL***************************************/
proc sql;
   create table test as
   select
       a.Name     as A_Name
      ,a.Age   	  as Age
      ,b.Age   	  as Age
      ,a.Height   as A_Height
      ,b.Height   as B_Height
   from
       sample a
      ,sample b
   where
      b.Age between a.Age-&range. and a.Age+&range.
   ;
quit;
proc summary data=test nway;
   class A_Name;
   var B_Height;
   output out=want_proc_sql mean= / autoname;
run;
proc delete data=test;
run;
/**********************END: SOLUTION USING PROC SQL***************************************/


/**********************START: SOLUTION USING PROC SQL***************************************/
proc sql;
   create table want_proc_sql_2 as
   select	a.name
      		,a.Age
			,(select mean(Height) 
				from sample 
				where Age between a.Age-&range and a.Age+&range) 
				as expected_Height
	      	,a.Height
   from		sample a
   ;
quit;
/**********************END: SOLUTION USING PROC SQL***************************************/


/**********************START: SOLUTION USING DATA STEP***************************************/
data want_data_step;
   set sample nobs = num ; /*num initialized and assigned number of observations (nobs)*/
   loage = age - &range;
   hiage = age + &range;
   sumheight = 0; count = 0;

   do p = 1 to num; /* for each observation in the new want data, these steps are executed NUM times*/
      set sample(keep = age height rename=(height = ht age = age1)) point = p; /*p points at number of current observation (within the loop data set)*/
      flag = (loage <= age1 <= hiage);
      if flag > 0 then do;
         sumheight + ht;
         count + 1;
      end;
   end;
   averageheight = sumheight / count;
drop age1 ht flag;
run;
/**********************END: SOLUTION USING DATA STEP***************************************/



/**********************START: SOLUTION USING Hash in Data Step***************************************/
data want_Hash;
   if _n_ = 1 then do;
      /*if 0 then set sample;*/
      declare hash h(dataset:"sample", ordered: 'Y');
      h.definekey('Age');
      h.definedata('Height');
      h.definedone();
   end;
   set sample;
   lo_age = age - &range;
   hi_age = age + &range;
   sumheight = 0; count = 0;
   do nage = lo_age to hi_age;
      if h.find(key:nage) = 0 then do;
         sumheight + height;
         count + 1;
      end;
   end;
   averageHeight = sumheight / count;
run;
/**********************END: SOLUTION USING Hash in Data Step***************************************/

/**********************RESULT***************************************/
proc sql;
	create table Sample_Result as
	select	X1.Name
			,X2.averageheight as Average_Data_Step
			,X3.averageHeight as Average_Hash
			,X4.B_Height_Mean as Average_Proc_Sql
			,X5.expected_height as Average_Proc_Sql_2
	from		Sample	X1
	left join	Want_data_step 	X2 on X1.Name=X2.Name
	left join	Want_hash 		X3 on X1.Name=X3.Name
	left join	Want_proc_sql 	X4 on X1.Name=X4.A_Name
	left join	Want_proc_sql_2	X5 on X1.Name=X5.Name
;quit;
/**********************RESULT***************************************/

ScottBass · Posted 10-05-2017 08:28 PM

I really like your approaches and it helped me get more into the Cartesian product topic. I am curious on how it performs in terms of efficiency though.

And my last reply on this subject as well 🙂

The hash object approach will perform better. The beauty of it is 1) no sorting of the data (Google the "_method" option for proc sql), and 2) the hash object is in memory, so once the hash object is loaded, data access is very fast.

However, there is often a trade off between performance and simplicity/code maintenance/etc...as long as the performance is "good enough". If the hash object approach runs in say 5 seconds and the proc sql approach runs in 15 seconds, is that "good enough"? And a very different answer if the hash object takes 5 seconds and proc sql takes 1 hour.

Finally, in general, I usually prefer not to "roll my own algorithm", if SAS already provides a solution. I won't say there aren't exceptions, esp. for simple algorithms like sum or mean, and esp. when coding my own algorithm gives much better performance.

But say you wanted the P-value or STD of matching observations - how would your approach change then?

It's a hypothetical situation, since you want the mean, which is easy to calculate, but something to keep in mind.

(Hope Oktoberfest was a blast...)

Please post your question as a self-contained data step in the form of "have" (source) and "want" (desired results).
I won't contribute to your post if I can't cut-and-paste your syntactically correct code into SAS.

mkeintz · Posted 10-05-2017 09:04 PM

I would only add one other consideration. What if this process needs to be done for, say, 10 variables?

In the hash object case, you would keep most of the code as is, managing a hash object for each of the 10 vars. But you would not need to code ten hash declarations. Instead you'd declare a hash object once (using the H=_new_ hash() ... statement) in a loop iterating ten times. It would be tracked by a hash-of-hashes. The same loop would be used for generating the means. The moment you've coded to accommodate two variables, it would be easily scaled to much larger N of variables.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away