Hi everyone,
this is my first question. So I hope I chose the right section + the question is not too simple.
I do have a question regarding averages. Here I am going to use the sashelp.class dataset to show what I want to do.
In the SAS dataset I do have Height and Weight. For each observation x, I would like to do the following:
Chose all observations, where the height is within a certain range of the height of observation x and take the average of the weights. So lets assume a kid (Carol) has the height 62.8. I would like to calculate the average weight for all kids with a height between (62.8-2) and (82.8+3). I want to add this to the class data set, so that each kid has a kind of "average weight" for his "hight class"
Thanks for your help!
You said you are new to the Community. Your problem is not for the beginner. I try to give a data step solution which is not as good as the hash solution. There is an inefficiency to search for the RANGE all over the records. If we sort the data set by age, then we can control the range and will not look records that has an age greater than the greater of the range. I am leaving it to you.
The steps are:
[1] Take each observation. Get the range of age.
[2] Using POINT= option, examine each observation whether its AGE falls in the Range. If so, accumulate height and count.
[3] At the ending of reading all data, output mean height.
data sample;
input Name $ Age Height;
datalines;
Max 10 120
Paul 12 140
Rick 15 130
Morti 8 90
Carl 9 100
Maria 14 150
Claus 13 160
Mario 6 50
Tina 11 100
;
run;
%let range = 1;
data want;
set sample nobs = num ;
loage = age - ⦥
hiage = age + ⦥
sumheight = 0; count = 0;
do p = 1 to num;
set sample(keep = age height rename=(height = ht age = age1)) point = p;
flag = (loage <= age1 <= hiage);
if flag > 0 then do;
sumheight + ht;
count + 1;
end;
end;
averageheight = sumheight / count;
drop age1 ht flag;
run;
Name Age Height loage hiage sumheight count averageheight
Max 10 120 9 11 320 3 106.667
Paul 12 140 11 13 400 3 133.333
Rick 15 130 14 16 280 2 140.000
Morti 8 90 7 9 190 2 95.000
Carl 9 100 8 10 310 3 103.333
Maria 14 150 13 15 440 3 146.667
Claus 13 160 12 14 450 3 150.000
Mario 6 50 5 7 50 1 50.000
Tina 11 100 10 12 360 3 120.000
Another hash solution for you.
%let range = 1;
data want;
if _n_ = 1 then do;
if 0 then set sample;
declare hash h(dataset:"sample", ordered: 'Y');
h.definekey('Age');
h.definedata('Height');
h.definedone();
end;
set sample;
lo_age = age - ⦥
hi_age = age + ⦥
sumheight = 0; count = 0;
do nage = lo_age to hi_age;
if h.find(key:nage) = 0 then do;
sumheight + height;
count + 1;
end;
end;
averageHeight = sumheight / count;
run;
Hope these answers are acceptable to you. As you are new, please see there is a wheel on the top right corner. Click the Arrow next to it. Choose to tick what you like.
Can you give your expected output just to make sure we're 100% clear on what you want?
First create a new dataset with the ranges and averages
The result would look something like this:
range_from range_to avg
10 29 13.6
30 39 38.2
40 99 66.4
Then join the original table with this new one on
age between range_from and range_to.
//Fredrik
Chose all observations, where the height is within a certain range of the height of observation x and take the average of the weights. So lets assume a kid (Carol) has the height 62.8. I would like to calculate the average weight for all kids with a height between (62.8-2) and (82.8+3). I want to add this to the class data set, so that each kid has a kind of "average weight" for his "hight class"
Is that a typo, the range should be (62.8-2) to (62.8+3)?? Or did you really mean (82.8+3)? We would need to know how you are determining these ranges. As the other responder in this thread has said, we need more detailed information about what you are doing, and probably the output you want.
/*Hey there, thanks for your quick responses.
And yes, it is a typo, sorry about that. I attached an example, which makes it more clear, I hope. I honestly have no idea on how to approach this */
data sample;
input Name $ Age Height;
datalines;
Max 10 120
Paul 12 140
Rick 15 130
Morti 8 90
Carl 9 100
Maria 14 150
Claus 13 160
Mario 6 50
Tina 11 100
;run;
/* For each Person I would like the average Height, but not only for each Person, but an average of the height for a given range around the age, such as Age-1 till Age+1.
The result should look like this:
Average for:
Max: (120 (Max) + 100 (Carl) + 100 (Tina)) / 3 =320/3
Paul: (140 (Paul) + 160 (Claus) + 100 (Tina)) / 3 =400/3
The range should be easy to change. It would be nice to simply define difference_to_lower_boundry (here 1) and difference_to_upper_boundry (here 1). To add these ranges I could add two variables/columns with the ranges:
*/
data sample;
set sample;
lower_Age_range=Age-1;
upper_Age_range=Age+1;
;run;
/*In my data the age could also be something like 10.2 or 9.43.
Thanks alot for your help!
Bernd*/
The only way I can think of to obtain dynamic ranges, which change for each person in the dataset, is to use a macro loop and brute force compute the range for each person in each iteration of the loop, and then compute the mean, and then append all the results together.
Perhaps someone can come up with a brilliant SQL way of doing this, but I cannot.
You said you are new to the Community. Your problem is not for the beginner. I try to give a data step solution which is not as good as the hash solution. There is an inefficiency to search for the RANGE all over the records. If we sort the data set by age, then we can control the range and will not look records that has an age greater than the greater of the range. I am leaving it to you.
The steps are:
[1] Take each observation. Get the range of age.
[2] Using POINT= option, examine each observation whether its AGE falls in the Range. If so, accumulate height and count.
[3] At the ending of reading all data, output mean height.
data sample;
input Name $ Age Height;
datalines;
Max 10 120
Paul 12 140
Rick 15 130
Morti 8 90
Carl 9 100
Maria 14 150
Claus 13 160
Mario 6 50
Tina 11 100
;
run;
%let range = 1;
data want;
set sample nobs = num ;
loage = age - ⦥
hiage = age + ⦥
sumheight = 0; count = 0;
do p = 1 to num;
set sample(keep = age height rename=(height = ht age = age1)) point = p;
flag = (loage <= age1 <= hiage);
if flag > 0 then do;
sumheight + ht;
count + 1;
end;
end;
averageheight = sumheight / count;
drop age1 ht flag;
run;
Name Age Height loage hiage sumheight count averageheight
Max 10 120 9 11 320 3 106.667
Paul 12 140 11 13 400 3 133.333
Rick 15 130 14 16 280 2 140.000
Morti 8 90 7 9 190 2 95.000
Carl 9 100 8 10 310 3 103.333
Maria 14 150 13 15 440 3 146.667
Claus 13 160 12 14 450 3 150.000
Mario 6 50 5 7 50 1 50.000
Tina 11 100 10 12 360 3 120.000
Another hash solution for you.
%let range = 1;
data want;
if _n_ = 1 then do;
if 0 then set sample;
declare hash h(dataset:"sample", ordered: 'Y');
h.definekey('Age');
h.definedata('Height');
h.definedone();
end;
set sample;
lo_age = age - ⦥
hi_age = age + ⦥
sumheight = 0; count = 0;
do nage = lo_age to hi_age;
if h.find(key:nage) = 0 then do;
sumheight + height;
count + 1;
end;
end;
averageHeight = sumheight / count;
run;
Hope these answers are acceptable to you. As you are new, please see there is a wheel on the top right corner. Click the Arrow next to it. Choose to tick what you like.
I understand you to want, for each observation of variable X, with value X{i}, the average of all observations with X between X{i}-delta and X{i}+delta, right? The problem, of course, is that you don't know in advance how many obs are within the range.
Here's a hash object solution:
%let delta=1;
%let var=weight;
data want;
set sashelp.class ;
if _n_=1 then do;
declare hash h (dataset:'sashelp.class',ordered:'A');
h.definekey("&var",'name');
h.definedata("&var");
h.definedone();
declare hiter ih ('h');
rc=h.output(dataset:'hash');
end;
orig_&var=&var;
sum=orig_&var;
n=1;
rc=ih.setcur(key:orig_&var,key:name);
do rc=ih.next() by 0 while (rc=0 and &var<=orig_&var+&delta);
sum=sum+&var;
n=n+1;
rc=ih.next();
end;
rc=ih.setcur(key:orig_&var,key:name);
do rc=ih.prev() by 0 while (rc=0 and &var>=orig_&var-&delta);
sum=sum+&var;
n=n+1;
rc=ih.prev();
end;
drop &var;
mean=sum/n;
run;
Notes:
If I understand the problem statement correctly, you want to take a record, select all the other records within a certain range (+/- <some amount> from a given variable), and derive the mean across those columns.
The hash object approach already listed will certainly give those results, and may perform better than my code below.
I was going to suggest an approach such as:
data want;
set have;
do i=1 to nobs;
set have (rename=(whatever=whatever2)) point=i nobs=nobs;
* your logic goes here ;
end;
run;
but it occurred to me that this is really just manually creating a Cartesian product, i.e. crossing each observation with all other observations.
Of course, SQL will create Cartesian products automatically for us.
(As an aside, a Cartesian product is a concept, not just something limited to PROC SQL. SQL makes it easy since it automatically does it for us, but a Cartesian product isn't limited to PROC SQL. For example, the hash object approach can yield the same results, you just have to write the code to do so. You could also use an index key lookup as well.)
You haven't commented on the size of your data; often a Cartesian product can be unwanted, or a mistake, and can result in a huge, potentially "runaway" table.
But, if that isn't an issue for you, then perhaps you can modify the below code to meet your needs.
Hope this helps...
* full Cartesian ;
proc sql;
create table test1 as
select
a.Name as A_Name
,b.Name as B_Name
,a.Sex as A_Sex
,b.Sex as B_Sex
,a.Age as A_Age
,b.Age as B_Age
,a.Height as A_Height
,b.Height as B_Height
,a.Weight as A_Weight
,b.Weight as B_Weight
from
sashelp.class a
,sashelp.class b
;
quit;
* Cartesian within sex ;
proc sql;
create table test2 as
select
a.Name as A_Name
,b.Name as B_Name
,a.Sex as A_Sex
,b.Sex as B_Sex
,a.Age as A_Age
,b.Age as B_Age
,a.Height as A_Height
,b.Height as B_Height
,a.Weight as A_Weight
,b.Weight as B_Weight
from
sashelp.class a
,sashelp.class b
where
a.sex=b.sex
;
quit;
* Cartesian within sex and height +/- 2 ;
proc sql;
create table test3 as
select
a.Name as A_Name
,b.Name as B_Name
,a.Sex as A_Sex
,b.Sex as B_Sex
,a.Age as A_Age
,b.Age as B_Age
,a.Height as A_Height
,b.Height as B_Height
,a.Weight as A_Weight
,b.Weight as B_Weight
from
sashelp.class a
,sashelp.class b
where
a.sex=b.sex
and
b.height between a.height-2 and a.height+2
;
quit;
* And remove the match against yourself ;
proc sql;
create table test4 as
select
a.Name as A_Name
,b.Name as B_Name
,a.Sex as A_Sex
,b.Sex as B_Sex
,a.Age as A_Age
,b.Age as B_Age
,a.Height as A_Height
,b.Height as B_Height
,a.Weight as A_Weight
,b.Weight as B_Weight
from
sashelp.class a
,sashelp.class b
where
a.sex=b.sex
and
b.height between a.height-2 and a.height+2
and
a.Name ne b.Name
;
quit;
* This is closer to your example... ;
* You could also drop the height columns, they are just for debugging ;
proc sql;
create table test5 as
select
a.Name as A_Name
,a.Height as A_Height
,b.Height as B_Height
,a.Weight as A_Weight
,b.Weight as B_Weight
from
sashelp.class a
,sashelp.class b
where
b.height between a.height-2 and a.height+2
and
a.Name ne b.Name
;
quit;
* Now get the mean ;
proc summary data=test5 nway;
class A_Name;
var B_Weight;
output out=means mean= / autoname;
run;
Sounds like you want something like this:
proc sql; create table want as select name ,height ,(select mean(weight) from sashelp.class where height between a.height-2 and a.height+3) as expected_weight ,weight from sashelp.class a ; quit;
- assuming that the range being H-2 to H+3 is not a typo.
My two cents: If you step back and think about what you are trying to accomplish, it sounds like you are looking for a local regression of degree 0. In the old days (1970s) there were various "kernel regression" algorithms that used a local kernel of a fixed bandwidth to determine a local regression function for bivariate data. The kernel regression methods suffered from a fundamental problem: when the bandwidth was too small the predicted curve would not reflect the data. Cleveland addressed this problem with his "LOcal Weighted EStimation" (or LOWESS) procedure (which eventually became LOESS), which uses a variable-width neighborhood of each point to determine the average weighting.
It turns out that degree=0 (piecewise constant) polynomials are not used much in practice, but PROC LOESS in SAS supports other degrees. I would suggest that you think about whether a nonparametric regression procedure such as PROC LOESS might accomplish your goals without requiring you to use the DATA step to write an algorithm that has fallen out of favor because it is known to have problems.
For more about PROC LOESS, including how to use DEGREE=0, if you want to, see:
Hi everyone,
finally back from the Oktoberfest! The responses are great!!! I am glad this question was not as simple as I thought. And there are quiet a few things I learned from this. To answer my initial question, you gave me at least three great solutions:
1. PROC SQL: Easy to understand from a database perspective
(please consider that in this solution an issue may arise with duplicates, where names might exist twice in the have-dataset. This should be easy to solve though)
2. DATA STEP: Good to understand and this prob is a good example to see how the data step goes through all observations step by step
3. Hash Object (within data step): New to me, but once understood, probably the most efficient approach
I will most likely use the hash solution, as it seems to be the most efficient (from what i read on the internet)
@KachiM: Thanks a lot! I learned a lot from this. I should have appr. 7,000-10,000 observations. The hash seems a good choice. Btw.: do i need the "if 0 then set sample" from your code?
@mkeintz: The code still is quite complex for me, but I have the feeling that with the setcur command, your code might be the most efficient code provided here. I need to look a bit more into the hash code to get it all.
@ScottBass @s_lassen: I really like your approaches and it helped me get more into the Cartesian product topic. I am curious on how it performs in terms of efficiency though.
@Rick_SAS: You pointed me towards a solution and topic, which addresses the very background of my question. I actually have around 10,000 data which can take values from 0 to 1 where most observations have a 0 or 1. I cannot explain all the variation and I cannot predict the 0 and 1 values, but predicting averages for groups would be great. I will have to look into this in more detail. Thanks!
Kind regards,
Bernd
One last post from my side:
If anyone is interested in the different approaches, attached you will find 4 solutions (two sql, one data set with loop, one with hash). These are all basically copied from the answers above (so not my solution). This post is only ment to provide a basis for otheres who want to go deeper into this topic (e.g. assess performance).
Kind regards and thanks again to everyone!
data sample;
input Name $ Age Height;
datalines;
Max 10 120
Paul 12 140
Rick 15 130
Morti 8 90
Carl 9 100
Maria 14 150
Claus 13 160
Mario 6 50
Tina 11 100
;
run;
%let range = 1;
/**********************START: SOLUTION USING PROC SQL***************************************/
proc sql;
create table test as
select
a.Name as A_Name
,a.Age as Age
,b.Age as Age
,a.Height as A_Height
,b.Height as B_Height
from
sample a
,sample b
where
b.Age between a.Age-&range. and a.Age+&range.
;
quit;
proc summary data=test nway;
class A_Name;
var B_Height;
output out=want_proc_sql mean= / autoname;
run;
proc delete data=test;
run;
/**********************END: SOLUTION USING PROC SQL***************************************/
/**********************START: SOLUTION USING PROC SQL***************************************/
proc sql;
create table want_proc_sql_2 as
select a.name
,a.Age
,(select mean(Height)
from sample
where Age between a.Age-&range and a.Age+&range)
as expected_Height
,a.Height
from sample a
;
quit;
/**********************END: SOLUTION USING PROC SQL***************************************/
/**********************START: SOLUTION USING DATA STEP***************************************/
data want_data_step;
set sample nobs = num ; /*num initialized and assigned number of observations (nobs)*/
loage = age - ⦥
hiage = age + ⦥
sumheight = 0; count = 0;
do p = 1 to num; /* for each observation in the new want data, these steps are executed NUM times*/
set sample(keep = age height rename=(height = ht age = age1)) point = p; /*p points at number of current observation (within the loop data set)*/
flag = (loage <= age1 <= hiage);
if flag > 0 then do;
sumheight + ht;
count + 1;
end;
end;
averageheight = sumheight / count;
drop age1 ht flag;
run;
/**********************END: SOLUTION USING DATA STEP***************************************/
/**********************START: SOLUTION USING Hash in Data Step***************************************/
data want_Hash;
if _n_ = 1 then do;
/*if 0 then set sample;*/
declare hash h(dataset:"sample", ordered: 'Y');
h.definekey('Age');
h.definedata('Height');
h.definedone();
end;
set sample;
lo_age = age - ⦥
hi_age = age + ⦥
sumheight = 0; count = 0;
do nage = lo_age to hi_age;
if h.find(key:nage) = 0 then do;
sumheight + height;
count + 1;
end;
end;
averageHeight = sumheight / count;
run;
/**********************END: SOLUTION USING Hash in Data Step***************************************/
/**********************RESULT***************************************/
proc sql;
create table Sample_Result as
select X1.Name
,X2.averageheight as Average_Data_Step
,X3.averageHeight as Average_Hash
,X4.B_Height_Mean as Average_Proc_Sql
,X5.expected_height as Average_Proc_Sql_2
from Sample X1
left join Want_data_step X2 on X1.Name=X2.Name
left join Want_hash X3 on X1.Name=X3.Name
left join Want_proc_sql X4 on X1.Name=X4.A_Name
left join Want_proc_sql_2 X5 on X1.Name=X5.Name
;quit;
/**********************RESULT***************************************/
I really like your approaches and it helped me get more into the Cartesian product topic. I am curious on how it performs in terms of efficiency though.
And my last reply on this subject as well 🙂
The hash object approach will perform better. The beauty of it is 1) no sorting of the data (Google the "_method" option for proc sql), and 2) the hash object is in memory, so once the hash object is loaded, data access is very fast.
However, there is often a trade off between performance and simplicity/code maintenance/etc...as long as the performance is "good enough". If the hash object approach runs in say 5 seconds and the proc sql approach runs in 15 seconds, is that "good enough"? And a very different answer if the hash object takes 5 seconds and proc sql takes 1 hour.
Finally, in general, I usually prefer not to "roll my own algorithm", if SAS already provides a solution. I won't say there aren't exceptions, esp. for simple algorithms like sum or mean, and esp. when coding my own algorithm gives much better performance.
But say you wanted the P-value or STD of matching observations - how would your approach change then?
It's a hypothetical situation, since you want the mean, which is easy to calculate, but something to keep in mind.
(Hope Oktoberfest was a blast...)
I would only add one other consideration. What if this process needs to be done for, say, 10 variables?
In the hash object case, you would keep most of the code as is, managing a hash object for each of the 10 vars. But you would not need to code ten hash declarations. Instead you'd declare a hash object once (using the H=_new_ hash() ... statement) in a loop iterating ten times. It would be tracked by a hash-of-hashes. The same loop would be used for generating the means. The moment you've coded to accommodate two variables, it would be easily scaled to much larger N of variables.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.