Hello,
I am struggling with the groups' statistics in not grouped data set, I would be grateful if anyone please tell me an efficient way to count stats like std, mean, max by a group and add it to the original dataset. I found this http://support.sas.com/resources/papers/proceedings11/019-2011.pdf, but they show how to count or sum not mentioned std, max, mean. The other problem is efficiency, I have to create over 500 variables like mentioned one, working with more than 20 large tables(over 10 mln obs in each) so sorting and merging is time-consuming(SQL package process everything in about 15h, nevertheless there is need to rewrite everything into 4GL, SQL Oracle can hadle this problem with overt (partition by)).
In the table below, you will find desirable results.
Thanks.
Name | Sex | Age | Weight | Height | mean_age_by_sex | std_weight_by_age |
Philip | M | 16 | 72 | 150 | 13,4 | 0 |
Janet | F | 15 | 62,5 | 112,5 | 13,22 | 1,81573 |
Mary | F | 15 | 66,5 | 112 | 13,22 | 1,81573 |
Ronald | M | 15 | 67 | 133 | 13,4 | 1,81573 |
William | M | 15 | 66,5 | 112 | 13,4 | 1,81573 |
Carol | F | 14 | 62,8 | 102,5 | 13,22 | 2,425902 |
Judy | F | 14 | 64,3 | 90 | 13,22 | 2,425902 |
Alfred | M | 14 | 69 | 112,5 | 13,4 | 2,425902 |
Henry | M | 14 | 63,5 | 102,5 | 13,4 | 2,425902 |
Alice | F | 13 | 56,5 | 84 | 13,22 | 3,670907 |
Barbara | F | 13 | 65,3 | 98 | 13,22 | 3,670907 |
Jeffrey | M | 13 | 62,5 | 84 | 13,4 | 3,670907 |
Jane | F | 12 | 59,8 | 84,5 | 13,22 | 2,949305 |
Louise | F | 12 | 56,3 | 77 | 13,22 | 2,949305 |
James | M | 12 | 57,3 | 83 | 13,4 | 2,949305 |
John | M | 12 | 59 | 99,5 | 13,4 | 2,949305 |
Robert | M | 12 | 64,8 | 128 | 13,4 | 2,949305 |
Joyce | F | 11 | 51,3 | 50,5 | 13,22 | 3,1 |
Thomas | M | 11 | 57,5 | 85 | 13,4 | 3,1 |
Any time I see a request to merge summary values into a data set I want to know how the resulting data set is to be used.
Next, with the same question, comes from "I have to create over 500 variables like mentioned one". Is this 500 in a single set or across the 20 (adding 25 variables to a single set makes much more sense than 500 to each of 20 sets).
How you use the result can make a difference on approachs. For some things I would tend to make a custom format and attach that to a variable for reporting purposes. For example, your "mean_age_by_sex" variable is going to take exactly 2 values in any given data set.
So an additional variable is really not needed as if the subject is male or female you already know the value.
It would not hurt to mention that you started with SASHELP.CLASS for example data so folks can work with the data.
with that said:
proc summary data=sashelp.class; class sex age; var height weight; output out=classsummary mean= max= min= std= median= range= /autoname; run;
Will in a single pass through the data provide the requested statistics for 1) All records 2) by sex 3) by age and 4) by age and sex.
The autoname appends the statistic to the name of the variable. There will be a variable in the output set named _type_. It will have values of 0 to 3 in this case and indicates which combinations of group variables are used to make the statistic.
I would be strongly tempted to not merge the statistics onto the data until I am actually using them as then I can indicate which _type_ to use and match on the appropriate class variable(s) or none at all. But that's me.
Any time I see a request to merge summary values into a data set I want to know how the resulting data set is to be used.
Next, with the same question, comes from "I have to create over 500 variables like mentioned one". Is this 500 in a single set or across the 20 (adding 25 variables to a single set makes much more sense than 500 to each of 20 sets).
How you use the result can make a difference on approachs. For some things I would tend to make a custom format and attach that to a variable for reporting purposes. For example, your "mean_age_by_sex" variable is going to take exactly 2 values in any given data set.
So an additional variable is really not needed as if the subject is male or female you already know the value.
It would not hurt to mention that you started with SASHELP.CLASS for example data so folks can work with the data.
with that said:
proc summary data=sashelp.class; class sex age; var height weight; output out=classsummary mean= max= min= std= median= range= /autoname; run;
Will in a single pass through the data provide the requested statistics for 1) All records 2) by sex 3) by age and 4) by age and sex.
The autoname appends the statistic to the name of the variable. There will be a variable in the output set named _type_. It will have values of 0 to 3 in this case and indicates which combinations of group variables are used to make the statistic.
I would be strongly tempted to not merge the statistics onto the data until I am actually using them as then I can indicate which _type_ to use and match on the appropriate class variable(s) or none at all. But that's me.
1. Resulting data set is going to be use as analytical base table customer 360 customer view*.
2. Current SQL package build about 20 ABT tables from diffrent point of view and merge them into one ABT 360 customer view*.
I have no idea what " base table customer 360 degree" means.
What specific types of analysis are contemplated? I am not seeing an advantage to all of the redundant information in the proposed structure especially with such large numbers of records.
And is "ABT table" supposed to be something special or is that local to your organization jargon?
Would it be possible to post a few rows of one of these final tables?
No that`s no possible(personal details), unfortunately.
ABT is commonly used for predictive models, especially in the banking sector or telecommunication industry.
https://en.wikipedia.org/wiki/Analytical_base_table
Nevertheless, thank you for your help. I am going to use proc summary.
I am not a banking expert and only a basic modeller but this sounds like one approach, depending on how many individuals you have, would be
1) summarize from the entire data such as the Proc Summary shown.
2) select a sample from the main data, maybe a 20%, or enough to give a reasonable shot at developing a model
3) test the model with the remaining data.
If the time crunch is the mergin this may help.
write multiple SQL statement, everyone correspond to a statisical .
proc sql;
create table a as
select *,mean(age) as mean_age_by_sex
from sashelp.class
group by sex;
create table b as
select *,std(weight) as std_weight_by_age
from a
group by age;
quit;
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.