Solved: Re: Grouping stats in no grouped data set

Matt3 · Posted 08-08-2017 06:06 PM

Hello,

I am struggling with the groups' statistics in not grouped data set, I would be grateful if anyone please tell me an efficient way to count stats like std, mean, max by a group and add it to the original dataset. I found this http://support.sas.com/resources/papers/proceedings11/019-2011.pdf, but they show how to count or sum not mentioned std, max, mean. The other problem is efficiency, I have to create over 500 variables like mentioned one, working with more than 20 large tables(over 10 mln obs in each) so sorting and merging is time-consuming(SQL package process everything in about 15h, nevertheless there is need to rewrite everything into 4GL, SQL Oracle can hadle this problem with overt (partition by)).

In the table below, you will find desirable results.

Thanks.

Name	Sex	Age	Weight	Height	mean_age_by_sex	std_weight_by_age
Philip	M	16	72	150	13,4	0
Janet	F	15	62,5	112,5	13,22	1,81573
Mary	F	15	66,5	112	13,22	1,81573
Ronald	M	15	67	133	13,4	1,81573
William	M	15	66,5	112	13,4	1,81573
Carol	F	14	62,8	102,5	13,22	2,425902
Judy	F	14	64,3	90	13,22	2,425902
Alfred	M	14	69	112,5	13,4	2,425902
Henry	M	14	63,5	102,5	13,4	2,425902
Alice	F	13	56,5	84	13,22	3,670907
Barbara	F	13	65,3	98	13,22	3,670907
Jeffrey	M	13	62,5	84	13,4	3,670907
Jane	F	12	59,8	84,5	13,22	2,949305
Louise	F	12	56,3	77	13,22	2,949305
James	M	12	57,3	83	13,4	2,949305
John	M	12	59	99,5	13,4	2,949305
Robert	M	12	64,8	128	13,4	2,949305
Joyce	F	11	51,3	50,5	13,22	3,1
Thomas	M	11	57,5	85	13,4	3,1

ballardw · Posted 08-08-2017 06:27 PM

Any time I see a request to merge summary values into a data set I want to know how the resulting data set is to be used.

Next, with the same question, comes from "I have to create over 500 variables like mentioned one". Is this 500 in a single set or across the 20 (adding 25 variables to a single set makes much more sense than 500 to each of 20 sets).

How you use the result can make a difference on approachs. For some things I would tend to make a custom format and attach that to a variable for reporting purposes. For example, your "mean_age_by_sex" variable is going to take exactly 2 values in any given data set.

So an additional variable is really not needed as if the subject is male or female you already know the value.

It would not hurt to mention that you started with SASHELP.CLASS for example data so folks can work with the data.

with that said:

proc summary data=sashelp.class;
   class sex age;
   var height weight;
   output out=classsummary mean= max= min= std= median= range= /autoname;
run;

Will in a single pass through the data provide the requested statistics for 1) All records 2) by sex 3) by age and 4) by age and sex.

The autoname appends the statistic to the name of the variable. There will be a variable in the output set named _type_. It will have values of 0 to 3 in this case and indicates which combinations of group variables are used to make the statistic.

I would be strongly tempted to not merge the statistics onto the data until I am actually using them as then I can indicate which _type_ to use and match on the appropriate class variable(s) or none at all. But that's me.

View solution in original post

ballardw · Posted 08-08-2017 06:27 PM

Any time I see a request to merge summary values into a data set I want to know how the resulting data set is to be used.

Next, with the same question, comes from "I have to create over 500 variables like mentioned one". Is this 500 in a single set or across the 20 (adding 25 variables to a single set makes much more sense than 500 to each of 20 sets).

How you use the result can make a difference on approachs. For some things I would tend to make a custom format and attach that to a variable for reporting purposes. For example, your "mean_age_by_sex" variable is going to take exactly 2 values in any given data set.

So an additional variable is really not needed as if the subject is male or female you already know the value.

It would not hurt to mention that you started with SASHELP.CLASS for example data so folks can work with the data.

with that said:

proc summary data=sashelp.class;
   class sex age;
   var height weight;
   output out=classsummary mean= max= min= std= median= range= /autoname;
run;

Will in a single pass through the data provide the requested statistics for 1) All records 2) by sex 3) by age and 4) by age and sex.

The autoname appends the statistic to the name of the variable. There will be a variable in the output set named _type_. It will have values of 0 to 3 in this case and indicates which combinations of group variables are used to make the statistic.

I would be strongly tempted to not merge the statistics onto the data until I am actually using them as then I can indicate which _type_ to use and match on the appropriate class variable(s) or none at all. But that's me.

Matt3 · Posted 08-09-2017 02:11 AM

1. Resulting data set is going to be use as analytical base table customer 360 customer view*.
2. Current SQL package build about 20 ABT tables from diffrent point of view and merge them into one ABT 360 customer view*.

ballardw · Posted 08-09-2017 11:32 AM

I have no idea what " base table customer 360 degree" means.

What specific types of analysis are contemplated? I am not seeing an advantage to all of the redundant information in the proposed structure especially with such large numbers of records.

And is "ABT table" supposed to be something special or is that local to your organization jargon?

Would it be possible to post a few rows of one of these final tables?

Matt3 · Posted 08-09-2017 03:15 PM

No that`s no possible(personal details), unfortunately.

ABT is commonly used for predictive models, especially in the banking sector or telecommunication industry.

https://en.wikipedia.org/wiki/Analytical_base_table

Nevertheless, thank you for your help. I am going to use proc summary.

ballardw · Posted 08-09-2017 03:37 PM

I am not a banking expert and only a basic modeller but this sounds like one approach, depending on how many individuals you have, would be

1) summarize from the entire data such as the Proc Summary shown.

2) select a sample from the main data, maybe a 20%, or enough to give a reasonable shot at developing a model

3) test the model with the remaining data.

If the time crunch is the mergin this may help.

Ksharp · Posted 08-09-2017 09:28 AM

write multiple SQL statement, everyone correspond to a statisical .

proc sql;
create table a as
 select *,mean(age) as mean_age_by_sex
  from sashelp.class
   group by sex;
create table b as
 select *,std(weight) as std_weight_by_age
  from a 
   group by age;
quit;

SAS Innovate 2025: Call for Content

Classroom Training Available!