BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Matt3
Quartz | Level 8

Hello,

I  am struggling with the groups' statistics in not grouped data set, I would be grateful if anyone please tell me an efficient way to count stats like std, mean, max by a group and add it to the original dataset. I found this http://support.sas.com/resources/papers/proceedings11/019-2011.pdf, but they show how to count or sum not mentioned std, max, mean. The other problem is efficiency, I have to create over 500 variables like mentioned one, working with more than 20 large tables(over 10 mln obs in each) so sorting and merging is time-consuming(SQL package process everything in about 15h, nevertheless there is need to rewrite everything into 4GL, SQL Oracle can hadle this problem with overt (partition by)).  

In the table below, you will find desirable results.  

 

Thanks.

NameSexAgeWeightHeightmean_age_by_sexstd_weight_by_age
PhilipM167215013,40
JanetF1562,5112,513,221,81573
MaryF1566,511213,221,81573
RonaldM156713313,41,81573
WilliamM1566,511213,41,81573
CarolF1462,8102,513,222,425902
JudyF1464,39013,222,425902
AlfredM1469112,513,42,425902
HenryM1463,5102,513,42,425902
AliceF1356,58413,223,670907
BarbaraF1365,39813,223,670907
JeffreyM1362,58413,43,670907
JaneF1259,884,513,222,949305
LouiseF1256,37713,222,949305
JamesM1257,38313,42,949305
JohnM125999,513,42,949305
RobertM1264,812813,42,949305
JoyceF1151,350,513,223,1
ThomasM1157,58513,4

3,1

1 ACCEPTED SOLUTION

Accepted Solutions
ballardw
Super User

Any time I see a request to merge summary values into a data set I want to know how the resulting data set is to be used.

 

Next, with the same question, comes from "I have to create over 500 variables like mentioned one". Is this 500 in a single set or across the 20 (adding 25 variables to a single set makes much more sense than 500 to each of 20 sets).

 

How you use the result can make a difference on approachs. For some things I would tend to make a custom format and attach that to a variable for reporting purposes. For example, your "mean_age_by_sex" variable is going to take exactly 2 values in any given data set.

So an additional variable is really not needed as if the subject is male or female you already know the value.

 

It would not hurt to mention that you started with SASHELP.CLASS for example data so folks can work with the data.

 

with that said:

proc summary data=sashelp.class;
   class sex age;
   var height weight;
   output out=classsummary mean= max= min= std= median= range= /autoname;
run;

Will in a single pass through the data provide the requested statistics for 1) All records 2) by sex 3) by age and 4) by age and sex.

 

The autoname appends the statistic to the name of the variable. There will be a variable in the output set named _type_. It will have values of 0 to 3 in this case and indicates which combinations of group variables are used to make the statistic.

I would be strongly tempted to not merge the statistics onto the data until I am actually using them as then I can indicate which _type_ to use and match on the appropriate class variable(s) or none at all. But that's me.

View solution in original post

6 REPLIES 6
ballardw
Super User

Any time I see a request to merge summary values into a data set I want to know how the resulting data set is to be used.

 

Next, with the same question, comes from "I have to create over 500 variables like mentioned one". Is this 500 in a single set or across the 20 (adding 25 variables to a single set makes much more sense than 500 to each of 20 sets).

 

How you use the result can make a difference on approachs. For some things I would tend to make a custom format and attach that to a variable for reporting purposes. For example, your "mean_age_by_sex" variable is going to take exactly 2 values in any given data set.

So an additional variable is really not needed as if the subject is male or female you already know the value.

 

It would not hurt to mention that you started with SASHELP.CLASS for example data so folks can work with the data.

 

with that said:

proc summary data=sashelp.class;
   class sex age;
   var height weight;
   output out=classsummary mean= max= min= std= median= range= /autoname;
run;

Will in a single pass through the data provide the requested statistics for 1) All records 2) by sex 3) by age and 4) by age and sex.

 

The autoname appends the statistic to the name of the variable. There will be a variable in the output set named _type_. It will have values of 0 to 3 in this case and indicates which combinations of group variables are used to make the statistic.

I would be strongly tempted to not merge the statistics onto the data until I am actually using them as then I can indicate which _type_ to use and match on the appropriate class variable(s) or none at all. But that's me.

Matt3
Quartz | Level 8

1. Resulting data set is going to be use as analytical base table customer 360 customer view*.
2. Current SQL package build about 20 ABT tables from diffrent point of view and merge them into one ABT 360 customer view*.

ballardw
Super User

I have no idea what " base table customer 360 degree" means.

What specific types of analysis are contemplated? I am not seeing an advantage to all of the redundant information in the proposed structure especially with such large numbers of records.

 

And is "ABT table" supposed to be something special or is that local to your organization jargon?

 

Would it be possible to post a few rows of one of these final tables?

 

 

Matt3
Quartz | Level 8

No that`s no possible(personal details), unfortunately.

 

ABT is commonly used for predictive models, especially in the banking sector or telecommunication industry.

https://en.wikipedia.org/wiki/Analytical_base_table

 

Nevertheless, thank you for your help. I am going to use proc summary.

ballardw
Super User

I am not a banking expert and only a basic modeller but this sounds like one approach, depending on how many individuals you have, would be

1) summarize from the entire data such as the Proc Summary shown.

2) select a sample from the main data, maybe a 20%, or enough to give a reasonable shot at developing a model

3) test the model with the remaining data.

 

If the time crunch is the mergin this may help.

Ksharp
Super User

write multiple SQL statement, everyone correspond to a statisical .

 

proc sql;
create table a as
 select *,mean(age) as mean_age_by_sex
  from sashelp.class
   group by sex;
create table b as
 select *,std(weight) as std_weight_by_age
  from a 
   group by age;
quit;

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 6 replies
  • 861 views
  • 0 likes
  • 3 in conversation