BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
infotechmar
Calcite | Level 5

I'm combining a number of individual observations where I have the mean and standard deviation of each.  They are sub-groups of what I'm combining them to.  How do I compute the overall standard deviation?  I know it's not as simple as averaging the standard deviations.  Also, I'd like them weighted based on the size of the mean of each observation.  An example of two observations is below, but in many cases I'll have more than two observations to be combined.  I have base SAS and SAS Stat available.

Mean       Standard deviation

11865     3187

1010       658

1 ACCEPTED SOLUTION

Accepted Solutions
StatsMan
SAS Super FREQ

There is a lot of good information here, but one piece is missing.  Yes, you need the information (error sum of squares) contained in the group standard deviations, but you also need to take into account the variability of the group means about the overall mean in a calculation of the overall sample standard deviation.

 

You can write a lot of DATA Step and SAS Procedure code to come up with the answer to this question, but SQL can get you there fairly succintly.  Try this example shown below.  The initial DATA step generates sample data with 5 groups and a varying number of observations per group.  The first PROC MEANS step generates the data set described by the original poster, one with just the group n's, means, and standard deviations.  The second PROC MEANS step gives the mean and standard deviation for the overall sample.  It is this mean and standard deviation that we wish to replicate from the information in the data set GroupStats (the output data set from that first PROC MEANS).

 

The PROC SQL step gives us that result.  SQL is really good at doing many things.  Most are more familiar with doing joins using SQL, but it also works well at the type of calculation we need here.  The calculation of the overall standard deviation requires computing summary statistics on several different levels.  This calculation can be done in a single PROC SQL step.  

 

There are several sources on the web for explaining the derivation of this calculation, including some Youtube videos.  See those for the why's of the SQL step below.  

 

data test;
   call streaminit(613254);
   do group=1 to 5;
      do rep=1 to (5+group);
         y=(group*10) + rand('normal');
         output;
   end; end;
run;

 

proc means data=test;
   by group;
   var y;
   output out=GroupStats n=n mean=mean stddev=std;
run;

 

proc means data=test mean stddev;
   var y;
run;

 

proc sql;
   create table SQLFinalStats as
   select distinct sum_n, OverallMean, sqrt(sum(TSS)/(sum(n)-1)) as OverallStd
   from (
             select n, sum(n) as sum_n,
                       sum(n*mean)/sum(n) as OverallMean,
                       std*std*(n-1) + n*(mean-(sum(n*mean)/sum(n)))**2 as TSS

             from GroupStats);
quit;

View solution in original post

13 REPLIES 13
Reeza
Super User

Do you have the n, the number of observations in each group?

infotechmar
Calcite | Level 5

Yes, I would know the number of observations in each group.

Reeza
Super User

You need the correlation between the observations as well. If you can assume its independent then you put the covariance/correlation=0.

Variance - Wikipedia, the free encyclopedia

For the Mean you can add in N to your data and run proc means with the n as your weight and it will return the corrected weighted mean. I'm not sure about the variance yet.

PaigeMiller
Diamond | Level 26

Also, I'd like them weighted based on the size of the mean of each observation.

Use PROC MEANS, with the WEIGHT statement, on the variances. Take the square root of the final answer to get the weighted standard deviation

--
Paige Miller
Reeza
Super User

It depends on if you want to weight by the mean or if you're actually trying to recreate the total group from the summaries of a subgroup.

If you're trying to create the statistics for a total group, the weight option alone won't work AFAIK.

If you're trying to weight it by the mean then it will work but I'm not sure what interpretation that statistic will have.

infotechmar
Calcite | Level 5

Yes, I'm trying to get it for the total group.  We can assume the sub-groups are independent.

PaigeMiller
Diamond | Level 26

If you're trying to create the statistics for a total group, the weight option alone won't work AFAIK.

The WEIGHT statement, plus the VARDEF=DF option, should do the job.

If you're trying to weight it by the mean then it will work but I'm not sure what interpretation that statistic will have.

Yes, I agree, I am having trouble envisioning a situation where I would want to weight by the mean.

--
Paige Miller
SteveDenham
Jade | Level 19

Given everything that is going on, I think we need to think about what is additive under weighting, and that would be the error sums of squares.  I would suggest calculating these from the standard deviations and Ni's, then add them up and apply sqrt(SSE/(N-k)), where k is the number of records, and N is the sum of all the Ni's for all the records.

Steve Denham

PaigeMiller
Diamond | Level 26

I would suggest calculating these from the standard deviations and Ni's,

Don't you mean the variances?

Doesn't summing or averaging the variances with VARDEF=DF and the WEIGHT statement with the value of N get the variance we are after?

--
Paige Miller
SteveDenham
Jade | Level 19

I think the OP had standard deviations, so that was where I was going.  Of course, variances would work just as well.  I'm not so sure that a weighted sum of the variances will get what we need.  For example, say we had

N     SD     Var

10     10     100

20     10     100

20     20     400

The way I calculate it, SSE= 9*100 + 19*100 + 19*400 = 10400, so the pooled SD would be sqrt(10400/(47)) = 14.875...

Now if N is large, then n-1 -> n, and the weighted sum of the variances produced by PROC MEANS would look like the pooled weighted variance.  For this example it is 11000, for an SD of sqrt(11000/50) = 14.832... which is an error of less than 0.3%.  I can live with that.  If you calculate the weighted mean of the SDs (=14 in this example), the error is about 5.6%.  So, weighted mean of the variances all around as the easier method, so long as N is big.

Steve Denham

Astounding
PROC Star

I'm not sure if this is another way of saying what has already been said.  In any case ...

If you only have the means of your subgroups, you can't combine them to get an overall mean.  You would need the individual components, namely SUM and N for each group.  Then it's easy to get the overall mean.  Weighted formulas are a little harder, but the formulas exist.

Similarly, if you have standard deviations, you can't combine them to get an overall standard deviation.  But if you save the key components, you can combine them.  In the case of standard deviations, the key components are SUM, N, and the sum of squared values (which is also one of the statistics available from PROC MEANS).  The formula is a little more complex, and even more complex if you allow for weights.  But the formulas exist.

So if you have the ability to plan and calculate from scratch, save SUM, N, and the sum of squared values for each subgroup.  But if you don't have these pieces, you might be in trouble.  (Emphasis on the "might" just in case anybody has a way around this.)

Good luck.

Reeza
Super User

For means, if you have the n you can get the total mean, since sum = mean*n. I think its the same for Var, but I'm not sure.

Here's what I tried but now I need to go do some actual work :smileysilly:

data random;

    do group=1 to 5;

        do n=1 to 100;

            obs=rand('normal', 25, 5+group);

            obs2=obs**2;

            output;

        end;

    end;

run;

proc means data=random nway n mean std var;

class group;

ways 0 1;

var obs;

output out=summary1 n= mean= var=/autoname;

run;

proc means data=random nway n mean std var;

class group;

ways 0 1;

var obs2;

output out=summary2 n= mean= var=/autoname;

run;

proc means data=summary1 mean;

    where _type_=1;

    var obs_mean obs_var;

    weight obs_n;

run;

data variance_summed;

    set summary1 end=eof;

    where _type_=1;

    retain sum_x 0 sum_x2 0 n_total 0;

    sum_x=sum_x+obs_n*obs_mean;

    x2=obs_var+(obs_mean**2);

    sum_x2=sum_x2+x2*obs_n;

    n_total=n_total+obs_n;

    output;

    if eof then do;

        expected_x2=sum_x/n_total;

        expected_x22=sum_x2/n_total;

        var_x=expected_x22-(expected_x2**2);

        output;

    end;

run;

StatsMan
SAS Super FREQ

There is a lot of good information here, but one piece is missing.  Yes, you need the information (error sum of squares) contained in the group standard deviations, but you also need to take into account the variability of the group means about the overall mean in a calculation of the overall sample standard deviation.

 

You can write a lot of DATA Step and SAS Procedure code to come up with the answer to this question, but SQL can get you there fairly succintly.  Try this example shown below.  The initial DATA step generates sample data with 5 groups and a varying number of observations per group.  The first PROC MEANS step generates the data set described by the original poster, one with just the group n's, means, and standard deviations.  The second PROC MEANS step gives the mean and standard deviation for the overall sample.  It is this mean and standard deviation that we wish to replicate from the information in the data set GroupStats (the output data set from that first PROC MEANS).

 

The PROC SQL step gives us that result.  SQL is really good at doing many things.  Most are more familiar with doing joins using SQL, but it also works well at the type of calculation we need here.  The calculation of the overall standard deviation requires computing summary statistics on several different levels.  This calculation can be done in a single PROC SQL step.  

 

There are several sources on the web for explaining the derivation of this calculation, including some Youtube videos.  See those for the why's of the SQL step below.  

 

data test;
   call streaminit(613254);
   do group=1 to 5;
      do rep=1 to (5+group);
         y=(group*10) + rand('normal');
         output;
   end; end;
run;

 

proc means data=test;
   by group;
   var y;
   output out=GroupStats n=n mean=mean stddev=std;
run;

 

proc means data=test mean stddev;
   var y;
run;

 

proc sql;
   create table SQLFinalStats as
   select distinct sum_n, OverallMean, sqrt(sum(TSS)/(sum(n)-1)) as OverallStd
   from (
             select n, sum(n) as sum_n,
                       sum(n*mean)/sum(n) as OverallMean,
                       std*std*(n-1) + n*(mean-(sum(n*mean)/sum(n)))**2 as TSS

             from GroupStats);
quit;

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

What is ANOVA?

ANOVA, or Analysis Of Variance, is used to compare the averages or means of two or more populations to better understand how they differ. Watch this tutorial for more.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 13 replies
  • 20902 views
  • 0 likes
  • 6 in conversation