Programming the statistical procedures from SAS

How to compute a combined standard deviation?

Accepted Solution Solved
Reply
New Contributor
Posts: 4
Accepted Solution

How to compute a combined standard deviation?

I'm combining a number of individual observations where I have the mean and standard deviation of each.  They are sub-groups of what I'm combining them to.  How do I compute the overall standard deviation?  I know it's not as simple as averaging the standard deviations.  Also, I'd like them weighted based on the size of the mean of each observation.  An example of two observations is below, but in many cases I'll have more than two observations to be combined.  I have base SAS and SAS Stat available.

Mean       Standard deviation

11865     3187

1010       658


Accepted Solutions
Solution
‎07-07-2017 08:20 AM
SAS Employee
Posts: 11

Re: How to compute a combined standard deviation?

There is a lot of good information here, but one piece is missing.  Yes, you need the information (error sum of squares) contained in the group standard deviations, but you also need to take into account the variability of the group means about the overall mean in a calculation of the overall sample standard deviation.

 

You can write a lot of DATA Step and SAS Procedure code to come up with the answer to this question, but SQL can get you there fairly succintly.  Try this example shown below.  The initial DATA step generates sample data with 5 groups and a varying number of observations per group.  The first PROC MEANS step generates the data set described by the original poster, one with just the group n's, means, and standard deviations.  The second PROC MEANS step gives the mean and standard deviation for the overall sample.  It is this mean and standard deviation that we wish to replicate from the information in the data set GroupStats (the output data set from that first PROC MEANS).

 

The PROC SQL step gives us that result.  SQL is really good at doing many things.  Most are more familiar with doing joins using SQL, but it also works well at the type of calculation we need here.  The calculation of the overall standard deviation requires computing summary statistics on several different levels.  This calculation can be done in a single PROC SQL step.  

 

There are several sources on the web for explaining the derivation of this calculation, including some Youtube videos.  See those for the why's of the SQL step below.  

 

data test;
   call streaminit(613254);
   do group=1 to 5;
      do rep=1 to (5+group);
         y=(group*10) + rand('normal');
         output;
   end; end;
run;

 

proc means data=test;
   by group;
   var y;
   output out=GroupStats n=n mean=mean stddev=std;
run;

 

proc means data=test mean stddev;
   var y;
run;

 

proc sql;
   create table SQLFinalStats as
   select distinct sum_n, OverallMean, sqrt(sum(TSS)/(sum(n)-1)) as OverallStd
   from (
             select n, sum(n) as sum_n,
                       sum(n*mean)/sum(n) as OverallMean,
                       std*std*(n-1) + n*(mean-(sum(n*mean)/sum(n)))**2 as TSS

             from GroupStats);
quit;

View solution in original post


All Replies
Super User
Posts: 18,526

Re: How to compute a combined standard deviation?

Do you have the n, the number of observations in each group?

New Contributor
Posts: 4

Re: How to compute a combined standard deviation?

Yes, I would know the number of observations in each group.

Super User
Posts: 18,526

Re: How to compute a combined standard deviation?

You need the correlation between the observations as well. If you can assume its independent then you put the covariance/correlation=0.

Variance - Wikipedia, the free encyclopedia

For the Mean you can add in N to your data and run proc means with the n as your weight and it will return the corrected weighted mean. I'm not sure about the variance yet.

Trusted Advisor
Posts: 1,659

Re: How to compute a combined standard deviation?

Also, I'd like them weighted based on the size of the mean of each observation.

Use PROC MEANS, with the WEIGHT statement, on the variances. Take the square root of the final answer to get the weighted standard deviation

Super User
Posts: 18,526

Re: How to compute a combined standard deviation?

It depends on if you want to weight by the mean or if you're actually trying to recreate the total group from the summaries of a subgroup.

If you're trying to create the statistics for a total group, the weight option alone won't work AFAIK.

If you're trying to weight it by the mean then it will work but I'm not sure what interpretation that statistic will have.

New Contributor
Posts: 4

Re: How to compute a combined standard deviation?

Yes, I'm trying to get it for the total group.  We can assume the sub-groups are independent.

Trusted Advisor
Posts: 1,659

Re: How to compute a combined standard deviation?

If you're trying to create the statistics for a total group, the weight option alone won't work AFAIK.

The WEIGHT statement, plus the VARDEF=DF option, should do the job.

If you're trying to weight it by the mean then it will work but I'm not sure what interpretation that statistic will have.

Yes, I agree, I am having trouble envisioning a situation where I would want to weight by the mean.

Respected Advisor
Posts: 2,655

Re: How to compute a combined standard deviation?

Given everything that is going on, I think we need to think about what is additive under weighting, and that would be the error sums of squares.  I would suggest calculating these from the standard deviations and Ni's, then add them up and apply sqrt(SSE/(N-k)), where k is the number of records, and N is the sum of all the Ni's for all the records.

Steve Denham

Trusted Advisor
Posts: 1,659

Re: How to compute a combined standard deviation?

I would suggest calculating these from the standard deviations and Ni's,

Don't you mean the variances?

Doesn't summing or averaging the variances with VARDEF=DF and the WEIGHT statement with the value of N get the variance we are after?

Respected Advisor
Posts: 2,655

Re: How to compute a combined standard deviation?

I think the OP had standard deviations, so that was where I was going.  Of course, variances would work just as well.  I'm not so sure that a weighted sum of the variances will get what we need.  For example, say we had

N     SD     Var

10     10     100

20     10     100

20     20     400

The way I calculate it, SSE= 9*100 + 19*100 + 19*400 = 10400, so the pooled SD would be sqrt(10400/(47)) = 14.875...

Now if N is large, then n-1 -> n, and the weighted sum of the variances produced by PROC MEANS would look like the pooled weighted variance.  For this example it is 11000, for an SD of sqrt(11000/50) = 14.832... which is an error of less than 0.3%.  I can live with that.  If you calculate the weighted mean of the SDs (=14 in this example), the error is about 5.6%.  So, weighted mean of the variances all around as the easier method, so long as N is big.

Steve Denham

Super User
Posts: 5,235

Re: How to compute a combined standard deviation?

I'm not sure if this is another way of saying what has already been said.  In any case ...

If you only have the means of your subgroups, you can't combine them to get an overall mean.  You would need the individual components, namely SUM and N for each group.  Then it's easy to get the overall mean.  Weighted formulas are a little harder, but the formulas exist.

Similarly, if you have standard deviations, you can't combine them to get an overall standard deviation.  But if you save the key components, you can combine them.  In the case of standard deviations, the key components are SUM, N, and the sum of squared values (which is also one of the statistics available from PROC MEANS).  The formula is a little more complex, and even more complex if you allow for weights.  But the formulas exist.

So if you have the ability to plan and calculate from scratch, save SUM, N, and the sum of squared values for each subgroup.  But if you don't have these pieces, you might be in trouble.  (Emphasis on the "might" just in case anybody has a way around this.)

Good luck.

Super User
Posts: 18,526

Re: How to compute a combined standard deviation?

For means, if you have the n you can get the total mean, since sum = mean*n. I think its the same for Var, but I'm not sure.

Here's what I tried but now I need to go do some actual work :smileysilly:

data random;

    do group=1 to 5;

        do n=1 to 100;

            obs=rand('normal', 25, 5+group);

            obs2=obs**2;

            output;

        end;

    end;

run;

proc means data=random nway n mean std var;

class group;

ways 0 1;

var obs;

output out=summary1 n= mean= var=/autoname;

run;

proc means data=random nway n mean std var;

class group;

ways 0 1;

var obs2;

output out=summary2 n= mean= var=/autoname;

run;

proc means data=summary1 mean;

    where _type_=1;

    var obs_mean obs_var;

    weight obs_n;

run;

data variance_summed;

    set summary1 end=eof;

    where _type_=1;

    retain sum_x 0 sum_x2 0 n_total 0;

    sum_x=sum_x+obs_n*obs_mean;

    x2=obs_var+(obs_mean**2);

    sum_x2=sum_x2+x2*obs_n;

    n_total=n_total+obs_n;

    output;

    if eof then do;

        expected_x2=sum_x/n_total;

        expected_x22=sum_x2/n_total;

        var_x=expected_x22-(expected_x2**2);

        output;

    end;

run;

Solution
‎07-07-2017 08:20 AM
SAS Employee
Posts: 11

Re: How to compute a combined standard deviation?

There is a lot of good information here, but one piece is missing.  Yes, you need the information (error sum of squares) contained in the group standard deviations, but you also need to take into account the variability of the group means about the overall mean in a calculation of the overall sample standard deviation.

 

You can write a lot of DATA Step and SAS Procedure code to come up with the answer to this question, but SQL can get you there fairly succintly.  Try this example shown below.  The initial DATA step generates sample data with 5 groups and a varying number of observations per group.  The first PROC MEANS step generates the data set described by the original poster, one with just the group n's, means, and standard deviations.  The second PROC MEANS step gives the mean and standard deviation for the overall sample.  It is this mean and standard deviation that we wish to replicate from the information in the data set GroupStats (the output data set from that first PROC MEANS).

 

The PROC SQL step gives us that result.  SQL is really good at doing many things.  Most are more familiar with doing joins using SQL, but it also works well at the type of calculation we need here.  The calculation of the overall standard deviation requires computing summary statistics on several different levels.  This calculation can be done in a single PROC SQL step.  

 

There are several sources on the web for explaining the derivation of this calculation, including some Youtube videos.  See those for the why's of the SQL step below.  

 

data test;
   call streaminit(613254);
   do group=1 to 5;
      do rep=1 to (5+group);
         y=(group*10) + rand('normal');
         output;
   end; end;
run;

 

proc means data=test;
   by group;
   var y;
   output out=GroupStats n=n mean=mean stddev=std;
run;

 

proc means data=test mean stddev;
   var y;
run;

 

proc sql;
   create table SQLFinalStats as
   select distinct sum_n, OverallMean, sqrt(sum(TSS)/(sum(n)-1)) as OverallStd
   from (
             select n, sum(n) as sum_n,
                       sum(n*mean)/sum(n) as OverallMean,
                       std*std*(n-1) + n*(mean-(sum(n*mean)/sum(n)))**2 as TSS

             from GroupStats);
quit;

☑ This topic is solved.

Need further help from the community? Please ask a new question.

Discussion stats
  • 13 replies
  • 12517 views
  • 0 likes
  • 6 in conversation