Solved: Re: How to compute a combined standard deviation?

infotechmar · Posted 04-04-2013 11:21 AM

I'm combining a number of individual observations where I have the mean and standard deviation of each. They are sub-groups of what I'm combining them to. How do I compute the overall standard deviation? I know it's not as simple as averaging the standard deviations. Also, I'd like them weighted based on the size of the mean of each observation. An example of two observations is below, but in many cases I'll have more than two observations to be combined. I have base SAS and SAS Stat available.

Mean Standard deviation

11865 3187

1010 658

StatsMan · Posted 07-05-2017 11:24 AM

There is a lot of good information here, but one piece is missing. Yes, you need the information (error sum of squares) contained in the group standard deviations, but you also need to take into account the variability of the group means about the overall mean in a calculation of the overall sample standard deviation.

You can write a lot of DATA Step and SAS Procedure code to come up with the answer to this question, but SQL can get you there fairly succintly. Try this example shown below. The initial DATA step generates sample data with 5 groups and a varying number of observations per group. The first PROC MEANS step generates the data set described by the original poster, one with just the group n's, means, and standard deviations. The second PROC MEANS step gives the mean and standard deviation for the overall sample. It is this mean and standard deviation that we wish to replicate from the information in the data set GroupStats (the output data set from that first PROC MEANS).

The PROC SQL step gives us that result. SQL is really good at doing many things. Most are more familiar with doing joins using SQL, but it also works well at the type of calculation we need here. The calculation of the overall standard deviation requires computing summary statistics on several different levels. This calculation can be done in a single PROC SQL step.

There are several sources on the web for explaining the derivation of this calculation, including some Youtube videos. See those for the why's of the SQL step below.

data test;
call streaminit(613254);
do group=1 to 5;
do rep=1 to (5+group);
y=(group*10) + rand('normal');
output;
end; end;
run;

proc means data=test;
by group;
var y;
output out=GroupStats n=n mean=mean stddev=std;
run;

proc means data=test mean stddev;
var y;
run;

proc sql;
create table SQLFinalStats as
select distinct sum_n, OverallMean, sqrt(sum(TSS)/(sum(n)-1)) as OverallStd
from (
select n, sum(n) as sum_n,
sum(n*mean)/sum(n) as OverallMean,
std*std*(n-1) + n*(mean-(sum(n*mean)/sum(n)))**2 as TSS

from GroupStats);
quit;

View solution in original post

Reeza · Posted 04-04-2013 11:35 AM

Do you have the n, the number of observations in each group?

infotechmar · Posted 04-04-2013 12:15 PM

Yes, I would know the number of observations in each group.

Reeza · Posted 04-04-2013 12:34 PM

You need the correlation between the observations as well. If you can assume its independent then you put the covariance/correlation=0.

Variance - Wikipedia, the free encyclopedia

For the Mean you can add in N to your data and run proc means with the n as your weight and it will return the corrected weighted mean. I'm not sure about the variance yet.

PaigeMiller · Posted 04-04-2013 01:12 PM

Also, I'd like them weighted based on the size of the mean of each observation.

Use PROC MEANS, with the WEIGHT statement, on the variances. Take the square root of the final answer to get the weighted standard deviation

--
Paige Miller

Reeza · Posted 04-04-2013 01:30 PM

It depends on if you want to weight by the mean or if you're actually trying to recreate the total group from the summaries of a subgroup.

If you're trying to create the statistics for a total group, the weight option alone won't work AFAIK.

If you're trying to weight it by the mean then it will work but I'm not sure what interpretation that statistic will have.

infotechmar · Posted 04-04-2013 01:33 PM

Yes, I'm trying to get it for the total group. We can assume the sub-groups are independent.

PaigeMiller · Posted 04-04-2013 01:34 PM

If you're trying to create the statistics for a total group, the weight option alone won't work AFAIK.

The WEIGHT statement, plus the VARDEF=DF option, should do the job.

If you're trying to weight it by the mean then it will work but I'm not sure what interpretation that statistic will have.

Yes, I agree, I am having trouble envisioning a situation where I would want to weight by the mean.

--
Paige Miller

SteveDenham · Posted 04-04-2013 01:46 PM

Given everything that is going on, I think we need to think about what is additive under weighting, and that would be the error sums of squares. I would suggest calculating these from the standard deviations and Ni's, then add them up and apply sqrt(SSE/(N-k)), where k is the number of records, and N is the sum of all the Ni's for all the records.

Steve Denham

PaigeMiller · Posted 04-04-2013 01:49 PM

I would suggest calculating these from the standard deviations and Ni's,

Don't you mean the variances?

Doesn't summing or averaging the variances with VARDEF=DF and the WEIGHT statement with the value of N get the variance we are after?

--
Paige Miller

SteveDenham · Posted 04-04-2013 02:17 PM

I think the OP had standard deviations, so that was where I was going. Of course, variances would work just as well. I'm not so sure that a weighted sum of the variances will get what we need. For example, say we had

N SD Var

10 10 100

20 10 100

20 20 400

The way I calculate it, SSE= 9*100 + 19*100 + 19*400 = 10400, so the pooled SD would be sqrt(10400/(47)) = 14.875...

Now if N is large, then n-1 -> n, and the weighted sum of the variances produced by PROC MEANS would look like the pooled weighted variance. For this example it is 11000, for an SD of sqrt(11000/50) = 14.832... which is an error of less than 0.3%. I can live with that. If you calculate the weighted mean of the SDs (=14 in this example), the error is about 5.6%. So, weighted mean of the variances all around as the easier method, so long as N is big.

Steve Denham

Astounding · Posted 04-04-2013 02:36 PM

I'm not sure if this is another way of saying what has already been said. In any case ...

If you only have the means of your subgroups, you can't combine them to get an overall mean. You would need the individual components, namely SUM and N for each group. Then it's easy to get the overall mean. Weighted formulas are a little harder, but the formulas exist.

Similarly, if you have standard deviations, you can't combine them to get an overall standard deviation. But if you save the key components, you can combine them. In the case of standard deviations, the key components are SUM, N, and the sum of squared values (which is also one of the statistics available from PROC MEANS). The formula is a little more complex, and even more complex if you allow for weights. But the formulas exist.

So if you have the ability to plan and calculate from scratch, save SUM, N, and the sum of squared values for each subgroup. But if you don't have these pieces, you might be in trouble. (Emphasis on the "might" just in case anybody has a way around this.)

Good luck.

Reeza · Posted 04-04-2013 03:28 PM

For means, if you have the n you can get the total mean, since sum = mean*n. I think its the same for Var, but I'm not sure.

Here's what I tried but now I need to go do some actual work :smileysilly:

data random;

do group=1 to 5;

do n=1 to 100;

obs=rand('normal', 25, 5+group);

obs2=obs**2;

output;

end;

run;

proc means data=random nway n mean std var;

class group;

ways 0 1;

var obs;

output out=summary1 n= mean= var=/autoname;

run;

proc means data=random nway n mean std var;

class group;

ways 0 1;

var obs2;

output out=summary2 n= mean= var=/autoname;

run;

proc means data=summary1 mean;

where _type_=1;

var obs_mean obs_var;

weight obs_n;

run;

data variance_summed;

set summary1 end=eof;

where _type_=1;

retain sum_x 0 sum_x2 0 n_total 0;

sum_x=sum_x+obs_n*obs_mean;

x2=obs_var+(obs_mean**2);

sum_x2=sum_x2+x2*obs_n;

n_total=n_total+obs_n;

output;

if eof then do;

expected_x2=sum_x/n_total;

expected_x22=sum_x2/n_total;

var_x=expected_x22-(expected_x2**2);

output;

end;

run;

StatsMan · Posted 07-05-2017 11:24 AM

There is a lot of good information here, but one piece is missing. Yes, you need the information (error sum of squares) contained in the group standard deviations, but you also need to take into account the variability of the group means about the overall mean in a calculation of the overall sample standard deviation.

You can write a lot of DATA Step and SAS Procedure code to come up with the answer to this question, but SQL can get you there fairly succintly. Try this example shown below. The initial DATA step generates sample data with 5 groups and a varying number of observations per group. The first PROC MEANS step generates the data set described by the original poster, one with just the group n's, means, and standard deviations. The second PROC MEANS step gives the mean and standard deviation for the overall sample. It is this mean and standard deviation that we wish to replicate from the information in the data set GroupStats (the output data set from that first PROC MEANS).

The PROC SQL step gives us that result. SQL is really good at doing many things. Most are more familiar with doing joins using SQL, but it also works well at the type of calculation we need here. The calculation of the overall standard deviation requires computing summary statistics on several different levels. This calculation can be done in a single PROC SQL step.

There are several sources on the web for explaining the derivation of this calculation, including some Youtube videos. See those for the why's of the SQL step below.

data test;
call streaminit(613254);
do group=1 to 5;
do rep=1 to (5+group);
y=(group*10) + rand('normal');
output;
end; end;
run;

proc means data=test;
by group;
var y;
output out=GroupStats n=n mean=mean stddev=std;
run;

proc means data=test mean stddev;
var y;
run;

proc sql;
create table SQLFinalStats as
select distinct sum_n, OverallMean, sqrt(sum(TSS)/(sum(n)-1)) as OverallStd
from (
select n, sum(n) as sum_n,
sum(n*mean)/sum(n) as OverallMean,
std*std*(n-1) + n*(mean-(sum(n*mean)/sum(n)))**2 as TSS

from GroupStats);
quit;

Catch up on SAS Innovate 2026