hellohere
Pyrite | Level 9

Here is a time-series sample dataset with variables x1-x4 indexed by time (i = 1, 2, 3, ...).

How can I get the weighted standard deviation? The weights for x1-x4 are wt_x1-wt_x4.

 

data _temp;
do i=1 to 1000;
	x1=sin(i/10);
	x2=cos(i/20);
	x3=sin(i/100)-cos(i/30);
	x4=sin(i/50)+cos(i/200);
	wt_x1=10; 
	wt_x2=25;
	wt_x3=30;
	wt_x4=35;
	output;
end;
run;

/*how to get weighted stdev in SQL/SAS*/

Thanks,  

 

 

 

20 REPLIES
PaigeMiller
Diamond | Level 26

This type of data is much better arranged as a long data set rather than a wide one. Why? Because the SAS PROCs that compute weighted statistics are designed to work from a long data set. This gives you the benefit that SAS has tested these statistical calculations, and also the benefit that you don't have to program them yourself. So the first step is to create a long data set; the second step is to run PROC SUMMARY to get the weighted standard deviation.

 

proc transpose data=_temp out=aaa;
    var x:;
    by i;
run;
proc transpose data=_temp out=bbb;
    var wt:;
    by i;
run;
data long;
    merge aaa(rename=(col1=x)) bbb(rename=(col1=weights));
    by i;
run;
proc summary data=long nway;
    class i;
    var x;
    weight weights;
    output out=want stddev=s;
run;
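For readers who want to sanity-check the numbers outside SAS, here is a rough Python sketch of what the long-data-set approach computes per value of i. It assumes SAS's default VARDEF=DF divisor (number of observations minus one); all names here are mine, not SAS's.

```python
import math

# Build one "long" group of (x, weight) pairs for a given time point i,
# mirroring the sin/cos toy dataset above.
def row(i):
    xs = [math.sin(i / 10), math.cos(i / 20),
          math.sin(i / 100) - math.cos(i / 30),
          math.sin(i / 50) + math.cos(i / 200)]
    ws = [10, 25, 30, 35]
    return xs, ws

def weighted_sd_df(xs, ws):
    """Weighted SD with SAS's default VARDEF=DF divisor (n - 1)."""
    wmean = sum(w * x for x, w in zip(xs, ws)) / sum(ws)
    css = sum(w * (x - wmean) ** 2 for x, w in zip(xs, ws))
    return math.sqrt(css / (len(xs) - 1))

for i in (1, 2):
    xs, ws = row(i)
    print(i, round(weighted_sd_df(xs, ws), 4))  # i = 1 gives about 5.18
```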

 


Making the data into a long data set is even easier if you use the %MAKELONG macro at https://communities.sas.com/t5/SAS-Communities-Library/Transpose-your-analysis-data-with-the-MAKELON...

 

or this method https://communities.sas.com/t5/SAS-Global-Forum-Proceedings/Switching-Things-Up-With-the-TRANSPOSE-P...

 

or this method http://support.sas.com/resources/papers/proceedings15/2785-2015.pdf

 

--
Paige Miller
hellohere
Pyrite | Level 9
Any way to do it with SAS SQL? Actually this is a toy dataset. I am trying the calculation with changing weights (the weight for each variable changes over time, calculated from columns not shown).
PaigeMiller
Diamond | Level 26

Yes, I'm sure it can be done in SQL, but I personally refuse to accept the limitation that it must be SQL, which prevents me from using the full power and features of SAS. In addition to the reasons I gave for doing this on a long data set with PROC SUMMARY: if you have code that works for 4 values of x, and at some point in the future you have more (5 or 6 or 12 or 107 values of x), you have to modify the SQL code, whereas my code needs no modification even with 107 values of x. With many more than 4 values of x, that SQL modification becomes tedious and error-prone. So I will stick with my answer. I'm sure someone will come along and provide SQL code to do this. For anyone else reading along: don't do this in SQL, don't make it harder than it needs to be.

 

The code I gave works if the weights change for every value of i.

--
Paige Miller
quickbluefish
Barite | Level 11

I really like SQL for doing simple weighted averages because the code can be written just like the equation, that is, the sum of the products over the sum of the weights.  However, partly for the reasons that @PaigeMiller pointed out re: wide data and partly because the formula for weighted SD is more complex than for a weighted mean, SQL is going to be very awkward.
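For example, the "write it like the equation" weighted mean can be sketched in Python (values are x1-x4 at i = 1 from the toy data):

```python
# Weighted mean: sum of (value * weight) over sum of weights.
vals = [0.09983, 0.99875, -0.98944, 1.01999]  # x1-x4 at i = 1
wts  = [10, 25, 30, 35]

wmean = sum(v * w for v, w in zip(vals, wts)) / sum(wts)
print(round(wmean, 4))  # about 0.3198
```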

 

Just for fun, here's a way you can do it in the DATA step (will readily admit I had to remind myself of the formula):

[image: weighted SD formula: wtSD = sqrt( sum(w_i*(x_i - wtmean)^2) / ( sum(w_i)*(n_nzw-1)/n_nzw ) ), where n_nzw is the number of non-zero weights]

 

In the DATA step, you can carry out the above like this:

** create your data (note only creating 20 rows here) ;

data _temp;
do i=1 to 20;
	x1=sin(i/10);
	x2=cos(i/20);
	x3=sin(i/100)-cos(i/30);
	x4=sin(i/50)+cos(i/200);
	wt_x1=10; 
	wt_x2=25;
	wt_x3=30;
	wt_x4=35;
	output;
end;
run;

...then do the rest in a separate data step:

data temp2;
set _temp;
array xs {4} x1-x4;
array ws {4} wt_x1-wt_x4;
sumXW=.; sumW=.;
do j=1 to dim(xs);
	sumXW+(xs[j]*ws[j]);
	sumW+ws[j];
end;
rawmean_x=mean(of xs[*]);	* raw mean of Xs ;
wtmean_x=sumXW/sumW;		* wt mean of Xs ;
sumD2w=.;
n_nzw=0;
do j=1 to dim(xs);
	sumD2w+((xs[j]-wtmean_x)**2*ws[j]);
	n_nzw+(ws[j]>0);
end;
rawSD=std(of xs[*]);		* raw SD of Xs ;

wtSD=					/* WT SD OF Xs = ... */
	sqrt(
		sumD2w/				/* sum of the weighted deviations */
						/* divided by... */
		(sumW*(				/* the sum of the weights... */
						/* corrected for the number of non-zero weights... */
			(n_nzw-1)		
			/n_nzw
			)
		)
	);
drop j;
run;

Result -- note raw and weighted values.  You can verify this works by setting all the weights in the input data set (first step) to be the same value -- in this case, the raw values would equal the weighted values.

[image: PROC PRINT output showing raw vs. weighted mean and SD]
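For anyone without SAS handy, here is a minimal Python transcription of the same formula for a single row (i = 1); the variable names are mine, not SAS's.

```python
import math

vals = [0.09983, 0.99875, -0.98944, 1.01999]  # x1-x4 at i = 1
wts  = [10, 25, 30, 35]

sum_w = sum(wts)
wmean = sum(v * w for v, w in zip(vals, wts)) / sum_w
n_nzw = sum(1 for w in wts if w > 0)  # number of non-zero weights
sum_d2w = sum(w * (v - wmean) ** 2 for v, w in zip(vals, wts))

# sqrt of the weighted squared deviations over the sum of the weights,
# corrected for the number of non-zero weights, as in the DATA step above.
wt_sd = math.sqrt(sum_d2w / (sum_w * (n_nzw - 1) / n_nzw))
print(round(wt_sd, 4))  # about 1.0366
```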

 

PaigeMiller
Diamond | Level 26

If you have missing values for x, I think you get the wrong answer here, whereas using PROC SUMMARY you get the right answers. PROC SUMMARY gives different weighted means and weighted standard deviations for observation 19 than your program does. So in addition to the difficulty of writing your own formulas, you don't get the benefit of the work SAS has already done (and your company or university has paid for), and you potentially get wrong answers writing your own formulas.

 

BTW, I think your weighted standard deviations are wrong even when there are no missings; they do not match PROC SUMMARY.

 

Which is why I strongly advise people against writing their own formulas when SAS has already done that for you and tested it thoroughly.

 

data _temp;
do i=1 to 20;
	x1=sin(i/10);
	x2=cos(i/20);
	x3=sin(i/100)-cos(i/30);
	x4=sin(i/50)+cos(i/200);
	wt_x1=10; 
	wt_x2=25;
	wt_x3=30;
	wt_x4=35;
    if i=19 then x4=.;
	output;
end;
run;
proc transpose data=_temp out=aaa;
    var x:;
    by i;
run;
proc transpose data=_temp out=bbb;
    var wt:;
    by i;
run;
data long;
    merge aaa(rename=(col1=x)) bbb(rename=(col1=weights));
    by i;
run;
proc summary data=long nway;
    class i;
    var x;
    weight weights;
    output out=want_weighted stddev=s mean=m;
run;
proc summary data=long nway;
    class i;
    var x;
    output out=want_unweighted mean= stddev=/autoname;
run;

 

 

 

 

--
Paige Miller
quickbluefish
Barite | Level 11

Yeah, for some reason, PROC SUMMARY is giving some very bizarre answers for weighted SD here.  They don't seem to make sense:

Raw data:

[image: PROC PRINT of the raw data]

 

output from PROC SUMMARY:

[image: PROC SUMMARY output]

 

quickbluefish
Barite | Level 11
ah, now I see what I did wrong!
quickbluefish
Barite | Level 11

Weird - I corrected what I thought the issue was, and I still get very different results from PROC SUMMARY.  If anyone is math-inclined and bored, I'd be interested to know what's wrong:

data temp2;
set _temp;
array xs {4} x1-x4;
array ws {4} wt_x1-wt_x4;
sumXW=.;  * this will hold the sum of the products of Xi and Wi ;
sumW=sum(of ws[*]);  * sum of the weights ;
n_nzw=0;  * this will hold the number of non-zero weights ;
do j=1 to dim(xs);
	sumXW+(xs[j]*ws[j]);
	n_nzw+(ws[j]>0);
end;
rawmean_x=mean(of xs[*]);	* raw mean of Xs ;
wtmean_x=sumXW/sumW;		* wt mean of Xs ;
sumD2w=.;  ** this will hold the sum of the weighted, squared differences (deviations) ;
sumWcorr=.; * this will hold the sum of the weights after correcting for the number of non-zero weights ;
do j=1 to dim(xs);
	sumD2w+((xs[j]-wtmean_x)**2*ws[j]);
	sumWcorr+(ws[j]*((n_nzw-1)/n_nzw));
end;
rawSD=std(of xs[*]);		* raw SD of Xs ;
wtSD=sqrt(sumD2w/sumWcorr);	* wt SD of Xs ;
drop j;
run;

proc print data=temp2; run;

Result:

[image: PROC PRINT of temp2 showing raw vs. weighted statistics]

 

quickbluefish
Barite | Level 11

Well, now this is just a rabbit hole.  I found several slightly different versions of the formula (population-based, sample/unbiased version, etc.), and they all produce a weighted SD (for the first row) of between 0.89 and 1.06 using the DATA step in SAS.  I also ran this in R several different ways.  Same result (using R code generated by both Google AI and ChatGPT).  SAS, on the other hand (both PROC SUMMARY and PROC MEANS) produces a wildly different number -- over 5 -- for the same values.  

data _temp;
do i=1 to 20;
	x1=sin(i/10);
	x2=cos(i/20);
	x3=sin(i/100)-cos(i/30);
	x4=sin(i/50)+cos(i/200);
	wt_x1=10; 
	wt_x2=25;
	wt_x3=30;
	wt_x4=35;
	output;
end;
run;

** transpose, and, for testing, just keep the first record ;
data t2;
set _temp (obs=1);
array xs {*} x1-x4;
array ws {*} wt_x1-wt_x4;
do i=1 to dim(xs);
	x=xs[i];
	w=ws[i];
	output;
end;
keep i x w;
run;

proc print data=t2; run;

title 'weighted SD from SAS';
proc means data=t2 std;
weight w;
var x;
run;

Result:

[image: PROC MEANS output showing a weighted SD over 5]

...same value as Paige Miller's PROC SUMMARY code produced for the first record.


In R, on the other hand:

vals <- c(0.09983, 0.99875, -0.98944, 1.01999)
wts  <- c(10, 25, 30, 35)

# weighted mean
wmean <- weighted.mean(vals, wts)

# weighted variance
wvar <- sum(wts * (vals - wmean)^2) / sum(wts)

# weighted standard deviation
wsd <- sqrt(wvar)

wsd

Result:

0.8977237

If anyone can explain this, that would be great.
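One candidate explanation, worth checking against the SAS documentation: PROC MEANS and PROC SUMMARY default to VARDEF=DF, dividing the weighted sum of squares by n - 1 observations, rather than by the sum of the weights that the R code above uses. A quick Python check reproduces both numbers under that assumption:

```python
import math

vals = [0.09983, 0.99875, -0.98944, 1.01999]
wts  = [10, 25, 30, 35]

wmean = sum(v * w for v, w in zip(vals, wts)) / sum(wts)
css = sum(w * (v - wmean) ** 2 for v, w in zip(vals, wts))

sd_weight = math.sqrt(css / sum(wts))         # R's divisor: sum of weights
sd_df = math.sqrt(css / (len(vals) - 1))      # SAS default VARDEF=DF: n - 1
print(round(sd_weight, 4), round(sd_df, 4))   # ~0.8977 vs. over 5
```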

 

hellohere
Pyrite | Level 9
Thanks to all. I really appreciate your help. Still, I need to stick to SAS SQL. I know how to do it with matrices in Matlab or R, but the reality of huge data and trying varied calculations makes it sensible to stay with SAS SQL, or else to find an SQL engine that can do this with a built-in function or the like. The data set approach is surely doable and I know how, but may I say it is not what I want.
quickbluefish
Barite | Level 11

PROC SQL is almost certainly going to be the least efficient option, both computationally and from a coding perspective.  This is especially the case since your data are "row-wise", whereas SQL is best suited for column-wise operations.  You would need to transpose.  But even so, PROC SQL is not nearly as efficient as "real" SQL engines, plus you cannot nest summary functions in PROC SQL, e.g., SUM ( ... SUM (...) ), which is going to make this equation impossible without pre-computing certain parameters and joining back to the original data.  In short, a nightmare.  It is possible there's a PROC IML (which does vectorized / matrix operations) solution if for some reason the other options here don't work for you, but really, is your input data that big?  Do you have 10 billion records??   I don't know PROC IML well enough to help on that front.    

Ksharp
Super User

I think I found the answer: you need to specify the VARDEF= option to choose the divisor.

 

data _temp;
input x1 wt_x1;
cards;
0.09983 10 
0.99875 25
-0.98944 30
1.01999 35
;

proc sql;
create table temp as
 select *, mean(x1) as mean_x1,
        sum(x1*wt_x1)/sum(wt_x1) as wt_mean_x1,
        (select count(wt_x1) from _temp where wt_x1 ne 0) as n
 from _temp;

/*select sqrt( sum(wt_x1*(x1-wt_mean_x1)**2) / sum(wt_x1*( (n-1)/n )) ) as wt_std*/
 select sqrt( sum(wt_x1*(x1-wt_mean_x1)**2) / sum(wt_x1) ) as wt_std
 from temp;
quit;

proc means data=_temp mean stddev VARDEF=WEIGHT;
var x1;
weight wt_x1;
run;

[image: PROC SQL output for wt_std]

 

[image: PROC MEANS output with VARDEF=WEIGHT]

 

Ksharp
Super User

Here is an example for X1.

But I don't know why the result is different from SAS. Maybe the formula SAS uses is different.

 

data _temp;
do i=1 to 20;
	x1=sin(i/10);
	x2=cos(i/20);
	x3=sin(i/100)-cos(i/30);
	x4=sin(i/50)+cos(i/200);
	wt_x1=10; 
	wt_x2=25;
	wt_x3=30;
	wt_x4=35;
	output;
end;
run;

proc sql;
create table temp as
select *,sum(x1*wt_x1)/sum(wt_x1) as wt_mean_x1, (select count(wt_x1) from _temp where wt_x1 ne 0) as n
 from _temp;

/*select sqrt( sum(wt_x1*(x1-wt_mean_x1)**2)  /  sum(wt_x1*( (n-1)/n   )) ) as wt_std*/
 select sqrt( sum(wt_x1*(x1-wt_mean_x1)**2)  /  sum(wt_x1) ) as wt_std
 from temp;
quit;

proc means data=_temp mean stddev ;
var x1;
weight wt_x1;
run;
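A possible reason for the difference, sketched in Python: with constant weights (all 10 here), the SQL formula above reduces to the plain population SD, while PROC MEANS's default VARDEF=DF divides the same weighted sum of squares by n - 1 observations, inflating the result by sqrt(sum(w)/(n-1)). This is my own derivation, not from the thread:

```python
import math

xs = [math.sin(i / 10) for i in range(1, 21)]  # x1 for i = 1..20
w = 10                                         # constant weight

m = sum(xs) / len(xs)
pop_sd = math.sqrt(sum((x - m) ** 2 for x in xs) / len(xs))

# Weighted SD with divisor sum(w): identical to the population SD
# when all weights are equal.
wt_sd_sumw = math.sqrt(sum(w * (x - m) ** 2 for x in xs) / (w * len(xs)))

# SAS default (VARDEF=DF): same numerator, divided by n - 1 instead,
# so it is larger by a factor of sqrt(sum(w) / (n - 1)).
wt_sd_df = math.sqrt(sum(w * (x - m) ** 2 for x in xs) / (len(xs) - 1))

print(round(wt_sd_sumw, 4), round(wt_sd_df, 4))
```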

[image: PROC SQL and PROC MEANS output]

 

hellohere
Pyrite | Level 9
Thanks. The wt_std is computed over time/i, I bet. That is why it differs.


Discussion stats
  • 20 replies
  • 770 views
  • 4 likes
  • 4 in conversation