overmar
Obsidian | Level 7

I have been answering in the weighted percentiles thread posted by Lida (https://communities.sas.com/t5/SAS-Procedures/weighted-percentiles/m-p/294127#U294127), and it got me thinking about the differences between weighted and unweighted statistics, so I put together a little experiment.

 

I created one dataset that holds the observations in weighted form and a second dataset that lays the same observations out as individual records, to determine whether there was a difference, and particularly whether weighting could affect the percentiles (which was Lida's question in the first place). Here is what I found.

The following code creates two datasets, LFC and LFC2, which are identical once the weights are applied.

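/* One row per distinct score; numgames is the number of games with that score */
DATA lfc;
   LENGTH club $ 9;   /* the default length of 8 would truncate "Liverpool" */
   INPUT club $ goalspergame numgames;
   DATALINES;
Liverpool 5 3
Liverpool 6 5
Liverpool 2 3
Liverpool 0 1
Liverpool 1 4
Liverpool 3 2
;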

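/* The same observations written out as one row per game */
DATA lfc2;
   LENGTH club $ 9;   /* the default length of 8 would truncate "Liverpool" */
   INPUT club $ goalspergame;
   DATALINES;
Liverpool 5
Liverpool 5
Liverpool 5
Liverpool 6
Liverpool 6
Liverpool 6
Liverpool 6
Liverpool 6
Liverpool 2
Liverpool 2
Liverpool 2
Liverpool 0
Liverpool 1
Liverpool 1
Liverpool 1
Liverpool 1
Liverpool 3
Liverpool 3
;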

First, to the question of whether percentiles can in fact be altered by weights.

 

proc means data=lfc min max mean stddev q1 median q3 p10;
	var goalspergame;
run;


proc means data=lfc min max mean stddev q1 median q3 p10;
	var goalspergame;
	weight numgames;
run;

This demonstrates that you can in fact alter the percentiles with weights: P10 is 0 in the unweighted analysis and 1 in the weighted analysis.

 

 

SAS Output

Unweighted

The MEANS Procedure

Analysis Variable : goalspergame

Minimum   Maximum     Mean        Std Dev     Lower Quartile   Median      Upper Quartile   10th Pctl
0         6.0000000   2.8333333   2.3166067   1.0000000        2.5000000   5.0000000        0

Weighted

The MEANS Procedure

Analysis Variable : goalspergame

Minimum   Maximum     Mean        Std Dev     Lower Quartile   Median      Upper Quartile   10th Pctl
0         6.0000000   3.3888889   4.0565448   1.0000000        3.0000000   6.0000000        1.0000000

 

But say you didn't want to use the weighted dataset and felt like typing out each line instead (i.e., LFC2). Would you get the same results as with the weighted dataset? Sort of...

 

 


proc means data=lfc min max mean stddev q1 median q3 p10;
	var goalspergame;
	weight numgames;
	title "Weighted";
run;

proc means data=lfc2 min max mean stddev q1 median q3 p10;
	var goalspergame;
	title "individual rows";
run;

 

 

SAS Output

Weighted

The MEANS Procedure

Analysis Variable : goalspergame

Minimum   Maximum     Mean        Std Dev     Lower Quartile   Median      Upper Quartile   10th Pctl
0         6.0000000   3.3888889   4.0565448   1.0000000        3.0000000   6.0000000        1.0000000

Individual Lines

The MEANS Procedure

Analysis Variable : goalspergame

Minimum   Maximum     Mean        Std Dev     Lower Quartile   Median      Upper Quartile   10th Pctl
0         6.0000000   3.3888889   2.1999703   1.0000000        3.0000000   6.0000000        1.0000000

 

If you look, all of the statistics are the same except for the Std Dev. In PROC UNIVARIATE, the variance and kurtosis also differ between the weighted data and the data that is written out. So my question is: why? This is PC SAS 9.4.

Accepted Solutions
Rick_SAS
SAS Super FREQ

You are confusing weights and frequencies. For the formulas for weighted descriptive statistics, see the doc for MEAN and StdDev. For weighted percentiles, see the doc for percentiles.

 

A positive integer frequency is easy to understand: all statistics will be the same as if you replicate each record 'freq[i]' number of times.  A frequency variable is just a notational convenience to save you from typing duplicate records.
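
As a minimal sketch of that equivalence (an editorial illustration, not part of the original reply), the DATA step below writes each LFC record NUMGAMES times and then compares a FREQ analysis of the compact data with a plain analysis of the expanded data. The dataset name lfc_expanded and the loop counter i are made up for illustration, and the LENGTH statement is an addition so that "Liverpool" is not truncated to 8 characters.

```sas
/* The LFC data from the question: one row per distinct score, plus a count */
data lfc;
   length club $ 9;
   input club $ goalspergame numgames;
   datalines;
Liverpool 5 3
Liverpool 6 5
Liverpool 2 3
Liverpool 0 1
Liverpool 1 4
Liverpool 3 2
;

/* Write each record NUMGAMES times (lfc_expanded is a hypothetical name) */
data lfc_expanded;
   set lfc;
   do i = 1 to numgames;
      output;
   end;
   drop i numgames;
run;

/* FREQ on the compact data ... */
proc means data=lfc n mean stddev q1 median q3 p10;
   var goalspergame;
   freq numgames;
   title "FREQ on compact data";
run;

/* ... compared with a plain analysis of the expanded data */
proc means data=lfc_expanded n mean stddev q1 median q3 p10;
   var goalspergame;
   title "Expanded data, no FREQ";
run;
title;
```

Both steps should report N = 18 and the same mean, standard deviation, and percentiles as the "individual rows" output in the question.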

 

A weight changes statistics by giving more weight to some observations than others. I have written about the weighted mean in SAS.  I have also shown that frequencies and weights give different results for regression analyses.
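
To see how the WEIGHT statement interacts with the variance divisor, here is a hedged sketch using the LFC data from the question. It relies on the VARDEF= option of PROC MEANS: the default, VARDEF=DF, divides the weighted sum of squares by n - 1, where n is the number of rows, while VARDEF=WDF divides by the sum of the weights minus one. The titles are illustrative only.

```sas
/* The LFC data from the question */
data lfc;
   length club $ 9;
   input club $ goalspergame numgames;
   datalines;
Liverpool 5 3
Liverpool 6 5
Liverpool 2 3
Liverpool 0 1
Liverpool 1 4
Liverpool 3 2
;

/* Default VARDEF=DF: the divisor is n - 1 = 5 (n counts rows, not weights) */
proc means data=lfc n sumwgt mean stddev;
   var goalspergame;
   weight numgames;
   title "WEIGHT, VARDEF=DF (default)";
run;

/* VARDEF=WDF: the divisor is (sum of weights) - 1 = 17 */
proc means data=lfc vardef=wdf n sumwgt mean stddev;
   var goalspergame;
   weight numgames;
   title "WEIGHT, VARDEF=WDF";
run;
title;
```

Because the weights here are integer counts, the VARDEF=WDF run should give a standard deviation of about 2.20, the same as the written-out data, while the default run gives about 4.06. If the variable really is a count, the FREQ statement is the more direct way to say so.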

 

To compute percentiles, you order the data and create the empirical cumulative distribution (ECDF). Say your data are

1,2,3,4,5,6

 

In an unweighted analysis, each datum accounts for 1/n = 1/6 of the sample, so the ECDF is a piecewise constant function that increases by 1/6 at each datum. To find a quantile, you start at that probability on the Y axis and move across until you hit a "vertical line" of the ECDF, which gives you a datum value (for the default percentile definition, which is PCTLDEF=5 in PROC UNIVARIATE and QNTLDEF=5 in PROC MEANS). (I am omitting details, such as the fact that the median for an even number of points is the average of two data values.) The computation follows:

/* unweighted percentiles */
data a;
input x @@;
datalines;
1 2 3 4 5 6
;
proc univariate data=a;
cdfplot x;   /* plot the empirical CDF */
run;
proc means data=a p20 p40 p60 p80;
var x;
run;

[Figure: CDFPlot3.png, ECDF of the unweighted data]

 

 

In a weighted analysis, the ECDF does not increase by 1/n at each datum. Instead, the ECDF increases by w[i]/W, where W=sum_i w[i] is the sum of the weights. (This is equivalent to choosing weights that sum to 1, which would represent the proportion of weight for each datum.) Thus the unweighted percentile for p is not necessarily equal to the weighted percentile for p. However, the extreme values (min=0th percentile; max=100th percentile) remain unchanged because they are the endpoints of the weighted ECDF, which still increases from 0 to 1.
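
The description above can be written out as follows. This is a simplified sketch: the symbols F_w and q_p are notation introduced here for illustration, and the last line ignores the averaging rule that applies when p falls exactly on a step of the ECDF.

```latex
\begin{align*}
W &= \sum_{i=1}^{n} w_i, \qquad
F_w(x) = \frac{1}{W} \sum_{i:\, x_i \le x} w_i \\
w_i = 1 \text{ for all } i \;&\implies\; F_w(x) = \frac{\#\{\, i : x_i \le x \,\}}{n} \\
q_p &= \min\{\, x_i : F_w(x_i) \ge p \,\}
\end{align*}
```

The second line shows that the unweighted ECDF is the special case in which every weight is 1, so each datum contributes 1/n.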

 

For example, suppose you choose weights 

x:   1     2     3     4     5     6
wt:  0.3   0.2   0.05  0.1   0.3   0.05

Then the following computation finds the weighted percentiles:

data Wt;
input x wt;
datalines;
1 0.3
2 0.2
3 0.05
4 0.1
5 0.3
6 0.05
;
proc means data=Wt p20 p40 p60 p80;
var x;
weight wt;
run;

The corresponding weighted ECDF looks like this:

[Figure: CDFPlot4.png, weighted ECDF]

To find the 20th weighted pctl, start at 0.2 on the Y axis. Go over until you hit the graph, then go down to get x=1.

To find the 40th weighted pctl, start at 0.4 on the Y axis. Go over until you hit the graph, then go down to get x=2.

In a similar way, the 60th weighted pctl is x=4 and the 80th weighted pctl is x=5.
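
The graphical lookup above can also be checked by tabulating the weighted ECDF directly. The following is a minimal sketch, not part of the original reply: the dataset name WtECDF, the variables cumWt and ecdf, and the macro variable W are names chosen here for illustration.

```sas
/* The weighted data from the example above */
data Wt;
   input x wt;
   datalines;
1 0.3
2 0.2
3 0.05
4 0.1
5 0.3
6 0.05
;

/* The ECDF is built over the ordered data */
proc sort data=Wt;
   by x;
run;

/* Total weight W, stored in a macro variable */
proc sql noprint;
   select sum(wt) into :W from Wt;
quit;

/* Running sum of the weights divided by W gives the ECDF height at each x */
data WtECDF;
   set Wt;
   cumWt + wt;
   ecdf = cumWt / &W;
run;

proc print data=WtECDF noobs;
   var x wt cumWt ecdf;
run;
```

The ecdf column should step through 0.3, 0.5, 0.55, 0.65, 0.95, and 1. Reading off the first x whose ecdf value reaches 0.2, 0.4, 0.6, and 0.8 gives x = 1, 2, 4, and 5, which agrees with the percentiles described above.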

 

 

Reeza
Super User

I think that's stated in the documentation. 

 

If you need standard deviation use proc surveymeans?

 

 

overmar
Obsidian | Level 7

I'm not disagreeing that the procedure manual may state it; however, PROC SURVEYMEANS can't be the solution.

 

SAS Output

Statistics

Variable       Minimum   Maximum    Mean       Std Error of Mean   Std Dev
goalspergame   0         6.000000   3.388889   1.079382            26.672083

Reeza
Super User

So, this is a question that's come up before and will again and I know there's an answer that makes sense, but can't recall right now. 

 

I'll move this post to the Statistical Procedures Forum and see if @Rick_SAS can contribute 🙂

ballardw
Super User

And since your "weight" variable is actually a count then the results of:

proc means data=lfc min max mean stddev q1 median q3 p10;
	var goalspergame;
	freq numgames;
run;

pretty much exactly match the results from

proc means data=lfc2 min max mean stddev q1 median q3 p10;
	var goalspergame;
run;
title;

The NUMBER of observations is very important when looking at standard deviation. WEIGHT does not affect the N used in the calculation; FREQ does.
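
As a worked check of that statement against the outputs posted in the question (a sketch, assuming the default VARDEF=DF divisor in PROC MEANS): both analyses share the same weighted mean and the same weighted sum of squares, and only the divisor differs. With WEIGHT the divisor uses n = 6 rows; with FREQ it uses the 18 games.

```latex
\begin{align*}
\bar{x}_w &= \frac{\sum_i w_i x_i}{\sum_i w_i} = \frac{61}{18} \approx 3.3889 \\
SS &= \sum_i w_i (x_i - \bar{x}_w)^2 = 289 - \frac{61^2}{18} \approx 82.2778 \\
s_{\mathrm{WEIGHT}} &= \sqrt{\frac{SS}{n - 1}} = \sqrt{\frac{82.2778}{5}} \approx 4.0565 \\
s_{\mathrm{FREQ}} &= \sqrt{\frac{SS}{\sum_i f_i - 1}} = \sqrt{\frac{82.2778}{17}} \approx 2.2000
\end{align*}
```

These values match the Std Dev of 4.0565448 reported for the weighted run and 2.1999703 reported for the individual rows.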

 


overmar
Obsidian | Level 7
Well that makes sense, thank you for clarifying.
