Solved: Re: Weighted v Unweighted data in proc means and proc univariate

overmar · Posted 08-25-2016 05:11 PM

I have been following along with the weighted percentiles thread posted by Lida (https://communities.sas.com/t5/SAS-Procedures/weighted-percentiles/m-p/294127#U294127), and it got me thinking about the differences between weighted and unweighted statistics, so I put together a little experiment.

I created one dataset in which the observations carry weights and a second dataset in which the same observations are laid out as individual records, to see whether the results differ, and in particular whether weights can affect the percentiles (which was Lida's question in the first place). Here is what I found.

This should create two datasets, LFC and LFC2, which are identical once the weights are applied.

DATA lfc;
   /* :$9. keeps all nine characters of "Liverpool" (plain $ stops at 8) */
   INPUT club :$9. goalspergame numgames;
   DATALINES;
Liverpool 5 3
Liverpool 6 5
Liverpool 2 3
Liverpool 0 1
Liverpool 1 4
Liverpool 3 2
;

data lfc2;
   /* one row per game: each goalspergame value repeated numgames times */
   INPUT club :$9. goalspergame;
   DATALINES;
Liverpool 5
Liverpool 5
Liverpool 5
Liverpool 6
Liverpool 6
Liverpool 6
Liverpool 6
Liverpool 6
Liverpool 2
Liverpool 2
Liverpool 2
Liverpool 0
Liverpool 1
Liverpool 1
Liverpool 1
Liverpool 1
Liverpool 3
Liverpool 3
;

First, to the question about percentiles and whether they can in fact be altered by weights.

proc means data=lfc min max mean stddev q1 median q3 p10;
	var goalspergame;
run;


proc means data=lfc min max mean stddev q1 median q3 p10;
	var goalspergame;
	weight numgames;
run;

This demonstrates that weights can in fact alter the percentiles: p10 is 0 in the unweighted analysis and 1 in the weighted one (the median and upper quartile change as well).

SAS Output

Unweighted

The MEANS Procedure

Analysis Variable : goalspergame
Minimum	Maximum	Mean	Std Dev	Lower Quartile	Median	Upper Quartile	10th Pctl
0	6.0000000	2.8333333	2.3166067	1.0000000	2.5000000	5.0000000	0

Weighted

The MEANS Procedure

Analysis Variable : goalspergame
Minimum	Maximum	Mean	Std Dev	Lower Quartile	Median	Upper Quartile	10th Pctl
0	6.0000000	3.3888889	4.0565448	1.0000000	3.0000000	6.0000000	1.0000000

But say you didn't want to use the weighted dataset and you felt like typing out each line, i.e., LFC2. Would you get the same results as with the weighted dataset? Sort of...


proc means data=lfc min max mean stddev q1 median q3 p10;
	var goalspergame;
	weight numgames;
	title "Weighted";
run;

proc means data=lfc2 min max mean stddev q1 median q3 p10;
	var goalspergame;
	title "individual rows";
run;

SAS Output

Weighted

The MEANS Procedure

Analysis Variable : goalspergame
Minimum	Maximum	Mean	Std Dev	Lower Quartile	Median	Upper Quartile	10th Pctl
0	6.0000000	3.3888889	4.0565448	1.0000000	3.0000000	6.0000000	1.0000000

Individual Lines

The MEANS Procedure

Analysis Variable : goalspergame
Minimum	Maximum	Mean	Std Dev	Lower Quartile	Median	Upper Quartile	10th Pctl
0	6.0000000	3.3888889	2.1999703	1.0000000	3.0000000	6.0000000	1.0000000

All of the statistics are the same except for the Std Dev. In proc univariate, the variance and kurtosis also differ between the weighted data and the data written out row by row. So my question is: why? This is PC-SAS 9.4.

Rick_SAS · Posted 08-25-2016 08:39 PM

You are confusing weights and frequencies. For the formulas for weighted descriptive statistics, see the doc for MEAN and StdDev. For weighted percentiles, see the doc for percentiles.

A positive integer frequency is easy to understand: all statistics will be the same as if you replicate each record 'freq[i]' number of times. A frequency variable is just a notational convenience to save you from typing duplicate records.
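As a minimal sketch of that equivalence (the dataset name lfc_expanded and the loop counter i are illustrative, not from the thread), a DATA step can write each record numgames times. Running PROC MEANS on the expanded data should then agree with running it on the compact data with a FREQ statement:

```sas
/* compact data from the original question */
data lfc;
   input club :$9. goalspergame numgames;
   datalines;
Liverpool 5 3
Liverpool 6 5
Liverpool 2 3
Liverpool 0 1
Liverpool 1 4
Liverpool 3 2
;

/* hypothetical helper dataset: write each record numgames times */
data lfc_expanded;
   set lfc;
   do i = 1 to numgames;
      output;
   end;
   drop i numgames;
run;

/* one row per game, no FREQ statement */
proc means data=lfc_expanded n mean stddev;
   var goalspergame;
run;

/* compact data, numgames treated as a frequency */
proc means data=lfc n mean stddev;
   var goalspergame;
   freq numgames;
run;
```

Both steps should report N = 18 and the same Std Dev (2.1999703) that the original poster obtained from the individual-rows dataset.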

A weight changes statistics by giving more weight to some observations than others. I have written about the weighted mean in SAS. I have also shown that frequencies and weights give different results for regression analyses.

To compute percentiles, you order the data and create the empirical cumulative distribution (ECDF). Say your data are

1,2,3,4,5,6

In an unweighted analysis, each datum accounts for 1/n = 1/6 of the sample, so the ECDF is a piecewise constant function that increases by 1/6 at each datum. To find a quantile, you start at the desired probability on the Y axis and move across until you hit a "vertical line" of the ECDF, which gives you a datum value (for the default PCTLDEF=5 method). (I am omitting details, such as the fact that the median for an even number of points is the average of two data values.) The computation follows:

/* unweighted percentiles */
data a;
input x @@;
datalines;
1 2 3 4 5 6
;
/* plot the empirical CDF */
proc univariate data=a;
cdfplot x;
run;
proc means data=a p20 p40 p60 p80;
var x;
run;

In a weighted analysis, the ECDF does not increase by 1/n at each datum. Instead, the ECDF increases by w[i]/W, where W = sum_i w[i] is the sum of the weights. (This is equivalent to choosing weights that sum to 1, which would represent the proportion of weight for each datum.) Thus the unweighted percentile for p is not necessarily equal to the weighted percentile for p. However, the extreme values (min = 0th percentile; max = 100th percentile) remain unchanged because they are the endpoints of the weighted ECDF, which still increases from 0 to 1.

For example, suppose you choose weights

x: 1, 2, 3, 4, 5, 6

wt: 0.3, 0.2, 0.05, 0.1, 0.3, 0.05

Then the following computation finds the weighted percentiles:

data Wt;
input x wt;
datalines;
1 0.3
2 0.2
3 0.05
4 0.1
5 0.3
6 0.05
;
proc means data=Wt p20 p40 p60 p80;
var x;
weight wt;
run;

The corresponding weighted ECDF is a step function that jumps by wt at each x (figure not shown).

To find the 20th weighted pctl, start at 0.2 on the Y axis. Go over until you hit the graph, then go down to get x=1.

To find the 40th weighted pctl, start at 0.4 on the Y axis. Go over until you hit the graph, then go down to get x=2.

In a similar way, the 60th weighted pctl is x=4 and the 80th weighted pctl is x=5.
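The heights used in that walk-through can be tabulated with a short DATA step. This is only a sketch: the macro variable W and the names WtECDF, cumwt, and ecdf are illustrative, and it assumes the data are already sorted by x.

```sas
data Wt;
   input x wt;
   datalines;
1 0.3
2 0.2
3 0.05
4 0.1
5 0.3
6 0.05
;

/* total weight W, stored in a macro variable (illustrative name) */
proc sql noprint;
   select sum(wt) into :W from Wt;
quit;

/* running sum of the weights divided by W = weighted ECDF at each x */
data WtECDF;
   set Wt;               /* assumes Wt is sorted by x */
   cumwt + wt;           /* sum statement: cumwt is retained across rows */
   ecdf = cumwt / &W;
run;

proc print data=WtECDF noobs;
   var x wt ecdf;
run;
```

The ecdf column should come out as roughly 0.3, 0.5, 0.55, 0.65, 0.95, and 1. Each percentile quoted above is the first x whose ecdf reaches the target: 0.2 is reached at x=1, 0.4 at x=2, 0.6 at x=4, and 0.8 at x=5.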

Reeza · Posted 08-25-2016 05:14 PM

I think that's stated in the documentation.

If you need standard deviation use proc surveymeans?

overmar · Posted 08-25-2016 05:18 PM

I'm not disputing that the documentation may state it; however, proc surveymeans can't be the solution.

SAS Output

Statistics
Variable	Minimum	Maximum	Mean	Std Error of Mean	Std Dev
goalspergame	0	6.000000	3.388889	1.079382	26.672083

Reeza · Posted 08-25-2016 05:34 PM

So, this is a question that's come up before and will again and I know there's an answer that makes sense, but can't recall right now.

I'll move this post to the Statistical Procedures Forum and see if @Rick_SAS can contribute 🙂

ballardw · Posted 08-25-2016 07:00 PM

And since your "weight" variable is actually a count then the results of:

proc means data=lfc min max mean stddev q1 median q3 p10;
	var goalspergame;
	freq numgames;
run;

Pretty much match exactly the results from

proc means data=lfc2 min max mean stddev q1 median q3 p10;
	var goalspergame;
run;
title;

The NUMBER of observations is very important when looking at standard deviation. WEIGHT does not affect the N used in the calculation; FREQ does.
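One way to see the role of the divisor, offered as a sketch rather than a recommendation: PROC MEANS has a VARDEF= option that selects the variance divisor. With a WEIGHT statement, the default VARDEF=DF divides the weighted sum of squares by n - 1 (here 5, since there are six rows), while VARDEF=WDF divides by the sum of the weights minus one (here 17).

```sas
data lfc;
   input club :$9. goalspergame numgames;
   datalines;
Liverpool 5 3
Liverpool 6 5
Liverpool 2 3
Liverpool 0 1
Liverpool 1 4
Liverpool 3 2
;

/* default VARDEF=DF: divisor is n - 1 = 5; the weights do not change n */
proc means data=lfc n sumwgt mean stddev;
   var goalspergame;
   weight numgames;
run;

/* VARDEF=WDF: divisor is (sum of weights) - 1 = 17 */
proc means data=lfc n sumwgt mean stddev vardef=wdf;
   var goalspergame;
   weight numgames;
run;
```

The first step should reproduce the weighted Std Dev of 4.0565448 from the original question. The second should give 2.1999703, the same value as the individual-rows and FREQ analyses, because for integer counts the sum of the weights minus one equals the replicated N minus one.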

overmar · Posted 08-26-2016 11:29 AM

Well that makes sense, thank you for clarifying.