egrieco
Calcite | Level 5

For a report I am writing using Census Bureau data (CPS), I am calculating median wages. I used PROC MEANS to pull out the medians. For the workforce 18-74 with wages in the last 12 months, PROC MEANS gives me a median of $47,000. However, I need the standard errors for statistical testing, so I ran PROC SURVEYMEANS to get those. For the same group as before, PROC SURVEYMEANS gives me a median of $46,863. Can someone explain why the medians calculated by the two PROCs differ? Note the *means* calculated by both PROCs match. I realize the values are close but I need to be able to justify/explain which median to use.

 

 

mkeintz
PROC Star

Apparently they have different "definitions" of median.  Consider the below:

 

data have;
  do wages=47000 to 48000 by 1000; 
   do _n_=1 to 4; 
     output; 
   end;
  end;
run;

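/* Name the data set and analysis variable explicitly */
proc means data=have median;
  var wages;
run;

proc surveymeans data=have median;
  var wages;
run;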

Proc means reports median=47,500 - a value that never occurred, but is the midpoint between the two equally frequent values.

 

Proc surveymeans reports median=47,000, a value that exists in the sample but does not split it into 50 percent lower and 50 percent higher.

 

I imagine somewhere in the documentation for these procedures, you'll find their default median estimate algorithms, and possibly options to choose another algorithm.

SAS_Rob
SAS Employee

The default method that Proc MEANS uses to compute the median is different than Proc SURVEYMEANS.  If you use PCTLDEF=1 in Proc MEANS then they should match.  

For further details, take a look at the two sections in the documentation.

SAS Help Center: SURVEYMEANS Statistical Computations

SAS Help Center: MEANS Keywords and Formulas

 

FreelanceReinh
Jade | Level 19

@SAS_Rob wrote:

If you use PCTLDEF=1 in Proc MEANS then they should match.


That's what I thought too, based on my first examples. But it turned out that, in general, none of the five possible PCTLDEF= option values of PROC MEANS replicates the results of the default quantile definition used by PROC SURVEYMEANS.

 

Example:

data test;
do x=0, 1, 1;
  output;
end;
run;

proc surveymeans data=test median;
var x;
run;

This yields a median estimate of 0.25 (by linear interpolation between the 1/3-quantile 0 and the maximum 1), whereas PROC MEANS gives 0.5 with PCTLDEF=1 and 1 with PCTLDEF>1.
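To check this claim yourself, here is a minimal sketch that runs PROC MEANS on the same TEST data set once for each of the five PCTLDEF= values. The macro name try_pctldef is just a hypothetical name chosen for illustration. Going by the results reported above, PCTLDEF=1 should print 0.5 and the other four definitions should print 1, so none of them reproduces the 0.25 from PROC SURVEYMEANS.

```sas
/* try_pctldef is a hypothetical helper name, not a SAS-supplied macro. */
/* It assumes the TEST data set created above already exists.           */
%macro try_pctldef;
  %do d = 1 %to 5;
    title "PROC MEANS median with PCTLDEF=&d";
    proc means data=test median pctldef=&d;
      var x;
    run;
  %end;
  title;  /* clear the title afterwards */
%mend try_pctldef;

%try_pctldef
```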

ballardw
Super User

Please post your code.

The code shows which options you used, so we can address each possible cause of the difference.

 

Basic: if you use a weight variable, which is typical with the Survey procedures, the two procedures handle the weights quite differently. The Survey procedures are designed to incorporate sample design information; PROC MEANS is not.

 

Means also uses calculations that assume your data comes from a very large population. The Survey procedures can incorporate adjustments for known population sizes.

 

 

 

Rick_SAS
SAS Super FREQ

> Can someone explain why the medians calculated by the two PROCs differ? 

Yes. They differ because they use different estimates. There are many ways to estimate quantiles, and the median is the 0.5 quantile. You can read about 9 common definitions that are used by statisticians.

 

All the definitions are based on various ways to estimate the cumulative distribution of the data from the finite sample. 

 

Proc SURVEYMEANS uses an estimate for the CDF that is different from those supported by PROC MEANS. The definition is given in the SURVEYMEANS doc. 

 

For the sample data {0, 1, 1}, the CDF is estimated by SURVEYMEANS to be

F(t) = {   0 if t < 0
       { 1/3 if 0 <= t < 1
       {   1 if t >= 1

 

Let p=0.5 and use this CDF to estimate Q(0.5). For this sample, the distinct values are y_1=0 and y_2=1. Choose k=1 because F(y_1) <= p < F(y_2), that is, 1/3 <= 0.5 < 1. Then the estimate is

Q(0.5) = 0 + (0.5 - 1/3)/(1 - 1/3) * (1 - 0)

   = 1/4
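The arithmetic above can be reproduced in a short DATA step. This is only a sketch for the {0, 1, 1} example, with the CDF values typed in by hand; the variable names (p, y1, y2, F1, F2, q) are chosen here for illustration and it is not a general implementation of the SURVEYMEANS quantile method.

```sas
/* Linear interpolation between the two distinct values of {0, 1, 1}. */
/* CDF values are entered by hand for this example only.              */
data _null_;
  p  = 0.5;            /* quantile level for the median               */
  y1 = 0;  F1 = 1/3;   /* F(y_1): one of three observations is <= 0   */
  y2 = 1;  F2 = 1;     /* F(y_2): all three observations are <= 1     */
  q  = y1 + (p - F1) / (F2 - F1) * (y2 - y1);
  put q=;              /* writes the interpolated estimate to the log */
run;
```

The log should show q=0.25, matching the PROC SURVEYMEANS result reported earlier in the thread.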

 

>  I need to be able to justify/explain which median to use.

If you are using survey data, use the definition in PROC SURVEYMEANS because it correctly accounts for survey weights, strata, and clusters in the survey design.
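As a rough sketch of what that looks like in code, the call below passes weights, strata, and clusters to PROC SURVEYMEANS. Every name in it is an assumption made for illustration: the data set cps and the variables wage, wgt, stratum, and psu are hypothetical, and your CPS extract may provide different design information (for example, replicate weights instead of strata and clusters), in which case the statements would need to change.

```sas
/* Hypothetical names: cps, wage, wgt, stratum, psu.            */
/* Replace them with the design variables in your own extract.  */
proc surveymeans data=cps mean median;
  strata  stratum;   /* sampling strata              */
  cluster psu;       /* primary sampling units       */
  weight  wgt;       /* survey weight                */
  var     wage;      /* analysis variable            */
run;
```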

Rick_SAS
SAS Super FREQ

Do you have further topics? If not, please close this thread by indicating the reply that answered your question.
