Re: Why are medians from PROC MEANS and PROC SURVEYMEANS different?

egrieco · Posted 04-07-2022 08:23 AM

For a report I am writing using Census Bureau data (CPS), I am calculating median wages. I used PROC MEANS to pull out the medians. For the workforce 18-74 with wages in the last 12 months, PROC MEANS gives me a median of $47,000. However, I need the standard errors for statistical testing, so I ran PROC SURVEYMEANS to get those. For the same group as before, PROC SURVEYMEANS gives me a median of $46,863. Can someone explain why the medians calculated by the two PROCs differ? Note the *means* calculated by both PROCs match. I realize the values are close but I need to be able to justify/explain which median to use.

mkeintz · Posted 04-07-2022 10:05 AM

Apparently they have different "definitions" of median. Consider the below:

data have;
  do wages=47000 to 48000 by 1000; 
   do _n_=1 to 4; 
     output; 
   end;
  end;
run;

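/* Name the data set and variable explicitly rather than relying on _LAST_ */
proc means data=have median; var wages; run;
proc surveymeans data=have median; var wages; run;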

PROC MEANS reports median=47,500, a value that never occurs in the data but is the midpoint between the two equally frequent values.

PROC SURVEYMEANS reports median=47,000, a value that exists in the sample but does not have exactly 50 percent of the observations above it and 50 percent below it.

I imagine somewhere in the documentation for these procedures, you'll find their default median estimate algorithms, and possibly options to choose another algorithm.


SAS_Rob · Posted 04-07-2022 10:15 AM

The default method that PROC MEANS uses to compute the median differs from the one that PROC SURVEYMEANS uses. If you use PCTLDEF=1 in Proc MEANS then they should match.

For further details, take a look at the two sections in the documentation.

SAS Help Center: SURVEYMEANS Statistical Computations

SAS Help Center: MEANS Keywords and Formulas
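
A minimal sketch of this suggestion, run against the small HAVE data set from the earlier reply (four observations at 47000 and four at 48000). PCTLDEF= is a PROC MEANS option; the VAR statements are added here only for clarity. With n=8 and p=0.5, n*p is exactly 4, so definition 1 should return the fourth ordered value, 47000, which is the value reported above for PROC SURVEYMEANS. Agreement on one data set does not show that the two definitions agree in general.

```sas
/* Median under percentile definition 1 (weighted average at n*p) */
proc means data=have median pctldef=1;
  var wages;
run;

/* Median under the SURVEYMEANS default quantile definition */
proc surveymeans data=have median;
  var wages;
run;
```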

FreelanceReinh · Posted 04-07-2022 11:16 AM

@SAS_Rob wrote:

If you use PCTLDEF=1 in Proc MEANS then they should match.

That's what I thought too, based on my first examples. But it turned out that, in general, none of the five possible PCTLDEF= option values of PROC MEANS replicates the results of the default quantile definition used by PROC SURVEYMEANS.

Example:

data test;
do x=0, 1, 1;
  output;
end;
run;

proc surveymeans data=test median;
var x;
run;

This yields a median estimate of 0.25 (by linear interpolation between the 1/3-quantile 0 and the maximum 1), whereas PROC MEANS gives 0.5 with PCTLDEF=1 and 1 with PCTLDEF>1.
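
To see the PROC MEANS side of this comparison, one option is to loop over the five PCTLDEF= values on the same TEST data set. This is only a sketch: the macro name median_by_pctldef is made up for illustration, and the TITLE statements are there just to label each run.

```sas
/* Hypothetical helper macro: run PROC MEANS once per PCTLDEF= value */
%macro median_by_pctldef;
  %do d = 1 %to 5;
    title "PROC MEANS median, PCTLDEF=&d";
    proc means data=test median pctldef=&d;
      var x;
    run;
  %end;
  title;  /* clear the title after the last run */
%mend median_by_pctldef;

%median_by_pctldef
```

None of the five runs should reproduce the 0.25 that PROC SURVEYMEANS reports for this data set.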

ballardw · Posted 04-07-2022 10:17 AM

Code.

Please post your code. It shows which options you used, so we can address each possible cause of the difference.

Basic: if you use a weight variable, which is typical with the Survey procedures, the weights are handled quite differently. The Survey procedures are designed to incorporate sample design information, and PROC MEANS is not.

PROC MEANS also uses calculations that assume your data come from a very large population. The Survey procedures can incorporate adjustments for known population sizes.

Rick_SAS · Posted 04-09-2022 06:36 AM

> Can someone explain why the medians calculated by the two PROCs differ?

Yes. They differ because they use different estimates. There are many ways to estimate quantiles, and the median is the 0.5 quantile. You can read about 9 common definitions that are used by statisticians.

All the definitions are based on various ways to estimate the cumulative distribution of the data from the finite sample.

Proc SURVEYMEANS uses an estimate for the CDF that is different from those supported by PROC MEANS. The definition is given in the SURVEYMEANS doc.

For the sample data {0, 1, 1}, the CDF is estimated by SURVEYMEANS to be

F(t) = 0    if t < 0
F(t) = 1/3  if 0 <= t < 1
F(t) = 1    if t >= 1

Let p=0.5 and use this CDF to estimate Q(0.5). The distinct sample values are y_1=0 and y_2=1. Choose k=1 because F(y_1) <= p < F(y_2), that is, 1/3 <= 0.5 < 1. Then the estimate is

Q(0.5) = 0 + (0.5 - 1/3)/(1 - 1/3) * (1 - 0)

= 1/4
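
The calculation above is one instance of the interpolation rule as I read it in the SURVEYMEANS documentation. The notation below is assumed for illustration: y_(1) < ... < y_(d) are the distinct sample values in increasing order and F-hat is the estimated CDF (weighted, if weights are present).

```latex
\hat{Q}(p) =
\begin{cases}
y_{(1)} & \text{if } p < \hat{F}(y_{(1)}) \\[4pt]
y_{(k)} + \dfrac{p - \hat{F}(y_{(k)})}{\hat{F}(y_{(k+1)}) - \hat{F}(y_{(k)})}
          \,\bigl(y_{(k+1)} - y_{(k)}\bigr)
        & \text{if } \hat{F}(y_{(k)}) \le p < \hat{F}(y_{(k+1)}) \\[4pt]
y_{(d)} & \text{if } p = 1
\end{cases}
```

The same rule explains the earlier 47000/48000 example: there F(47000) = 0.5 = p, so the interpolation term is zero and the estimate is 47000.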

> I need to be able to justify/explain which median to use.

If you are using survey data, use the definition in PROC SURVEYMEANS because it correctly accounts for survey weights, strata, and clusters in the survey design.
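
A rough sketch of what that might look like. Every data set and variable name here (cps_wages, wages, person_wt, stratum_id, psu_id) is a placeholder, not an actual CPS name, and the STRATA and CLUSTER statements apply only if your file supplies those design variables. Check the survey's technical documentation for the recommended variance method.

```sas
/* All names below are hypothetical placeholders */
proc surveymeans data=cps_wages median;
  strata  stratum_id;   /* design strata, if available       */
  cluster psu_id;       /* primary sampling units, if available */
  weight  person_wt;    /* person-level survey weight         */
  var     wages;
run;
```

The quantile output from this step should include a standard error alongside the median estimate, which is what the statistical testing in the original question needs.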

Rick_SAS · Posted 04-14-2022 09:51 AM

Do you have further topics? If not, please close this thread by indicating the reply that answered your question.
