LSMEANS oddity

plf515 — Tue, 28 Apr 2009 01:27:41 GMT

I ran into this oddity at work. I can't show the real data, but I made some up (see below).
In this example the differences are quite small, but in the data at work they were not so small.

Suppose you have a data set with a dependent variable and some categorical variables. Some of these
are coded 0-1, some have several levels. I was under the impression that with 0-1 variables, it did
not matter whether you included them on the CLASS statement, and, indeed, the parameter estimates are
identical for the two versions below. But LSMEANS are not identical, even for the variable with more
than 2 levels, which is always on the CLASS statement.

So ...

proc format;
value racfmt 1 = 'Black'
2 = 'White'
3 = 'Latino';
run;

data today;
input catv1 catv2 catv3 @@;
dv = catv1 * 3 + catv2 * 5 + catv3 + rannor(123);
format catv3 racfmt.;
datalines;
0 1 2 0 1 1 0 1 3 1 0 1 0 0 3 0 1 3 0 1 2 1 0 3 1 0 1 0 1 2 0 0 3
1 1 2 1 1 1 0 1 3 1 1 1 1 0 3 0 1 3 1 1 2 1 0 3 1 0 1 0 1 2 1 1 3
0 0 2 1 0 1 1 1 3 0 0 1 0 0 3 0 0 3 0 0 2 1 0 3 1 0 1 0 1 2 1 1 3
;
run;

title 'Version with all on CLASS statement';
proc glm data = today;
class catv1 catv2 catv3;
model dv = catv1 catv2 catv3;
lsmeans catv1 catv2 catv3;
run;

title 'Version with only race on CLASS statement';
proc glm data = today;
class catv3;
model dv = catv1 catv2 catv3;
lsmeans catv3;
run;

and there are differences ....

I understand the models are parameterized a bit differently, with different intercepts, but
shouldn't LSMEANS be the same? And which are 'correct'?

Re: LSMEANS oddity

SteveDenham — Tue, 28 Apr 2009 12:05:32 GMT

Hey Peter,

Look at this (code added to your original):

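/* Means of the 0-1 variables: 16/33 for catv1 and 17/33 for catv2 in this data. */
/* These are the values used in the last AT= run below.                          */
proc means data=today;
var dv catv1 catv2 catv3;
run;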

title 'Version with only race on CLASS statement, with at=0.5';
title2 'Results same as all on CLASS statement';
proc glm data = today;
class catv3;
model dv = catv1 catv2 catv3;
lsmeans catv3/at (catv1 catv2)=(0.5 0.5);
run;

title 'Version with only race on CLASS statement, with at=means';
title2 'Results same as only race on CLASS statement';
proc glm data = today;
class catv3;
model dv = catv1 catv2 catv3;
/* 16/33 and 17/33 are the sample means of catv1 and catv2 */
lsmeans catv3 / at (catv1 catv2)=(0.48484848485 0.5151515151515);
run;

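So the difference is not in the fitted model but in where the LSMEANS are evaluated. The first version calculates LSMEANS with equal weight on each level of the CLASS variables (which was the whole point of Searle, Speed and Milliken (1980), I think), so catv1 and catv2 are effectively set to 0.5. The second treats catv1 and catv2 as continuous covariates and calculates LSMEANS at their mean values (16/33 and 17/33 here).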
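
The arithmetic can be sketched as follows. The notation is an assumption of this sketch, not something taken from the SAS output: write the fitted model with an intercept \mu, slopes \beta_1 and \beta_2 for catv1 and catv2 treated as 0-1 numbers, and a race effect \tau_k. The means 16/33 and 17/33 come from the 33 made-up observations.

```latex
\begin{align*}
\mathrm{LSM}_k^{\text{CLASS}}     &= \mu + \tfrac{1}{2}\beta_1 + \tfrac{1}{2}\beta_2 + \tau_k \\
\mathrm{LSM}_k^{\text{covariate}} &= \mu + \tfrac{16}{33}\beta_1 + \tfrac{17}{33}\beta_2 + \tau_k \\
\mathrm{LSM}_k^{\text{CLASS}} - \mathrm{LSM}_k^{\text{covariate}}
  &= \left(\tfrac{1}{2} - \tfrac{16}{33}\right)\beta_1
   + \left(\tfrac{1}{2} - \tfrac{17}{33}\right)\beta_2
   = \frac{\beta_1 - \beta_2}{66}
\end{align*}
```

Under these assumptions every race level shifts by the same amount. If the estimated slopes are near the simulated values 3 and 5, that shift is roughly -2/66, or about -0.03, which would fit the small differences seen in the made-up data.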

I would bet that the large differences in your real data arise from substantial differences in class size.
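
One way to check the weighting explanation, as a sketch rather than a definitive recipe: keep all three variables on the CLASS statement but request observed-margin weights with the OM (OBSMARGINS) option on LSMEANS. Assuming the option behaves as documented for PROC GLM, the catv3 LSMEANS should then line up with the race-only version, because the levels of catv1 and catv2 are weighted by their observed frequencies instead of equally. The E option is included only to print the coefficients being used.

```sas
title 'All on CLASS statement, LSMEANS weighted by observed margins';
proc glm data = today;
   class catv1 catv2 catv3;
   model dv = catv1 catv2 catv3;
   /* OM: weight the other CLASS levels by their sample proportions */
   /* E:  print the coefficients used to build each LS-mean         */
   lsmeans catv3 / om e;
run;
quit;
```

With badly unbalanced 0-1 variables, comparing this output with the default LSMEANS should show how much of the gap comes from the weighting alone.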

Good luck.

Re: LSMEANS oddity

plf515 — Tue, 28 Apr 2009 14:30:01 GMT

Thanks Steve!

That is very clear.

Peter