I ran into this oddity at work. I can't show the real data, but I made up some (see below).
In this example, the differences are quite small, but in the data at work they were not so small.
Suppose you have a data set with a dependent variable and some categorical variables. Some of these
are coded 0-1, some have several levels. I was under the impression that with 0-1 variables, it did
not matter whether you included them on the CLASS statement, and, indeed, the parameter estimates are
identical for the two versions below. But LSMEANS are not identical, even for the variable with more
than 2 levels, which is always on the CLASS statement.
So ...
proc format;
   value racfmt 1 = 'Black'
                2 = 'White'
                3 = 'Latino';
run;

data today;
   input catv1 catv2 catv3 @@;
   /* made-up outcome: fixed effects plus standard normal noise */
   dv = catv1 * 3 + catv2 * 5 + catv3 + rannor(123);
   format catv3 racfmt.;
   datalines;
0 1 2 0 1 1 0 1 3 1 0 1 0 0 3 0 1 3 0 1 2 1 0 3 1 0 1 0 1 2 0 0 3
1 1 2 1 1 1 0 1 3 1 1 1 1 0 3 0 1 3 1 1 2 1 0 3 1 0 1 0 1 2 1 1 3
0 0 2 1 0 1 1 1 3 0 0 1 0 0 3 0 0 3 0 0 2 1 0 3 1 0 1 0 1 2 1 1 3
;
run;

title 'Version with all on CLASS statement';
proc glm data = today;
   class catv1 catv2 catv3;
   model dv = catv1 catv2 catv3;
   lsmeans catv1 catv2 catv3;
run;
quit;

title 'Version with only race on CLASS statement';
proc glm data = today;
   class catv3;   /* catv1 and catv2 enter as 0-1 numeric covariates */
   model dv = catv1 catv2 catv3;
   lsmeans catv3;
run;
quit;

and the LSMEANS for catv3 differ between the two versions.
I understand the models are parameterized a bit differently, with different intercepts, but
shouldn't the LSMEANS be the same? And which ones are 'correct'?
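
One thing that might be worth checking, on the assumption that the gap comes from the values LSMEANS plugs in for the other variables: by default PROC GLM holds a non-CLASS covariate at its sample mean, while it gives the levels of a CLASS effect equal weight (0.5 each for a two-level variable). The sketch below uses the AT and OBSMARGINS (OM) options of the LSMEANS statement to try each convention in the other model. The titles are made up for illustration, and I have not confirmed the output on this data.

```sas
/* Check 1: catv1 and catv2 stay off CLASS, but are held at 0.5   */
/* (equal weight on 0 and 1) instead of at their sample means.    */
title 'Only race on CLASS, covariates held at 0.5';
proc glm data = today;
   class catv3;
   model dv = catv1 catv2 catv3;
   lsmeans catv3 / at (catv1 catv2) = (0.5 0.5);
run;
quit;

/* Check 2: everything on CLASS, but the other effects are        */
/* weighted by their observed margins instead of equally.         */
title 'All on CLASS, observed margins';
proc glm data = today;
   class catv1 catv2 catv3;
   model dv = catv1 catv2 catv3;
   lsmeans catv3 / om;
run;
quit;
```

If that assumption holds, check 1 should reproduce the catv3 LSMEANS from the all-on-CLASS version, and check 2 should reproduce those from the version with only race on CLASS. In that case neither set would be wrong; they would just answer slightly different questions about what population to average over.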