Xudeer
Calcite | Level 5
Hi all,

I am running a fixed effects model that regresses an outcome on A while controlling for both student FE (studentid) and course FE (courseid). I used PROC GLM. My understanding is that I can absorb both studentid and courseid.
In my first model, I only absorbed one variable:
Proc glm data=data;
Class courseid;
Absorb studentid;
Model outcome=A courseid;
Run;

I had no problems running the model, and the coefficients on A look correct. However, when I put both courseid and studentid in the ABSORB statement, SAS fails to provide valid coefficients for A. Shouldn't the two models yield exactly the same results?
Any insights are appreciated!
9 REPLIES 9
JacobSimonsen
Barite | Level 11

Agreed, it should give the same estimates for A, unless courseID is nested in A. In that case, absorbing courseID in practice also absorbs A.

Xudeer
Calcite | Level 5

Thank you Jacob!

 

That is why I got confused. My courseID is not nested in A. A is actually teacher ID. The majority of courses are taught by multiple college instructors, and each instructor teaches multiple courses as well. The most confusing thing is that when I absorb only student ID and add courseID and my key variable A (instructorID) as dummy variables, my model is totally fine:

 

SAS Output

Source               DF    Sum of Squares   Mean Square   F Value   Pr > F
Model             82000       32715.80481       0.39897      2.39   <.0001
Error            241302       40251.64007       0.16681
Corrected Total  323302       72967.44488

R-Square   Coeff Var   Root MSE   second_2yr Mean
0.448362    118.6951   0.408424          0.344095

 

However, once I absorb both studentID and courseID, the model explodes, indicating that something is wrong:

SAS Output

Source               DF    Sum of Squares   Mean Square   F Value   Pr > F
Model            323250       72954.59899       0.22569      0.91   0.7029
Error                52          12.84589       0.24704
Corrected Total  323302       72967.44488

R-Square   Coeff Var   Root MSE   second_2yr Mean
0.999824    144.4448   0.497028          0.344095

 

 

 

PaigeMiller
Diamond | Level 26

You have only shown us the code for one of the models. Your description seems to indicate a fairly straightforward change to the code for the second model, but it would still be nice if you showed it to us.

 

My other concern is that you have 82000 df for the model in the first output. This doesn't seem to be a likely number; it seems way too large for any type of teacher/testing scenario I am aware of. The total degrees of freedom, over 300,000, is also way too large for any such scenario. Can you explain why these numbers are so large?

--
Paige Miller
Xudeer
Calcite | Level 5

Thank you for the response, PaigeMiller! 

 

Here is my first model that only absorbs studentID:

 

proc glm data=derived.fouryear;
absorb student_nid;
class instructor_nid coursenid_ft;
model second_2yr=instructor_nid coursenid_ft / solution;
run;

 

Here are the outputs:

 

SAS Output

Dependent Variable: second_2yr

 

Source               DF    Sum of Squares   Mean Square   F Value   Pr > F
Model             81979       31583.90047       0.38527      2.25   <.0001
Error            241323       41383.54441       0.17149
Corrected Total  323302       72967.44488

R-Square   Coeff Var   Root MSE   second_2yr Mean
0.432849    120.3472   0.414109          0.344095

Source              DF      Type I SS   Mean Square   F Value   Pr > F
student_nid      68606    17514.98225       0.25530      1.49   <.0001
instructor_nid    7867    10393.42479       1.32114      7.70   <.0001
coursenid_ft      5506     3675.49343       0.66754      3.89   <.0001

 

 

In the second model, everything remains the same, except that now I absorb courseID rather than having it as dummies:

 

proc sort data=derived.fouryear; by student_nid coursenid_ft;run;
proc glm data=derived.fouryear;
absorb student_nid coursenid_ft;
class instructor_nid;
model second_2yr=instructor_nid/ solution;
run;

 

Here are the outputs from the second model:

 

SAS Output

Dependent Variable: second_2yr

 

Source               DF    Sum of Squares   Mean Square   F Value   Pr > F
Model            323229       72948.94614       0.22569      0.89   0.7788
Error                73          18.49874       0.25341
Corrected Total  323302       72967.44488

R-Square   Coeff Var   Root MSE   second_2yr Mean
0.999746    146.2955   0.503396          0.344095

Source                    DF      Type I SS   Mean Square   F Value   Pr > F
student_nid            68606    17514.98225       0.25530      1.01   0.5041
coursenid_(IN ABOVE)  254558    55420.96262       0.21771      0.86   0.8402
instructor_nid            65       13.00126       0.20002      0.79   0.8338
 

 

 
 
PaigeMiller
Diamond | Level 26

You haven't addressed why there are so many degrees of freedom, this seems like an incredibly large number.

 

However, from the ABSORB documentation

 

Several variables can be specified, in which case each one is assumed to be nested in the preceding variable in the ABSORB statement.

 

So your two models are not equivalent.

 

Also, from the documentation

 

When you use the ABSORB statement, the data set (or each BY group, if a BY statement appears) must be sorted by the variables in the ABSORB statement.
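
One way to see what the nested reading implies for this data is to count the student-by-course cells. The query below is only a sketch: it assumes the data set and variable names from the code posted earlier in the thread (derived.fouryear, student_nid, coursenid_ft), and the column aliases n_cells, n_obs and n_students are made up for illustration.

```sas
proc sql;
  /* number of distinct student-by-course combinations, i.e. the cells
     that "coursenid_ft nested in student_nid" would absorb */
  select count(*) as n_cells
  from (select distinct student_nid, coursenid_ft
        from derived.fouryear);

  /* total rows and number of students, for comparison */
  select count(*) as n_obs,
         count(distinct student_nid) as n_students
  from derived.fouryear;
quit;
```

If n_cells comes out close to n_obs, absorbing student_nid and coursenid_ft as nested effects uses up almost every degree of freedom. That would be consistent with the second output above, where the two absorbed terms take 68606 + 254558 df out of 323302 and only 73 are left for error.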

--
Paige Miller
Xudeer
Calcite | Level 5

Thank you PaigeMiller! 

 

You haven't addressed why there are so many degrees of freedom, this seems like an incredibly large number.

- We have more than 300,000 observations (the data is student by course level transcript records from multiple cohorts of students from an entire four-year public college system)

 

However, from the ABSORB documentation

 

Several variables can be specified, in which case each one is assumed to be nested in the preceding variable in the ABSORB statement.

 

So your two models are not equivalent.

- I see. What I would like to do is absorb studentID and courseID, which are not nested within each other. Is there any procedure in SAS that allows that?

 

 

Also, from the documentation

 

When you use the ABSORB statement, the data set (or each BY group, if a BY statement appears) must be sorted by the variables in the ABSORB statement.

 

- Yes, I sorted the data by those variables before running the procedure.

JacobSimonsen
Barite | Level 11

I don't think there is any procedure that does what you want. Theoretically, though, it is possible to "absorb" non-nested variables. As you may know, the absorb method projects the data and the column vectors of the design matrix onto the orthogonal complement of the space spanned by the design vectors of the variable(s) in the ABSORB statement. This is quite simple if there is only one class variable in the ABSORB statement. With several non-nested variables, the projection becomes more complicated (in terms of computation time). I experimented with this some years ago and didn't find a time-efficient way to do it, so that may be why it is not possible with PROC GLM either.
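
For numeric variables, the projection described above can be approximated by alternating within-group demeaning: subtract student means, then course means, and repeat until the values stop changing. The macro below is a minimal sketch of that idea and not a built-in SAS feature. The macro name demean2, its parameters, the work data sets and the numeric regressor x are all hypothetical. A fixed number of passes stands in for a proper convergence check, and the input is assumed to have no missing values in the listed variables. It does not directly cover a CLASS regressor such as instructor_nid.

```sas
/* Hypothetical helper: sweep two non-nested fixed effects out of numeric
   variables by alternating within-group demeaning (fixed number of passes). */
%macro demean2(data=, out=, vars=, fe1=, fe2=, iters=25);
  %local i;
  data &out;
    set &data;
  run;
  %do i = 1 %to &iters;
    /* subtract the fe1 group means */
    proc sort data=&out out=work._swp;
      by &fe1;
    run;
    proc standard data=work._swp out=&out mean=0;
      by &fe1;
      var &vars;
    run;
    /* subtract the fe2 group means */
    proc sort data=&out out=work._swp;
      by &fe2;
    run;
    proc standard data=work._swp out=&out mean=0;
      by &fe2;
      var &vars;
    run;
  %end;
%mend demean2;

/* Example call; x is a hypothetical numeric regressor */
%demean2(data=derived.fouryear, out=work.swept, vars=second_2yr x,
         fe1=student_nid, fe2=coursenid_ft);

/* Regress the swept outcome on the swept regressor, without an intercept */
proc reg data=work.swept;
  model second_2yr = x / noint;
run;
quit;
```

Each pass sorts the whole data set twice, which illustrates the computation-time concern raised above. The standard errors printed by PROC REG would also need a degrees-of-freedom correction, because the procedure does not know how many fixed-effect parameters were swept out.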

Xudeer
Calcite | Level 5

Thank you Jacob! For your information, Stata can absorb multiple non-nested variables, but it runs extremely slowly on a large dataset such as mine.

I guess for SAS, I will have to absorb only one variable while adding the other as dummies?

