Lop
Fluorite | Level 6

I would appreciate any insights that help me solve this problem.

I was running a robust regression with continuous and categorical variables. For this, I converted the categorical variables into dummy variables.

 

proc robustreg data=cpro.fstqwom_dum method=mm plots=all ;
    model inbody_pbf=sdtaqdum2 sdtaqdum3 sdtaqdum4 age bmi waist bsa
    exerc2 smokdum2 smokdum3 drinkdum2 drinkdum3 jobdum1 jobdum2 jobdum3 jobdum4 jobdum5 jobdum6
    jobdum7 jobdum8 jobdum9 jobdum10 jobdum11 jobdum13 / diagnostics leverage (opc mcdinfo) ;
    output out=robrefsittapbfresw5 weight=wgt ;
    test sdtaqdum2 sdtaqdum3 sdtaqdum4 ;
run ;

 

However, after adding more dummy variables (the ones highlighted in blue), I got the message below.

My previous models did not show this problem. I also tried the CLASS statement, but the problem persisted.


WARNING: The design matrix is singular. Some regressors are dropped from the matrix. LEVERAGE is being computed
         on the reduced design matrix.
ERROR: The current MM estimation failed because a collinearity problem for a subset of the dataset occurred in
       its initial LTS estimation.
ERROR: Initial LTS estimator cannot be computed.

 

PaigeMiller
Diamond | Level 26

If you have 13 levels of JOB, then you need 12 (not 13) variables jobdum1 through jobdum12. Similarly for your other dummy variables. This eliminates the message.
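
As a rough sketch of that coding, 12 indicators for a 13-level variable could be built in a DATA step like the one below. The variable JOB (assumed numeric, coded 1 to 13), the indicator names JD1-JD12, and the output data set WORK.JOB_DUMMIES are all hypothetical; none of them appear in the original post. Level 13 is the reference here because it gets zeros in every indicator.

```sas
/* Sketch only: JOB, JD1-JD12 and WORK.JOB_DUMMIES are hypothetical names. */
data work.job_dummies;
    set cpro.fstqwom_dum;
    array jd{12} jd1-jd12;                /* 12 indicators for 13 levels */
    do k = 1 to 12;
        if missing(job) then jd{k} = .;   /* keep missing values missing */
        else jd{k} = (job = k);           /* 1 if JOB equals level k, else 0 */
    end;
    drop k;
run;
```

Only the 12 indicators, not a thirteenth, would then go on the MODEL statement.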

 

Using ROBUSTREG will not solve any collinearity problems. ROBUSTREG is effective in the presence of outliers.

--
Paige Miller
Lop
Fluorite | Level 6

Thanks @PaigeMiller for your answer. I did in fact use just 12 dummy variables, omitting jobdum12 as the reference, but I still got the same disappointing results. As you mention, robust regression is meant for data with large outliers or high-leverage points; in this specific case, however, I ran into a collinearity problem that appeared after adding these last dummy variables. Previously, I ran the same regression with other dummies that had fewer categories and got successful results.

I would be grateful for any additional feedback. Greetings.

PaigeMiller
Diamond | Level 26


You have to make this change for all of your dummy variables, not just the ones for JOB. Or better yet, do what @Rick_SAS suggested.

--
Paige Miller
Rick_SAS
SAS Super FREQ

PROC ROBUSTREG supports a CLASS statement. This feature was introduced in SAS 9.22.  I suggest you list categorical variables in the CLASS and MODEL statements instead of generating your own dummy variables.

Lop
Fluorite | Level 6

Dear @Rick_SAS and @PaigeMiller, thanks for replying.

I tried the option you suggested and got the same problem described above. Perhaps it is related to the syntax I am using.

 

/*Option 1*/

proc robustreg data=cpro.fstqwom_dum method=mm plots=all ;
model inbody_pbf=sittaqdum2 sittaqdum3 sittaqdum4 age bmi waist bsa exerc2 smokdum2 smokdum3 drinkdum2 drinkdum3 
jobdum2 jobdum3 jobdum4 jobdum5 jobdum6 jobdum7 jobdum8 jobdum9 jobdum10 jobdum11 jobdum12 jobdum13  / diagnostics leverage (opc mcdinfo) ;
output out=robrefsittapbfresw5 weight=wgt ;
test sittaqdum2 sittaqdum3 ;
run ;

 

/*Option 2*/
proc robustreg data=cpro.fstqwom_dum method=mm plots=all ;
class sdta_quart exerc2 smoking drinking a7_1 ;
model inbody_pbf=sdta_quart age bmi waist bsa exerc2 smoking drinking a7_1 / diagnostics leverage (opc mcdinfo) ;
output out=robrefsittapbfresw5 weight=wgt ;
test sdta_quart ;
run ;

 

With that said, let me ask a couple of questions.

Is there a way to control the reference level for categorical variables in robust regression, as there is in logistic regression?

Does SAS support categorical variables with many categories, as in this case?

 

Thanks once again for your valuable insights.

 

Lop,

 

Rick_SAS
SAS Super FREQ

The MM method is initialized by using the LTS method, and that is the algorithm that is failing. Try 

PROC ROBUSTREG method=MM(INITEST= S) ...

and maybe the S method will converge to an initial estimate.
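
Spelled out as a full step, that might look like the sketch below. It reuses the data set and variable names from the CLASS-statement version ("Option 2") posted earlier in the thread. Dropping the PLOTS= and LEVERAGE options is only a simplification for this illustration, and there is no guarantee that the S initializer succeeds on these data.

```sas
/* Sketch only: same model as "Option 2", but MM is initialized by the
   S estimator instead of the default LTS estimator. */
proc robustreg data=cpro.fstqwom_dum method=mm(initest=s);
    class sdta_quart exerc2 smoking drinking a7_1;
    model inbody_pbf = sdta_quart age bmi waist bsa
                       exerc2 smoking drinking a7_1 / diagnostics;
    test sdta_quart;
run;
```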

 

Alternatively, if you stick with LTS as the initializer, you can try increasing the default H (breakdown) value. The syntax is 

PROC ROBUSTREG method=MM(INITEST= LTS H=0.24) ...

where the H= value depends on the size of your data.

 

If that doesn't work, you might need to change to the M method. I think the number of categorical variables is causing this problem. You only have 4 continuous variables whereas you have dozens of categorical levels. The algorithms for robust regression were created for continuous variables. Later people tried to extend them to support discrete variables, but as the ROBUSTREG doc says:

Note: Because the LTS and S methods use subsampling algorithms, these methods are not suitable in an analysis that uses variables that have only a few unequal values..... For example, indicator variables that correspond to a classification variable often fall into this category. The same issue also applies to the initial LTS and S estimates in the MM method. For a model that includes classification independent variables or continuous independent variables with a few unequal values, the M method is recommended.
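
A minimal sketch of the M-method fallback, again reusing the names from "Option 2" in this thread, could look like this. The output data set name WORK.ROB_M is hypothetical.

```sas
/* Sketch only: M estimation with the categorical variables on CLASS.
   WORK.ROB_M is a hypothetical output data set name. */
proc robustreg data=cpro.fstqwom_dum method=m;
    class sdta_quart exerc2 smoking drinking a7_1;
    model inbody_pbf = sdta_quart age bmi waist bsa
                       exerc2 smoking drinking a7_1 / diagnostics leverage;
    output out=work.rob_m weight=wgt;   /* final weights, as in the original code */
    test sdta_quart;
run;
```

Keep in mind that M estimation protects against outliers in the response but not against high-leverage points in the regressors, so the leverage diagnostics are still worth checking.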

 

 

PaigeMiller
Diamond | Level 26


Sure would be nice if you showed us the relevant portions of your SASLOG, with error message.

 

The problem may be that your different categorical variables are perfectly correlated with one another, and so even by reducing the number of dummy variables by 1, or by using the CLASS statement, the matrix still can't be inverted.
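
One way to look for that kind of dependence is to tabulate the categorical variables against each other, as in the sketch below. It uses the variable names from the CLASS-statement code in this thread; treating a7_1 as the job variable is an assumption on my part. Levels with very few observations, or cross-tabulations in which one variable's level is fully determined by another's, would be the things to look for.

```sas
/* Sketch only: inspect level counts and pairwise cross-tabulations. */
proc freq data=cpro.fstqwom_dum;
    /* one-way counts: look for levels with very few observations */
    tables sdta_quart exerc2 smoking drinking a7_1 / missing;
    /* two-way counts: look for empty cells or one-to-one patterns */
    tables a7_1*(sdta_quart exerc2 smoking drinking)
           / norow nocol nopercent missing;
run;
```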

--
Paige Miller
