Re: Collinearity problem in robust regression

Lop · Posted 07-11-2018 04:09 AM

I would appreciate any insights that could help me solve this problem.

I was carrying out a robust regression with continuous and categorical variables. For this, I transformed the categorical variables into dummy variables.

proc robustreg data=cpro.fstqwom_dum method=mm plots=all;
   model inbody_pbf = sdtaqdum2 sdtaqdum3 sdtaqdum4 age bmi waist bsa
         exerc2 smokdum2 smokdum3 drinkdum2 drinkdum3
         jobdum1 jobdum2 jobdum3 jobdum4 jobdum5 jobdum6
         jobdum7 jobdum8 jobdum9 jobdum10 jobdum11 jobdum13
         / diagnostics leverage(opc mcdinfo);
   output out=robrefsittapbfresw5 weight=wgt;
   test sdtaqdum2 sdtaqdum3 sdtaqdum4;
run;

However, after adding more dummy variables (the ones highlighted in blue in the original post), I got the messages shown below.

My previous models did not show this problem. I also tried the CLASS statement, but the problem persisted.

WARNING: The design matrix is singular. Some regressors are dropped from the matrix. LEVERAGE is being computed
on the reduced design matrix.
ERROR: The current MM estimation failed because a collinearity problem for a subset of the dataset occurred in
its initial LTS estimation.
ERROR: Initial LTS estimator cannot be computed.

PaigeMiller · Posted 07-11-2018 06:51 AM

If you have 13 levels of JOB, then you need 12 (not 13) variables jobdum1 through jobdum12. Similarly for your other dummy variables. This eliminates the message.

Using ROBUSTREG will not solve any collinearity problems; ROBUSTREG is effective in the presence of outliers.
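
As a quick way to apply the "k levels need k-1 dummies" rule, the sketch below counts the distinct levels of each categorical variable. It assumes the original (non-dummy) variables are the ones named later in this thread (sdta_quart, exerc2, smoking, drinking, a7_1) and that they live in the same data set; adjust the names if yours differ.

```sas
/* Sketch: report the number of distinct levels per categorical variable.
   A variable with k levels should contribute k-1 dummy variables.
   Variable names are taken from later posts in this thread and are
   assumed to exist in cpro.fstqwom_dum. */
proc freq data=cpro.fstqwom_dum nlevels;
   tables sdta_quart exerc2 smoking drinking a7_1 / noprint;
run;
```

The NLEVELS table also lists missing levels separately, which helps spot a category that exists only as missing values.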

--
Paige Miller

Lop · Posted 07-11-2018 10:42 PM

Thanks @PaigeMiller for your answer. I did indeed use just 12 dummy variables, omitting jobdum12 as the reference, but I still got the same disappointing results. As you mention, robust regression is meant for data with large outliers or high-leverage points; in this case, however, I ran into a collinearity problem that appeared after adding these last dummy variables. Previously, I ran the same regression with other dummy variables that had fewer categories, and it worked.

I would be grateful for any additional feedback. Regards.

PaigeMiller · Posted 07-12-2018 09:12 AM

You have to make this change for all of your dummy variables, not just the ones for JOB. Or better yet, do what @Rick_SAS suggested.

--
Paige Miller

Rick_SAS · Posted 07-12-2018 09:03 AM

PROC ROBUSTREG supports a CLASS statement. This feature was introduced in SAS 9.22. I suggest you list categorical variables in the CLASS and MODEL statements instead of generating your own dummy variables.

Lop · Posted 07-12-2018 11:01 PM

Dear @Rick_SAS and @PaigeMiller, thanks for replying.

I tried the options you suggested and got the same problem as before. Perhaps it is related to the syntax I am using.

/* Option 1 */
proc robustreg data=cpro.fstqwom_dum method=mm plots=all;
   model inbody_pbf = sittaqdum2 sittaqdum3 sittaqdum4 age bmi waist bsa exerc2
         smokdum2 smokdum3 drinkdum2 drinkdum3
         jobdum2 jobdum3 jobdum4 jobdum5 jobdum6 jobdum7 jobdum8
         jobdum9 jobdum10 jobdum11 jobdum12 jobdum13
         / diagnostics leverage(opc mcdinfo);
   output out=robrefsittapbfresw5 weight=wgt;
   test sittaqdum2 sittaqdum3;
run;

/* Option 2 */
proc robustreg data=cpro.fstqwom_dum method=mm plots=all;
   class sdta_quart exerc2 smoking drinking a7_1;
   model inbody_pbf = sdta_quart age bmi waist bsa exerc2 smoking drinking a7_1
         / diagnostics leverage(opc mcdinfo);
   output out=robrefsittapbfresw5 weight=wgt;
   test sdta_quart;
run;

I also have a couple of questions.

Is there a way to control the reference level for categorical variables in robust regression, as there is in logistic regression?

Does SAS support categorical variables with many categories, as in this case?

Thanks once again for your valuable insights.

Lop

Rick_SAS · Posted 07-13-2018 06:34 AM

The MM method is initialized by using the LTS method, and that is the algorithm that is failing. Try

PROC ROBUSTREG method=MM(INITEST= S) ...

and maybe the S method will converge to an initial estimate.

Alternatively, if you stick with LTS as the initializer, you can try increasing the default H (breakdown) value. The syntax is

PROC ROBUSTREG method=MM(INITEST= LTS H=0.24) ...

where the H= value depends on the size of your data.

If that doesn't work, you might need to change to the M method. I think the number of categorical variables is causing this problem. You only have 4 continuous variables whereas you have dozens of categorical levels. The algorithms for robust regression were created for continuous variables. Later people tried to extend them to support discrete variables, but as the ROBUSTREG doc says:

Note: Because the LTS and S methods use subsampling algorithms, these methods are not suitable in an analysis that uses variables that have only a few unequal values. ... For example, indicator variables that correspond to a classification variable often fall into this category. The same issue also applies to the initial LTS and S estimates in the MM method. For a model that includes classification independent variables or continuous independent variables with a few unequal values, the M method is recommended.
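
Following the documentation note above, here is a minimal sketch of the same model fit with the M method and a CLASS statement. It reuses the data set and variable names from Option 2 in this thread; the output data set name robm_out is made up for illustration.

```sas
/* Sketch: M estimation with classification variables.
   The M method does not rely on the subsampling used by LTS and S,
   so indicator columns with few distinct values are less of a problem.
   robm_out is a hypothetical output data set name. */
proc robustreg data=cpro.fstqwom_dum method=m plots=all;
   class sdta_quart exerc2 smoking drinking a7_1;
   model inbody_pbf = sdta_quart age bmi waist bsa exerc2 smoking drinking a7_1
         / diagnostics leverage;
   output out=robm_out weight=wgt;
   test sdta_quart;
run;
```

Keep in mind that M estimation protects against outliers in the response but not against high-leverage points in the regressors, so the leverage diagnostics are still worth reviewing.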

PaigeMiller · Posted 07-13-2018 06:53 AM

It would help if you showed us the relevant portions of your SAS log, including the error messages.

The problem may be that your different categorical variables are perfectly correlated with one another, and so even by reducing the number of dummy variables by 1, or by using the CLASS statement, the matrix still can't be inverted.
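
One way to look for such exact dependencies, sketched here with the variable names from Option 1 and Option 2 in this thread (assumed to be in the same data set), is to fit an ordinary least squares model on the same regressors and to cross-tabulate the categorical variables. This is only a diagnostic aid, not a replacement for the robust fit.

```sas
/* Sketch: OLS on the same regressors, used only to expose exact linear
   dependencies. When the model is not full rank, PROC REG prints a note
   showing which variables are linear combinations of the others. */
proc reg data=cpro.fstqwom_dum;
   model inbody_pbf = sittaqdum2 sittaqdum3 sittaqdum4 age bmi waist bsa exerc2
                      smokdum2 smokdum3 drinkdum2 drinkdum3
                      jobdum2 jobdum3 jobdum4 jobdum5 jobdum6 jobdum7 jobdum8
                      jobdum9 jobdum10 jobdum11 jobdum12 jobdum13;
run;
quit;

/* Cross-tabulate the job variable (assumed to be a7_1) against the other
   categorical variables. Rows or columns with a single nonzero cell point
   to levels that are confounded with each other. */
proc freq data=cpro.fstqwom_dum;
   tables a7_1*smoking a7_1*drinking a7_1*exerc2 / norow nocol nopercent;
run;
```

If PROC REG reports that a dummy variable is a linear combination of others, dropping or merging that level is a reasonable first step before returning to PROC ROBUSTREG.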

--
Paige Miller