Hi,
I am using SAS Enterprise Miner and SAS Enterprise Guide to run logistic regression on the same dataset.
However, PROC DMREG (for Data Mining needs) in SAS EM gives me a different set of output statistics than PROC LOGISTIC in SAS EG.
Which set of results is better? Is there any reference I can use to identify better results?
Please note that the direct use of the DMREG procedure is not supported by SAS Technical Support. There is, however, documentation available on request to licensed users of SAS Enterprise Miner. An excerpt from the documentation for DMREG explains these potential differences:
/*** BEGIN EXCERPT ***/
The DMREG and LOGISTIC procedures fit the same models for a categorical target. Both procedures have the CLASS statement to specify categorical input variables and both use the deviation from the mean coding as the default parameterization for a CLASS input variable. However, there are many differences between the two procedures, both in syntax and in features. For example, to specify the GLM parameterization of CLASS variables, you specify the MODEL statement option CODING=GLM in the DMREG procedure. But, in the LOGISTIC procedure, you specify the CLASS statement option PARAM=GLM. You are required to specify a DMDB catalog of input data in the DMREG procedure, but not in the LOGISTIC procedure. The DMREG procedure produces DATA step scoring code, but the LOGISTIC procedure does not. In terms of training a model, you might expect the estimates from both procedures to be identical. Often the estimates between the two procedures are very close but not necessarily identical for a number of reasons. The DMREG and LOGISTIC procedures do not use the same routines to carry out the optimization, and the convergence criterion and optimization technique used might not be the same. However, discrepancies of the parameter estimates between the two procedures would not make any difference in prediction.
/*** END EXCERPT ***/
In short, differences in how categorical effects are coded, differences in the optimization algorithms, and collinearity among the predictors can all lead to slightly different parameter estimates, but these should produce minimal differences in the predicted values. With GLM coding, exponentiating a parameter estimate gives a meaningful value (the odds ratio against the 'base' level). That is not true for the default deviation coding used by DMREG, which compares each level to the average rather than to a 'base' level. One other thing: for any observation with missing values, the Regression node in SAS Enterprise Miner uses the overall average as the predicted value, whereas the LOGISTIC procedure drops these observations completely.
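To illustrate the coding point, here is a minimal sketch in Python (used only for the arithmetic; it is not SAS code and not how either procedure is implemented). It assumes a single CLASS input with three levels and made-up GLM-coded estimates. The function names `glm_to_deviation` and `predicted_prob` and all numbers are hypothetical. It shows that the two parameterizations give different estimates but the same predicted probabilities, and that only the GLM-coded estimate exponentiates directly to an odds ratio against the base level.

```python
import math

def glm_to_deviation(intercept, glm_effects):
    """Re-express GLM (reference) coded estimates in deviation-from-mean coding.

    glm_effects maps every level to its estimate; the base level has 0.0.
    """
    mean_effect = sum(glm_effects.values()) / len(glm_effects)
    dev_intercept = intercept + mean_effect
    dev_effects = {lvl: b - mean_effect for lvl, b in glm_effects.items()}
    return dev_intercept, dev_effects

def predicted_prob(intercept, effects, level):
    """Logistic prediction for an observation at the given class level."""
    eta = intercept + effects[level]
    return 1.0 / (1.0 + math.exp(-eta))

# Hypothetical GLM-coded fit: level "C" is the base level.
glm_intercept = -2.0
glm_effects = {"A": 0.9, "B": 0.3, "C": 0.0}

dev_intercept, dev_effects = glm_to_deviation(glm_intercept, glm_effects)

for level in glm_effects:
    p_glm = predicted_prob(glm_intercept, glm_effects, level)
    p_dev = predicted_prob(dev_intercept, dev_effects, level)
    print(level, glm_effects[level], round(dev_effects[level], 4), p_glm, p_dev)

# Odds ratio of A vs the base level C:
or_from_glm = math.exp(glm_effects["A"])                     # exp of one estimate
or_from_dev = math.exp(dev_effects["A"] - dev_effects["C"])  # needs a difference
exp_dev_only = math.exp(dev_effects["A"])                    # A vs the average, not A vs C
print(or_from_glm, or_from_dev, exp_dev_only)
```

Under deviation coding the effects sum to zero across levels, so the odds ratio between two levels has to be built from the difference of two effects rather than from a single exponentiated estimate.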
Hope this helps!
Doug
I confess, I'm going by a hazy recollection of a general impression here.
Doesn't Enterprise Miner automatically sample the data and perform its analysis on the sample? That's at least an issue to verify and would explain why the results might be different.
Thank you Astounding.
The 2 procedures use the same dataset for comparison. No sampling is applied.
For example,
* For PROC LOGISTIC (the [SAS Code] node in SAS EM generates the same results as SAS EG), I get:
Analysis of Maximum Likelihood Estimates
                           Standard        Wald
Parameter    DF  Estimate     Error  Chi-Square  Pr > ChiSq
Intercept     1   -4.2280    0.1005   1768.1503      <.0001
F_HV          1   -0.4514    0.0447    102.1930      <.0001
F_REV         1    1.1634    0.0399    851.3617      <.0001
F_ACTIVE0     1    0.6207    0.0395    246.4798      <.0001
(The results for the other 15 parameters almost match those from PROC DMREG.)
Odds Ratio Estimates
               Point       95% Wald
Effect      Estimate   Confidence Limits
F_HV           0.637     0.583     0.695
F_REV          3.201     2.960     3.461
F_ACTIVE0      1.860     1.721     2.010
* For PROC DMREG (the [Regression] node in SAS EM), I get:
Analysis of Maximum Likelihood Estimates
                             Standard        Wald              Standardized
Parameter      DF  Estimate     Error  Chi-Square  Pr > ChiSq      Estimate  Exp(Est)
Intercept       1   -3.5620    0.1009     1247.02      <.0001                   0.028
F_ACTIVE0 0     1   -0.3103    0.0198      246.48      <.0001                   0.733
F_REV 0         1   -0.5817    0.0199      851.36      <.0001                   0.559
F_HV 0          1    0.2257    0.0223      102.19      <.0001                   1.253
(The results for the other 15 parameters almost match those from PROC LOGISTIC.)
Odds Ratio Estimates
                          Point
Effect                 Estimate
F_ACTIVE0   0 vs 1        0.538
F_REV       0 vs 1        0.312
F_HV        0 vs 1        1.570
The other 15 parameters have almost the same output statistics in PROC LOGISTIC and PROC DMREG. However, the 4 parameters above (including the Intercept) have different results.
One more thing: in the PROC DMREG output, why are the Exp(Est) values for these 3 parameters not equal to the odds ratio Point Estimates? (For the other 15 parameters they are equal.)
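The posted numbers are consistent with the three flags entering PROC LOGISTIC as numeric 0/1 inputs and entering DMREG as two-level CLASS inputs under deviation coding (level 0 coded +1, level 1 coded -1). That is an assumption inferred from the output, not something confirmed in the thread. Under it, Exp(Est) is exp(estimate), which compares level 0 to the average, while the "0 vs 1" odds ratio is exp(2 * estimate), because the two levels sit two coding units apart. The same assumption implies that each DMREG estimate is minus half of the LOGISTIC estimate and that the intercept shifts by half the sum of the three LOGISTIC estimates. The sketch below (Python, arithmetic only) checks these relations against the posted values; agreement is expected only up to the rounding of the printed estimates.

```python
import math

# Estimates copied from the two outputs posted above.
logistic = {"Intercept": -4.2280, "F_HV": -0.4514, "F_REV": 1.1634, "F_ACTIVE0": 0.6207}
dmreg = {"Intercept": -3.5620, "F_HV": 0.2257, "F_REV": -0.5817, "F_ACTIVE0": -0.3103}
dmreg_exp_est = {"F_HV": 1.253, "F_REV": 0.559, "F_ACTIVE0": 0.733}
dmreg_odds_ratio = {"F_HV": 1.570, "F_REV": 0.312, "F_ACTIVE0": 0.538}  # 0 vs 1
flags = ["F_HV", "F_REV", "F_ACTIVE0"]

for name in flags:
    c = dmreg[name]
    # Exp(Est): exponentiated deviation-coded estimate (level 0 vs the average).
    exp_est = math.exp(c)
    # Levels 0 and 1 are assumed coded +1 and -1, so their logits differ by 2*c.
    odds_0_vs_1 = math.exp(2 * c)
    # A numeric 0/1 slope b in LOGISTIC would correspond to c = -b / 2.
    implied_c = -logistic[name] / 2
    print(f"{name}: exp(c)={exp_est:.4f} (posted {dmreg_exp_est[name]}), "
          f"exp(2c)={odds_0_vs_1:.4f} (posted {dmreg_odds_ratio[name]}), "
          f"-b/2={implied_c:.4f} (posted c={c})")

# Intercept implied by the LOGISTIC fit under the same assumption.
implied_intercept = logistic["Intercept"] + sum(logistic[n] for n in flags) / 2
print(f"implied intercept={implied_intercept:.4f} (posted {dmreg['Intercept']})")
```

If the assumption holds, the two outputs describe the same fitted model. The LOGISTIC odds ratios are for 1 vs 0 and the DMREG ones for 0 vs 1, so they are reciprocals of each other, and the matching Wald chi-square values (for example 246.48 for F_ACTIVE0 in both outputs) point the same way.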