There is a white paper on selecting important variables in a linear regression model: http://support.sas.com/resources/papers/proceedings15/3242-2015.pdf
It explains that the Gini coefficient can be used to check linearity in the model, and that variables can be ranked by their Gini coefficient. A higher Gini coefficient suggests a higher potential for the variable to be useful in a linear regression. If a numeric variable ranks high on IV but low on Gini coefficient, it usually suggests a lack of linearity.
My question: is this the Gini coefficient derived from decision trees, or is it related to the area under the curve (AUC), i.e. Gini = 2*AUC - 1? What is the exact calculation of this Gini coefficient, and how can it be used to check linearity? I googled a lot, but all I found is that it is used in economics to measure inequality.
Any help would be highly appreciated. Thanks!
The Gini coefficient, or Somers' D statistic, gives a measure of concordance in logistic models. It is a rank-based statistic, where all results are paired (all observed with all predicted). In linear regression, it is a transformation of the Pearson correlation coefficient.
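One well-known relation of this kind is Greiner's relation: for bivariate normal data with Pearson correlation rho, Kendall's tau equals (2/pi) * arcsin(rho), and for continuous data without ties Somers' D coincides with Kendall's tau. The following is a minimal Python sketch that checks this on simulated data; the function names, sample size, seed, and rho = 0.6 are illustrative assumptions, not anything taken from the papers linked in this thread.

```python
import math
import random


def kendall_tau_a(x, y):
    """Kendall's tau-a by brute-force counting of concordant/discordant pairs."""
    n = len(x)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


def simulate_bivariate_normal(rho, n, seed=0):
    """Draw n pairs (x, y) from a standard bivariate normal with correlation rho."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        xs.append(z1)
        ys.append(rho * z1 + math.sqrt(1 - rho ** 2) * z2)
    return xs, ys


rho = 0.6  # assumed correlation, for illustration only
x, y = simulate_bivariate_normal(rho, n=1000)
tau = kendall_tau_a(x, y)
greiner = 2 / math.pi * math.asin(rho)
print(f"pairwise tau = {tau:.3f}, (2/pi)*asin(rho) = {greiner:.3f}")
```

The two printed values should be close but not identical, because tau is estimated from a finite sample.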
Here is a nice paper that covers a lot of what is buried in the SGF paper.
http://www.imperial.ac.uk/nhli/r.newson/miscdocs/intsomd1.pdf
Steve Denham.
Check PROC UNIVARIATE, which can calculate Gini.
Thanks Xia. The Gini that PROC UNIVARIATE produces is a measure of statistical dispersion; correct me if I am wrong. A low Gini coefficient indicates a more equal distribution, with 0 corresponding to complete equality. How can it be used to check linearity? How can it be used in the modeling process to select important linear variables?
Sorry. I will leave it to Steve or lvm.
Thanks Xia for looking into it.
Thanks a ton, Steve, for your answer. I know Somers' D and the Gini coefficient: Gini coefficient = 2*AUC - 1, and AUC = % concordant + 0.5 * (% tied pairs). It would be great if you could share an article supporting "In linear regression, it is a transformation of the Pearson correlation coefficient." I am more interested in the application of the Gini coefficient in linear regression, and I did not find a single article to support it.
Am I correct?
In logistic regression, if the Gini coefficient is high, is the logit function monotonically related to the independent variable?
In linear regression, if the Gini coefficient is high, is y linearly related to the independent variable?
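The identities Gini = 2*AUC - 1 and AUC = % concordant + 0.5 * % tied can be checked by counting event/non-event pairs directly. Below is a minimal Python sketch under the assumption of a binary outcome and a model score; the function name `pair_fractions` and the toy data are made up for illustration.

```python
def pair_fractions(y, score):
    """Fractions of concordant, discordant and tied (event, non-event) pairs."""
    events = [s for yi, s in zip(y, score) if yi == 1]
    nonevents = [s for yi, s in zip(y, score) if yi == 0]
    concordant = discordant = tied = 0
    for e in events:
        for ne in nonevents:
            if e > ne:
                concordant += 1   # event scored higher than non-event
            elif e < ne:
                discordant += 1   # event scored lower than non-event
            else:
                tied += 1
    total = len(events) * len(nonevents)
    return concordant / total, discordant / total, tied / total


# Toy data (hypothetical): 4 events, 4 non-events, one tied pair at 0.4
y = [1, 1, 1, 0, 0, 0, 1, 0]
score = [0.9, 0.8, 0.4, 0.4, 0.3, 0.2, 0.7, 0.6]

pc, pd, pt = pair_fractions(y, score)
auc = pc + 0.5 * pt          # AUC = % concordant + 0.5 * % tied
gini = 2 * auc - 1           # Gini = 2*AUC - 1
somers_d = pc - pd           # Somers' D = % concordant - % discordant

print(f"AUC = {auc}, Gini = {gini}, Somers' D = {somers_d}")
```

On this toy data there are 14 concordant, 1 discordant and 1 tied pair out of 16, so AUC is 0.90625 and both Gini and Somers' D come out as 0.8125.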
Regarding correctness of interpretation, that is the way I would interpret it.
The quote is from the Imperial College paper I linked to.
Steve Denham