Estimate for reference category

LuciferF · Posted 01-10-2019 09:46 AM

Hello everyone!

While building a scorecard, I ran into a problem converting probability into a risk score. To build the scorecard, I used logistic regression with param=ref in the CLASS statement. The final model includes 10 variables. In the PROC LOGISTIC output I got estimates for the intercept and for all bins of each variable except the reference category. According to Naeem Siddiqi's book, there are two ways to convert probability of default to a risk score:

1. Risk score = Offset + Factor*ln(odds)

2. Risk score = Σ((-WOEj*Bi + a/n)*Factor + Offset/n)

The second method is preferable for me, because it lets me assign a score to each category of each variable separately. The problem is that with reference cell coding there are no estimates for the reference category, so I can't calculate its risk score. Is there a way to calculate a generalized regression coefficient for each variable under reference cell coding, or to calculate an estimate for the reference category?

I hope I managed to describe the problem, thank you in advance, dear colleagues.

Rick_SAS · Posted 01-10-2019 02:35 PM

Maybe I am misunderstanding the question, but wouldn't you use 0 as the coefficient for the categorical variable that contains the reference category? That's the definition of the reference category: it gets a zero coefficient (estimate) and the other estimates represent the relative change as compared to the reference level.

Here is an example. Notice that the values of the linear predictor in the scored data set differ from the reference-level value by exactly the parameter estimates.

/* Keep two levels of Origin (Asia, USA) and drop the Hybrid type */
data Cars;
   set Sashelp.cars(where=(type^='Hybrid' AND Origin^="Europe"));
run;

/* Reference cell coding with Sedan as the reference level of Type */
proc logistic data=Cars;
   class Type(ref='Sedan') / param=ref;
   model Origin(event="USA") = mpg_city Type;
   store out=LogiModel;
run;

/* Three rows that differ only in Type; mpg_city stays at 25 */
data ScoreMe;
   Type = "Sedan"; mpg_city = 25; output;
   Type = "Wagon"; output;
   Type = "Truck"; output;
run;

proc plm restore=LogiModel noprint;
   score data=ScoreMe out=Pred pred; /* linear predictor */
run;

proc print data=Pred; run;
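
To make the comparison with the parameter estimates explicit, here is a minimal sketch that subtracts the reference-level (Sedan) prediction from each scored row. It assumes that the SCORE statement wrote the linear predictor to a variable named Predicted (the default name) and that Sedan is the first row of Pred. The names Diff, RefPred, and DiffFromRef are illustrative only.

```sas
/* Display the stored parameter estimates for comparison */
proc plm restore=LogiModel;
   show parameters;
run;

/* Assumption: the linear predictor is in the variable Predicted */
/* and the first row of Pred is the Sedan (reference) row        */
data Diff;
   set Pred;
   retain RefPred;
   if _n_ = 1 then RefPred = Predicted;
   DiffFromRef = Predicted - RefPred;  /* 0 for Sedan */
run;

proc print data=Diff; run;
```

Because mpg_city is the same in all three rows, DiffFromRef for Wagon and Truck should equal the corresponding Type estimates, and it is zero for Sedan.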

LuciferF · Posted 01-11-2019 12:14 AM

@Rick_SAS, thank you for your reply! Yes, I used zero as the estimate for the reference category to calculate the predicted value. The problem is that instead of a probability I want a separate risk score for each variable. Let me try to give an example of what I want.

We construct a logistic regression on one variable, assuming that the probability of default is dependent on the client’s work experience (in months).

We have six categories of clients with different distributions of bad and good clients:

BIN	Total Number of Loans	Number of Bad Loans	Number of Good Loans	% Bad Loans	Distribution Bad (DB)	Distribution Good (DG)	WOE
(-;12]	9640	1935	7705	20,1%	0,199	0,129	-0,432
(12;24]	9955	1840	8115	18,5%	0,189	0,136	-0,330
(24;48]	10976	1734	9242	15,8%	0,178	0,155	-0,141
(48;72]	7183	1025	6158	14,3%	0,105	0,103	-0,021
(72;84]	21865	2452	19413	11,2%	0,252	0,325	0,255
(84; + inf)	9896	757	9139	7,6%	0,078	0,153	0,677

In the PROC LOGISTIC output we get the following estimates:

Analysis of Maximum Likelihood Estimates
Parameter		DF	Estimate	Standard Error	Wald Chi-Square	Pr > ChiSq
Intercept		1	-2,5178	0,0415	3684,029	<,0001
Work_exp_BIN	(-;12]	1	1,1389	0,0498	522,9748	<,0001
Work_exp_BIN	(12;24]	1	1,0335	0,0501	425,9453	<,0001
Work_exp_BIN	(24;48]	1	0,8289	0,0503	272,0801	<,0001
Work_exp_BIN	(48;72]	1	0,7398	0,0553	178,9733	<,0001
Work_exp_BIN	(72;84]	1	0,4355	0,0476	83,6209	<,0001

We can score the data set or add the statement output p = pred_prob to get probabilities, but I calculated them myself:

1/(1+exp(-1*(intercept+(Work_exp_BIN_estimate*Work_exp_BIN))))
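
As a minimal sketch, the probability formula above can be written as a DATA step using the intercept and bin estimates from the output table. The data set name BinProb and the variable names are illustrative; BinNo 1 to 5 follow the order of the estimates table, and BinNo 6 is the reference bin, entered with an estimate of zero.

```sas
data BinProb;
   /* Bin estimates in table order; the last entry (0) is the reference bin */
   array b[6] _temporary_ (1.1389 1.0335 0.8289 0.7398 0.4355 0);
   Intercept = -2.5178;
   do BinNo = 1 to dim(b);
      Estimate = b[BinNo];
      /* The dummy variable for a row's own bin is 1, so only its estimate enters */
      Prob = 1 / (1 + exp(-(Intercept + Estimate)));
      output;
   end;
run;

proc print data=BinProb; run;
```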

We know that for the reference category the estimate is zero, so in fact only the intercept remains in the exponent. After calculating the probability I can convert it to a risk score using, for example, the first formula from my first post:

Score = 33,561144 + 20/ln(2)*ln(odds)

category	prob	Score
(-;12]	0,2011857	79,829441
(12;24]	0,1847788	82,284014
(24;48]	0,1559206	87,183777
(48;72]	0,1445503	89,368574
(72;84]	0,1108291	97,033263
(84 ; +inf)	0,0746197	108,44742
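
The conversion can be sketched as a DATA step that takes the probabilities from the table above, with Offset = 33,561144 and Factor = 20/ln(2) as in the formula. The sketch computes the score two ways as a check: with odds of good to bad, (1 - p)/p, and with 1/p. The posted Score column appears to match the 1/p version (about 79,83 for the first bin), whereas odds of (1 - p)/p give roughly 73,3 for that bin. The variable names ScoreOdds and ScoreInvP are illustrative.

```sas
data BinScore;
   /* Probabilities copied from the table in the post */
   array p[6] _temporary_ (0.2011857 0.1847788 0.1559206
                           0.1445503 0.1108291 0.0746197);
   Offset = 33.561144;
   Factor = 20 / log(2);   /* LOG is the natural logarithm in SAS */
   do BinNo = 1 to dim(p);
      Prob = p[BinNo];
      ScoreOdds = Offset + Factor * log((1 - Prob) / Prob); /* odds good:bad */
      ScoreInvP = Offset + Factor * log(1 / Prob);          /* for comparison */
      output;
   end;
run;

proc print data=BinScore; run;
```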

The result would have completely satisfied me, but with more than one variable it becomes difficult to calculate the risk score for each variable separately (since the intercept is common to the entire model, and the probability is calculated from all factors). The second formula solves this problem, but if you use zero as the beta coefficient, then even with large WOE values the risk score will be the average, since the first term of the equation will be equal to zero. Is it possible to get a standardized logistic regression coefficient for a single variable? I also do not rule out that I interpreted the coefficients in the formula incorrectly (Credit Risk Scorecards: Developing and Implementing Intelligent Credit Scoring, Naeem Siddiqi, p. 116), so I would be very grateful if you would correct me. Thank you in advance!
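
One possible reading of the second formula under reference cell coding, offered as an assumption rather than as the book's method: with dummy-coded bins, the WOE-times-coefficient term for a bin is replaced by that bin's own estimate (zero for the reference bin), while the intercept a and the Offset are split equally across the n variables. The sketch also assumes that the model predicts the bad event and that the score rises with the odds of good to bad, so the linear predictor enters with a negative sign. Under this reading the reference bin still receives points from its a/n and Offset/n shares. The example uses the single variable from this post (n = 1); the names Points and Check are illustrative.

```sas
data BinPoints;
   /* Bin estimates in table order; the last entry (0) is the reference bin */
   array b[6] _temporary_ (1.1389 1.0335 0.8289 0.7398 0.4355 0);
   Intercept = -2.5178;
   Offset    = 33.561144;
   Factor    = 20 / log(2);
   n         = 1;            /* number of variables in the model */
   do BinNo = 1 to dim(b);
      Estimate = b[BinNo];
      /* Assumed allocation: each variable carries 1/n of intercept and offset */
      Points = -(Estimate + Intercept / n) * Factor + Offset / n;
      /* With n = 1 this should equal Offset + Factor*ln((1 - p)/p) */
      Prob  = 1 / (1 + exp(-(Intercept + Estimate)));
      Check = Offset + Factor * log((1 - Prob) / Prob);
      output;
   end;
run;

proc print data=BinPoints; run;
```

With several variables, summing Points over the bins a client falls into gives Offset - Factor*(a + sum of the selected estimates), which is the first formula applied to the whole model under the same sign assumption.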
