I am trying to model count data to rank-order the risk of accounts going bad by grade and band. The data look like this:
Grade | band | year | bad rate | Total Accs | Bad Accs |
A | A-(0-45%) | 2016 | 0.31% | 3924 | 12 |
A | A-(0-45%) | 2013 | 0.20% | 51556 | 103 |
A | A-(0-45%) | 2014 | 0.24% | 49918 | 120 |
A | A-(0-45%) | 2015 | 0.25% | 59723 | 150 |
A | B-(>45-55%) | 2016 | 0.80% | 249 | 2 |
A | B-(>45-55%) | 2015 | 0.22% | 3664 | 8 |
A | B-(>45-55%) | 2013 | 0.32% | 3149 | 10 |
When I summarize the above data, I see the following observed bad rates by grade*band:
Grade | band | bad rate | Total Accs | Bad Accs |
A | A-(0-45%) | 0.23% | 165121 | 385 |
B | A-(0-45%) | 0.68% | 250156 | 1708 |
C | A-(0-45%) | 1.92% | 240478 | 4609 |
D | A-(0-45%) | 3.05% | 33809 | 1030 |
E | A-(0-45%) | 3.89% | 2853 | 111 |
F | A-(0-45%) | 1.52% | 7417 | 113 |
G | A-(0-45%) | 3.30% | 3026 | 100 |
I used PROC GENMOD with a Poisson distribution to model the above data and check whether it rank-orders the cells in line with the observed results.
proc genmod data=data;
class grade band;
model bad_Accs = Grade band Grade*band / dist=poisson link=log;
run;
I see the below results:
Analysis of Maximum Likelihood Parameter Estimates
Parameter | Grade | band | Estimate |
Grade*band | G | A-(0-45%) | 0 |
Grade*band | D | A-(0-45%) | 0.5404 |
Grade*band | A | A-(0-45%) | 0.9426 |
Grade*band | B | A-(0-45%) | 1.0461 |
Grade*band | C | A-(0-45%) | 1.2279 |
Grade*band | E | A-(0-45%) | 17.6989 |
Grade*band | F | A-(0-45%) | 17.7168 |
From the results, does this suggest that an F grade with A-(0-45%) is 17.71% more likely to go bad compared to other grades?
But in the observed results the bad rate is highest for grade E, so shouldn't grade E have the higher parameter estimate in GENMOD?
Or am I modeling the wrong variable? I feel like I should model Total Accs/Bad Accs instead of just Bad Accs to account for severity.
When I try to do that as below, it gives me an error:
proc genmod data=DTI;
class grade DTI;
model (Accs/bad_Accs)*100 = Grade DTI grade*DTI/ dist=poisson link=log;
run;
19 model (Accs/bad_Accs)*100 = Grade DTI grade*DTI/ dist=poisson link=log;
_
22
76
ERROR 22-322: Syntax error, expecting one of the following: a name, ','.
ERROR 76-322: Syntax error, statement will be ignored.
Any suggestions on how to model bad accounts so that severity is also taken into account?
If your data consist of a count of events and a count of total trials, use the events/trials syntax below, which fits a logistic model for this binomial response. You can use either LOGISTIC or GENMOD with the same syntax.
proc logistic data=DTI_mod;
class grade DTI;
model bad_accs/total_accs = Grade DTI grade*DTI;
run;
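For comparison, the same events/trials model can be sketched in GENMOD. This is only an illustration: it assumes a data set DTI with one row per grade/DTI/year holding bad_accs (events) and total_accs (trials), and the names pred_rates and p_hat are made up for the example. The fitted probability for each grade*DTI cell can then be put next to the observed bad rate.

```sas
/* Sketch only: assumes DTI has one row per grade/DTI/year with
   bad_accs (events) and total_accs (trials). */
proc genmod data=DTI;
  class grade DTI;
  model bad_accs/total_accs = grade DTI grade*DTI / dist=binomial link=logit;
  /* pred_rates and p_hat are hypothetical names for the output
     data set and the predicted probability of going bad */
  output out=pred_rates pred=p_hat;
run;

/* One predicted probability per grade*DTI cell, to compare
   with the observed bad rate in the summary table */
proc means data=pred_rates nway mean;
  class grade DTI;
  var p_hat;
run;
```

Comparing predicted probabilities rather than raw parameter estimates avoids having to interpret each estimate relative to the reference level.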
Use the OFFSET= option.
proc genmod data=DTI;
class grade DTI;
model bad_Accs= Grade DTI grade*DTI/ dist=poisson link=log offset=total_accs;
run;
The offset in @Ksharp's solution should be log(total_accs), but otherwise I agree. It is also possible to model the rate directly, as you suggest, provided you weight by total_accs:
proc genmod data=DTI;
class grade DTI;
model rate= Grade DTI grade*DTI/ dist=poisson link=log;
weight total_accs;
run;
where rate is a variable defined as bad_Accs/total_accs. The offset solution and the weight solution are equivalent (same estimates and standard errors).
But maybe a better solution here is to regard bad_accs as the outcome of a binomial distribution.
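As a short sketch of why the offset has to be on the log scale, and why the weighted-rate fit gives the same estimates, write $y_i$ for bad_accs, $n_i$ for total_accs, $r_i = y_i/n_i$ for the rate, and $\lambda_i = \exp(x_i^\top\beta)$ for the modeled rate. This notation is introduced here for illustration only.

```latex
% Offset form: the log link acts on the expected count,
% so the exposure enters as log(n_i) with coefficient fixed at 1.
\log E[y_i] = \log n_i + x_i^\top \beta
\quad\Longleftrightarrow\quad
\log \frac{E[y_i]}{n_i} = x_i^\top \beta

% Poisson log-likelihood contribution of the count y_i with mean n_i \lambda_i:
\ell_i^{\text{offset}}(\beta)
  = y_i \log(n_i \lambda_i) - n_i \lambda_i - \log y_i!
  = y_i \log \lambda_i - n_i \lambda_i + \text{const}

% Poisson kernel for the rate r_i with mean \lambda_i, weighted by n_i:
\ell_i^{\text{weight}}(\beta)
  = n_i \left( r_i \log \lambda_i - \lambda_i \right)
  = y_i \log \lambda_i - n_i \lambda_i
```

Both expressions depend on $\beta$ only through $y_i \log \lambda_i - n_i \lambda_i$, which is why the two fits agree. Using $n_i$ itself as the offset would instead multiply the expected count by $\exp(n_i)$.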
@Ksharp @JacobSimonsen thanks!
When I try to use @Ksharp's solution, I get the errors below.
proc genmod data=DTI;
class grade DTI;
/*weight accs;*/
model bad_Accs = Grade DTI grade*DTI/ dist=poisson link=log offset=log(accs);
run;
ERROR: Variable LOG not found.
(or)
proc genmod data=DTI;
class grade DTI;
/*weight accs;*/
model bad_Accs = Grade DTI grade*DTI/ dist=poisson link=log offset=accs;
run;
ERROR: The mean parameter is either invalid or at a limit of its range for some observations.
I have also tried your approach, @JacobSimonsen, but I still cannot relate the model results to the observed results.
proc genmod data=DTI_new;
class grade DTI;
/* rate = bad_accs/total_accs */
model rate = Grade DTI grade*DTI/ dist=poisson link=log;
weight total_accs;
run;
observed:
Grade | DTI | bad rate | Total Accs | Bad Accs |
A | A-(0-45%) | 0.23% | 165121 | 385 |
B | A-(0-45%) | 0.68% | 250156 | 1708 |
C | A-(0-45%) | 1.92% | 240478 | 4609 |
D | A-(0-45%) | 3.05% | 33809 | 1030 |
E | A-(0-45%) | 3.89% | 2853 | 111 |
F | A-(0-45%) | 1.52% | 7417 | 113 |
G | A-(0-45%) | 3.30% | 3026 | 100 |
model results:
Parameter | Grade | DTI | DF | Estimate |
Grade*DTI | A | A-(0-45%) | 1 | 0.1193 |
Grade*DTI | B | A-(0-45%) | 1 | 0.7107 |
Grade*DTI | C | A-(0-45%) | 1 | 1.202 |
Grade*DTI | D | A-(0-45%) | 1 | 0.5473 |
Grade*DTI | E | A-(0-45%) | 1 | 16.0838 |
Grade*DTI | F | A-(0-45%) | 1 | 15.1462 |
Grade*DTI | G | A-(0-45%) | 0 | 0 |
Maybe I am not reading it right (I am trying to relate the model estimates to the observed bad rate %), but the model doesn't seem to rank-order the bad rate by grade*DTI correctly.
You cannot put "log(accs)" into OFFSET=. You have to create a variable containing the log values in a data set before calling the procedure, and then name that variable in OFFSET=.
The message "ERROR: The mean parameter is either invalid or at a limit of its range for some observations" can occur because there is a level of the interaction term where the observed count is zero. I don't think it's a coding error.
Again, looking at your data, they look more like binomial data than Poisson-distributed data. Why do you want to use the Poisson distribution instead of the binomial distribution?
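A minimal sketch of the log-offset step described above, assuming the input data set is DTI with the variables bad_Accs and accs used earlier in the thread. The names DTI_offset and log_accs are hypothetical.

```sas
/* Create the log-exposure variable before calling the procedure. */
data DTI_offset;                          /* hypothetical data set name */
  set DTI;
  /* log_accs is a hypothetical variable name; rows with accs = 0
     or missing are left with a missing log_accs */
  if accs > 0 then log_accs = log(accs);
run;

/* OFFSET= takes a variable name, not an expression. */
proc genmod data=DTI_offset;
  class grade DTI;
  model bad_Accs = grade DTI grade*DTI / dist=poisson link=log offset=log_accs;
run;
```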
Thanks. I have restructured my data so that I have a two-level target variable and tried GENMOD with a binomial distribution. It gives me results similar to the Poisson model.
data:
Grade | DTI | year | bad | Accs |
A1 | A-(0-45%) | 2013 | N | 51453 |
A1 | A-(0-45%) | 2013 | Y | 103 |
A1 | A-(0-45%) | 2014 | N | 49798 |
A1 | A-(0-45%) | 2014 | Y | 120 |
A1 | A-(0-45%) | 2015 | N | 59573 |
A1 | A-(0-45%) | 2015 | Y | 150 |
A1 | A-(0-45%) | 2016 | N | 3912 |
code:
proc genmod data=DTI_mod descending;
class grade DTI;
/*weight accs;*/
model bad = Grade DTI grade*DTI/ dist=binomial link=log;
weight accs;
run;
log:
NOTE: PROC GENMOD is modeling the probability that bad='Y'.
WARNING: The negative of the Hessian is not positive definite. The convergence is questionable.
WARNING: The procedure is continuing but the validity of the model fit is questionable.
WARNING: The specified model did not converge.
NOTE: The Pearson chi-square and deviance are not computed since the AGGREGATE option is not specified.
WARNING: Negative of Hessian not positive definite.
NOTE: The scale parameter was held fixed.
NOTE: PROCEDURE GENMOD used (Total process time):
real time 0.15 seconds
cpu time 0.07 seconds
results:
Parameter | Grade | DTI | DF | Estimate |
Grade*DTI | A1 | A-(0-45%) | 1 | 0.1193 |
Grade*DTI | A2 | A-(0-45%) | 1 | 0.7107 |
Grade*DTI | A3 | A-(0-45%) | 1 | 1.202 |
Grade*DTI | D1 | A-(0-45%) | 1 | 0.5473 |
Grade*DTI | D2 | A-(0-45%) | 1 | 16.6717 |
Grade*DTI | DN | A-(0-45%) | 1 | 15.7341 |
Grade*DTI | DS | A-(0-45%) | 0 | 0 |
If the ACCS variable contains the count of the number of events and nonevents, then you should use
FREQ accs;
instead of using the WEIGHT statement. Frequencies and weights have different meanings in a regression.
@Rick_SAS Thanks for the link, it's very useful!
I am getting exactly the same results even after using FREQ accs;
proc genmod data=DTI_mod descending;
class grade DTI;
freq accs;
model bad = Grade DTI grade*DTI/ dist=binomial link=log;
run;
I think you are done. The message appears because you have some cells with 0. Therefore, not all parameters can be estimated with the data you have, which in turn causes the warnings in the log.
It's right that FREQ should be used instead of WEIGHT. The two statements give the same parameter estimates, but not always the same p-values.
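One way to look for the problem cells mentioned above is to total events and accounts per grade*DTI cell and flag cells with no bad accounts at all. This is a sketch that assumes the aggregated data set DTI with bad_Accs and accs; cell_totals, cell_check, bad_sum, accs_sum, zero_bad and all_bad are hypothetical names.

```sas
/* Total bad accounts and total accounts per grade*DTI cell. */
proc means data=DTI nway noprint;
  class grade DTI;
  var bad_Accs accs;
  output out=cell_totals sum=bad_sum accs_sum;
run;

/* Flag cells with no bad accounts, or with only bad accounts. */
data cell_check;
  set cell_totals;
  zero_bad = (bad_sum = 0);
  all_bad  = (bad_sum = accs_sum);
run;

proc print data=cell_check;
  where zero_bad = 1 or all_bad = 1;
run;
```

Grade/DTI combinations that never occur in the data do not show up in cell_totals at all, so a cross-tabulation of grade by DTI is a useful companion check.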