Mathis1
Quartz | Level 8

Hello,

I ran PROC GLM on a table of 1,200 observations, with the option p=pred.

 

proc glm data=TABLE outstat=Results;
   class A B C D E F G;
   model Y = A B C D E F G / solution;
   output out=Results2 p=pred;
run;
quit;

 

The issue is that the predicted values in the table Results2 are far too close (nearly always equal) to the dependent variable.

When I try to compute some Y values manually (from the intercept and the coefficients), I get a predicted value that is not equal to the "pred" value produced automatically by SAS.

Could it be related to the fact that some variables have a lot of levels (more than 100 levels for variables A and B)? Despite this, the estimated coefficients seem to fit the data quite well, because my manual computation of Y (from the coefficients) comes quite close to the real value of Y.
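(One way to cross-check a manual computation is to let SAS rescore the data itself. A minimal sketch, assuming a SAS release in which PROC GLM supports the STORE statement, i.e. 9.22 or later; GLMFit and pred_plm are arbitrary names:)

proc glm data=TABLE;
   class A B C D E F G;
   model Y = A B C D E F G / solution;
   store out=GLMFit;   /* save the fitted model as an item store */
run;
quit;

proc plm restore=GLMFit;
   score data=TABLE out=Rescored predicted=pred_plm;   /* pred_plm should match p=pred */
run;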

 

Thank you in advance for your reply 🙂

8 REPLIES
ballardw
Super User

I'm afraid that you have to provide some data.

If your model is good, then the predicted values should be close to the observed values of the dependent variable. So I am wondering what the concern is.

An additional concern with class variables that have hundreds of levels is that you may not have much data (read: number of observations) for some combinations, and you may be running into issues because of that.

 

Since you say you calculated some values "manually", you need to show the code you used, or describe very carefully how you did so. It is quite possible that, if you took values from the printed output for your manual calculation, you did not use enough decimal places.
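For example, the full-precision estimates can be captured with ODS instead of retyping rounded numbers from the listing (a sketch using the variable names from your code; the dataset name pe is arbitrary):

/* Capture the SOLUTION coefficients at full machine precision */
ods output ParameterEstimates=pe;
proc glm data=TABLE;
   class A B C D E F G;
   model Y = A B C D E F G / solution;
run;
quit;

proc print data=pe;
   format Estimate best32.;   /* show all available digits */
run;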

 

Having done linear regression by hand, with pencil and paper, at one time, I know very well that truncation/rounding can very significantly distort results.

SteveDenham
Jade | Level 19

This is a bit off-topic, but truncation/rounding ALWAYS results in odd things happening.  Look at some of the maximum likelihood iterative methods, the simulation method for multiple comparisons, or the Bayesian methods.  Once you do enough calculations, the fact that you can't express base 10 fractions EXACTLY in binary/hex leads to the accumulation of errors and disparate results.  Hopefully, everything gets to a decent stopping point before the errors overwhelm the estimation. Physicists and biological modelers learned that the hard way.
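A quick DATA step illustration of the point (nothing to do with GLM specifically; 0.1 simply has no exact binary representation):

/* Adding 0.1 ten times does not give exactly 1 in double precision */
data _null_;
   x = 0;
   do i = 1 to 10;
      x + 0.1;
   end;
   diff = x - 1;
   put x= best32. diff= best32.;   /* diff is tiny, but not zero */
run;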

 

SteveDenham

PaigeMiller
Diamond | Level 26

When I try to compute some Y values manually (from the intercept and the coefficients), I get a predicted value that is not equal to the "pred" value produced automatically by SAS.

 

It's hard to imagine that SAS has it wrong and the manual calculations are correct. My general rule of thumb is that if SAS says one thing and a programmer says something else, I believe SAS. But you have to show us the data and the results from SAS.

--
Paige Miller
djmangen
Obsidian | Level 7

Agree with needing the data and results. If the class variables have hundreds of categories, what is the typical sample size for each category? I'm going to guess that you've nearly run out of degrees of freedom.
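A quick way to check (just a sketch, using the dataset name from the original post) is to count the levels of each class variable and look at how thin the cells are:

/* How many distinct levels does each class variable have? */
proc freq data=TABLE nlevels;
   tables A B C D E F G / noprint;
run;

/* How many observations fall in each level of the biggest variable? */
proc freq data=TABLE order=freq;
   tables A / maxlevels=20;   /* show only the 20 most frequent levels */
run;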

Mathis1
Quartz | Level 8

Thank you all for your replies. The picture below gives an example of what I mean by a "not valid" pred:

[Image: Capture.PNG, a sample of eight observations with their pred values]

So as you can see, 7 out of the 8 observations are "correctly" predicted. But the second observation has a pred with only one decimal place (which I find weird when you compare it to the other predicted values). Besides, when I compute the predicted value manually (from the coefficients), I find 15.26. Another thing: the error (TI_NUM - pred) is a long decimal number, while the "pred" value is only 2 decimals long.

 

Here is a QQ plot of the errors:

[Image: QQPLOTPNG.PNG, a QQ plot of the errors]

I suspect the flat part in the very middle of the percentiles is due to those "not valid" preds that make the error equal to 0.

 

Finally, here is a screenshot of the ANOVA output from the PROC GLM; the table is 1,127 observations long:

[Image: ANOVAPNG.PNG, the PROC GLM ANOVA output]

 

 

Last but not least, please find attached the table of coefficients with their standard errors and t-test results. I purposely don't give the names of the different levels, but you will find the name of the corresponding variable for each given coefficient.

 

 

Thank you very much for your help so far, sincerely.

PaigeMiller
Diamond | Level 26

So let's review. Several people have asked to see your data, which has not yet been provided. I understand why you don't want to share the original data with us, but that is a major barrier that prevents us from digging further. You could anonymize the data so that category "George Freeman" becomes "Category 12".

 

Also, if you have a cell with only a single data point, you could get exact predictions, so your row 2 is not necessarily indicative of a problem. How many data points are in the level predicted by row 2? Or if the variability within the cell is zero, you could also get exact predictions.
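A toy example of that first case (made-up data, purely to illustrate):

/* Level "b" occurs once, so its prediction equals its observed value */
data demo;
   input g $ y;
   datalines;
a 10
a 12
b 15.2
;

proc glm data=demo;
   class g;
   model y = g;
   output out=demo_pred p=pred;
run;
quit;

proc print data=demo_pred;   /* obs 3: y = 15.2 and pred = 15.2 exactly */
run;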

 

Lastly, if you are going to claim that SAS has gotten it wrong and your calculations are correct, you need to provide your calculations (which was also a request earlier in this thread), but I go back to my own personal rule of thumb (maybe it should be a "maxim"?) that says when SAS and the programmer disagree, I believe SAS. It takes a lot of evidence to come to the conclusion that the calculations from SAS are wrong, and you have not even started down that path of providing a lot of evidence.

--
Paige Miller
Mathis1
Quartz | Level 8

Thank you for your messages. 

Please find attached the data set that I used for my GLM; sorry for having been reluctant to disclose it earlier.

 

When executing my PROC GLM, I used all the variables as categorical variables.

Please don't hesitate to tell me where the dataset shows shortcomings and how its design may be improved...

 

Thank you again for your help 🙂

 

NB: I saw a couple of times in this thread the idea of the programmer claiming to be right and SAS being wrong. I never used those words, nor have I ever made a value judgment about the software. I'm very conscious of my lack of knowledge and experience on the subject, and I am also aware that what I call a "non valid" prediction is so because of me and the way I used SAS and my dataset. The help that I'm seeking is precisely about knowing where I'm wrong. Far be it from me to claim that I am right.

 

That being said... I thank you in advance for your help.

 

 

djmangen
Obsidian | Level 7

Frankly, I am not at all surprised that this is happening. Given that n = 1200 and some of your categorical variables have more than 100 categories, you are almost certainly seriously over-fitting this equation. My recommendation is that you take at least two steps back and rethink the model. How detailed should the different measures in the model be? How are you possibly going to interpret a measure with over 100 categories in it?
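To put rough numbers on it (purely illustrative, since the exact level counts are not known): with A and B at 100+ levels each, those two variables alone consume at least (100 - 1) + (100 - 1) = 198 model degrees of freedom, and the other five class variables plus the intercept add more. Against roughly 1,127 usable observations, that leaves comparatively few error degrees of freedom, and any level, or sparse combination of levels, that identifies only one or two observations will be fit almost exactly. That is consistent with the near-perfect preds reported above.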

 

As for SAS being wrong and your manual calculations correct -- it does happen, but I doubt that is the case here.
