10-23-2013 11:17 AM
I have a question about orthogonal polynomials. Any comment and hint will be greatly appreciated! Thanks in advance!
I have a project that needs polynomial terms when building a GLM (proc genmod). I have several continuous variables that I want to put into the model using orthogonal polynomials. While I checked the SAS online documentation and found the only way to build orthogonal polynomial is using orpol function in proc iml. The difficulty is that my data set has around 7 million observations and I am afraid it will be slow in proc iml: apply the orpol function to the variables one by one and append the orthogonal polynomials to the original data set then run the model as a regular regression.
My question is: Does anyone of you know if there is any method I can do orthogonal polynomials in proc genmod or in data steps? I checked some references listed on the SAS documentation and did not find a quick way to implement the algorithms mentioned in the papers. If you could give me some hint on this, I would highly appreciate it!
Another question is: after using orpol function, should I put the constant term in the model too? How to put a constant in a regression model? Like the first column in the below:
x = T(do(-1,1,0.2));
0.3015113 -0.476731 0.5120916
0.3015113 -0.381385 0.2048366
0.3015113 -0.286039 -0.034139
0.3015113 -0.190693 -0.204837
0.3015113 -0.095346 -0.307255
0.3015113 -4.21E-18 -0.341394
0.3015113 0.0953463 -0.307255
0.3015113 0.1906925 -0.204837
0.3015113 0.2860388 -0.034139
0.3015113 0.381385 0.2048366
0.3015113 0.4767313 0.5120916
10-23-2013 11:49 AM
You do not need to include the constant terms. They will all be a multiple of the 1 vector, and PROC GENMOD will automatically add an intercept.
I suppose you have a reason for wanting to do this, but I wonder if you have considered whether adding orthogonal polynomials for EACH variable is really what you want. The orthogonal polynomials for each variable will be orthogonal to each other, but there will still be correlations between different variables. An alternative to consider is whether you want to do a regression on the principal components of the original variables. The principal components will also be orthogonal, and you can use the usual MODEL statement to model interactions:
MODEL Y = PC1 | PC2 | PC3 @2;
10-23-2013 11:59 AM
The size of your data set is not relevant in the use of IML for the coefficients. You only use IML to get the values, as you did. Then you would put these into contrast or estimate statements in GENMOD. Note that in CONTRAST, the coefficients must add to 0, exactly (not just to two or three decimal places, or whatever). You would rescale first: multiply these long decimals by a big number and then put in a / divisor= option to bring these back to what you want.
But with GENMOD, you can get orthogonal polynomial parameterization directly within the procedure. Make sure the levels of your factors are numeric (not letters, names, etc.). Then use:
class factor1 factor2 / effect=orthpoly;
Check out the GENMOD documentation for different ways of specifying the CLASS coefficients. I haven't used this option myself (I have used other effect= options), so I don't recall what happens when you specify interactions.
10-23-2013 12:48 PM
Are your variables continuous? Using the CLASS statement is going to generate a design matrix with as many columns as there are unique values in your variables. If your variables are continuous, this will generate about 7 million columns, which is not what you want to do.
10-23-2013 01:25 PM
Thanks for all your reply!
My variables are continuous and the reason why I want to use orthogonal polynomials is that I want to avoid the multicollinearity between x^1 and x^2 when putting polynomial terms in regression. The correlation between variables is not concerned here.
10-23-2013 01:52 PM
My suggestion would only work if you have a limited number of levels of the variables. The fact that you wanted to do orthogonal polynomials suggested a relatively small number of levels (unique values) for the variables. This is the usual situation when someone would use orthpolys. For your situation, you should not use my idea, nor should you use contrasts to deal with the topic. There are some examples in Littell et al. (2006), SAS for Mixed Models, 2nd edition, that would help you (for continuous and categorical variables in the same model).
I personally would not worry much about orthogonality, a luxury that you don't need (or maybe can't afford). Especially with millions of observations. Are you working with nonnormal response variable? If you can assume normality, then I highly recommend PROC PLS for dealing with the correlations of the variables. This is a great procedure, primarily intended for all continuous predictor variables.No matter what approach you take, you probably will want to first work on a sample of your full data set to get the bugs worked out. Then fit the model to the full data set.
10-23-2013 01:51 PM
How many variables? I used degree=2 for 10 variables and 1 million observations. It took about 10-12 minutes. (I didn't get an exact time.) How long does your computaiton take?
10-23-2013 02:22 PM
7 continuous variables (some are in fact categorical ones but we want to treat them as continuous).
The proc iml with orpol function took around 1 hour to finish for the 7 variables/7 million obs.
I looked at those orthogonal polynomials and they seem very small: do they look reasonable? Is it because they are standardized?
here are the original variables:
Could you please name some reference that I can read to understand the algorithm how SAS does this? Thanks!
10-24-2013 02:59 PM
The SAS/IML Users Guide documents the ORPOL algorithm.
Yes, each element will be small because the column has unit norm.
That is, the sum of the squares of the elements is 1. That means that the average size is
sqrt(1/N), which is 0.00038 when N=7million.
10-24-2013 03:16 PM
Thanks Rick! My coefficients for those orthogonal terms are pretty large( around hundred) compared to other variables. I think it is due to the normalization, is there a way to use the original scale in the regression?