Hi,
I am using PROC PRINCOMP to reduce the dimension of a dataset. I have var1-var30, and after the PCA I have prin1-prin10.
I have created a logistic regression model with PROC LOGISTIC, using the principal components prin1-prin10 as input variables.
Once my logistic model is created and validated, I want to score the model on a new dataset.
In my new dataset I have the original variables var1-var30, and to apply the model I have to calculate
prin1-prin10 in this new dataset. I have some questions:
- I suppose that I have to create prin1-prin10 in the new dataset with the coefficients obtained from the first dataset, for example:
prin1 = c1*var1 + c2*var2 + ... + cN*varN. Is that right?
- Is there a quick way to do this transformation, such as an option of PROC PRINCOMP, or do I have to write out all the formulas myself?
Any advice will be greatly appreciated.
You have to do this in two stages. For the original data:
1. Save the PRINCOMP loadings by using the OUTSTAT= option. Save the PCs by using the OUT= option.
2. Run PROC LOGISTIC on the PCs. Use the STORE statement to save the logistic model.
Then for the new data:
3. Use PROC SCORE to generate the PC scores for the new data.
4. Use PROC PLM to evaluate the logistic model on the PCs for the new data.
I think you ought to be able to modify the following code for your data:
data Fitness;
call streaminit(123);
input Age Weight Oxygen RunTime RestPulse RunPulse @@;
survived = rand("Bernoulli", 0.5);
datalines;
44 89.47 44.609 11.37 62 178 40 75.07 45.313 10.07 62 185
44 85.84 54.297 8.65 45 156 42 68.15 59.571 8.17 40 166
38 89.02 49.874 9.22 55 178 47 77.45 44.811 11.63 58 176
40 75.98 45.681 11.95 70 176 43 81.19 49.091 10.85 64 162
44 81.42 39.442 13.08 63 174 38 81.87 60.055 8.63 48 170
44 73.03 50.541 10.13 45 168 45 87.66 37.388 14.03 56 186
;
proc princomp data=Fitness N=3 plots=none
outstat=PCModel /* for scoring new data */
out=PCData; /* the original data + PCs */
var Age Weight RunTime RunPulse RestPulse;
run;
proc logistic data=PCData;
model survived = Prin1-Prin3;
store out=LogiModel; /* item store for scoring model */
run;
/* For the example, create the "new" data set. You won't need to do this... */
data NewData;
set Fitness;
Age = Age - 5;
Weight = Weight - 10;
call streaminit(321);
Oxygen = Oxygen + rand("Normal");
RunTime = RunTime - 1;
run;
/* use new data and produce PC scores */
proc score data=NewData score=PCModel /* from OUTSTAT= option */
out=NewPCData; /* PCs for new data */
var Age Weight RunTime RunPulse RestPulse;
run;
/* and use PC scores to score the logistic model */
proc plm restore=LogiModel; /* from STORE statement in LOGISTIC */
score data=NewPCData out=NewScore / ilink; /* score new data */
run;
proc print data=NewScore; /* final: logistic model applied to new data */
var Prin1-Prin3 Predicted;
run;
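To connect this back to the formula in the question: assuming the default correlation-based analysis (no COV option on PROC PRINCOMP), the scores are linear combinations of the standardized variables, not the raw ones. PROC SCORE takes the means and standard deviations of the original data from the OUTSTAT= data set, so for component k the score of a new observation is, in my own notation (these symbols are not SAS names):

```latex
\mathrm{Prin}_k \;=\; \sum_{j=1}^{p} c_{jk}\,\frac{x_j - \bar{x}_j}{s_j}
```

Here x_j is the value of the j-th variable in the new observation, \bar{x}_j and s_j are its mean and standard deviation in the original data, c_{jk} is the j-th coefficient of the k-th eigenvector, and p is the number of variables in the VAR statement. This is why you do not have to write the formulas yourself, and also why re-running PROC PRINCOMP on the new data would give different components.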
Look at PROC SCORE and/or PROC PLM. PROC PLS does this in one step, so consider it as an overall alternative as well.
Not sure there is any advantage in doing this. You end up with a model including all the variables anyway, with coefficients that do not necessarily make sense. But anyhow, here is how to do the scoring:
/* Example dataset from proc factor doc */
data SocioEconomics;
input Population School Employment Services HouseValue;
datalines;
5700 12.8 2500 270 25000
1000 10.9 600 10 10000
3400 8.8 1000 10 9000
3800 13.6 1700 140 25000
4000 12.8 1600 140 25000
8200 8.3 2600 60 12000
1200 11.4 400 10 16000
9100 11.5 3300 60 14000
9900 12.5 3400 180 18000
9600 13.7 3600 390 25000
9600 9.6 3300 80 12000
9400 11.4 4000 100 13000
;
/* Principal components from 4 variables to 2 components */
proc princomp
data=socioeconomics
n=2
outstat=sestats
out=sescores
plots=none ;
var population -- services;
run;
proc print data=sestats; run;
/* How to find the scores from proc princomp stats. For this test,
use the original data again. You can use any dataset with the same
variables used to calculate the principal components. Here, for example,
you require variables Population School Employment Services */
proc score data=SocioEconomics score=sestats out=sescores2;
var population -- services;
run;
/* Check that the new scores match the original scores */
proc compare
base=sescores /* from proc princomp */
compare=sescores2 /* from proc score */
method=relative(1e-7)
briefsummary;
run;
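If you want to see what PROC SCORE actually works from, the OUTSTAT= data set has a _TYPE_ column. As far as I know, the MEAN and STD rows are used to standardize the incoming data and the SCORE rows hold the component coefficients. A quick sketch, reusing the sestats data set created above:

```sas
/* Show only the rows of the PRINCOMP statistics that scoring relies on */
proc print data=sestats noobs;
   where _TYPE_ in ('MEAN', 'STD', 'SCORE');
run;
```

The SCORE rows are the c1, c2, ... coefficients asked about in the original question, one row per component.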
Thanks PGStats,
Why do you say 'Not sure there is any advantage in doing this'?
I don't understand... I have to do the PCA to reduce the dimension, and then I have to work with these new vars...
Since you need all the variables to calculate the principal components, are you really reducing the dimension? You're creating new variables/features, but the inputs you need to run the model are the same.
PROC PLS does this in one step, you should really look into it a bit more.
@Reeza, PLS would be a good alternative for linear models, but it doesn't do logistic regression.
It all depends on the purpose of your model and the number of observations that you have to build it on. By using PCA, your model becomes essentially a black box where the role of your original variables is very difficult to track.
Why not use variable cluster analysis to reduce the dimension?
PROC VARCLUS;
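A minimal sketch of what that could look like for the original question. The data set name Train and the limit of 10 clusters are assumptions for illustration only, not something taken from the thread:

```sas
/* Group var1-var30 into at most 10 clusters of correlated variables.
   Train is a hypothetical name for the model-building data set. */
proc varclus data=Train maxclusters=10 short;
   var var1-var30;
run;
```

You could then keep one representative variable per cluster (for example, the one with the lowest 1-R**2 ratio in the output) and use those directly in PROC LOGISTIC, so the model stays in terms of the original variables and the new data needs no transformation before scoring.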