Solved: Re: Using proc princomp to generate a model

juanvg1972 · Posted 10-03-2017 02:28 PM

Hi,

I am using proc princomp to reduce the dimensions of a dataset. I have var1-var30 and now, with PCA, prin1-prin10.

I have created a logistic regression model with proc logistic, using the principal components prin1-prin10 as input variables.

Once my logistic model is created and validated, I want to score a new dataset with it.

In my new dataset I have the original variables var1-var30, and to apply the model I have to calculate prin1-prin10 in this new dataset. I have some questions:

- I suppose that I have to create prin1-prin10 in the new dataset with the coefficients obtained from the first dataset, for example:

prin1 = c1*var1 + c2*var2 + ... + cN*varN. Is that right?

- Is there a quick way to do this transformation, such as an option of proc princomp? Or do I have to write out all the formulas?

Any advice will be greatly appreciated.

Rick_SAS · Posted 10-03-2017 03:24 PM

You have to do this in two stages. For the original data:

1. Save the PRINCOMP loadings by using the OUTSTAT= option. Save the PCs by using the OUT= option.

2. Run PROC LOGISTIC on the PCs. Use the STORE statement to save the logistic model.

Then for the new data:

3. Use PROC SCORE to generate the PC scores for the new data.

4. Use PROC PLM to evaluate the logistic model on the PCs for the new data.

I think you ought to be able to modify the following code for your data:

data Fitness;
   call streaminit(123);
   input Age Weight Oxygen RunTime RestPulse RunPulse @@;
   survived = rand("Bernoulli", 0.5);
   datalines;
44 89.47  44.609 11.37 62 178     40 75.07  45.313 10.07 62 185
44 85.84  54.297  8.65 45 156     42 68.15  59.571  8.17 40 166
38 89.02  49.874  9.22 55 178     47 77.45  44.811 11.63 58 176
40 75.98  45.681 11.95 70 176     43 81.19  49.091 10.85 64 162
44 81.42  39.442 13.08 63 174     38 81.87  60.055  8.63 48 170
44 73.03  50.541 10.13 45 168     45 87.66  37.388 14.03 56 186
;
proc princomp data=Fitness N=3 plots=none
   outstat=PCModel      /* for scoring new data */
   out=PCData;          /* the original data + PCs */
   var Age Weight RunTime RunPulse RestPulse;
run;

proc logistic data=PCData;
   model survived = Prin1-Prin3;
   store out=LogiModel;    /* item store for scoring model */
run;

/* For the example, create the "new" data set. You won't need to do this...  */
data NewData;
   set Fitness;
   Age = Age - 5;
   Weight = Weight - 10;
   call streaminit(321);
   Oxygen = Oxygen + rand("Normal");
   RunTime = RunTime - 1;
run;

/* use new data and produce PC scores */
proc score data=NewData score=PCModel   /* from OUTSTAT= option */
                        out=NewPCData;  /* PCs for new data */
   var Age Weight RunTime RunPulse RestPulse;
run;
/* and use PC scores to score the logistic model */
proc plm restore=LogiModel;         /* from STORE statement in LOGISTIC */
   score data=NewPCData out=NewScore / ilink; /* score new data */
run;

proc print data=NewScore; /* final: logistic model applied to new data */
   var Prin1-Prin3 Predicted;
run;
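
A note on what PROC SCORE is doing in the third step, since the question asked about the formula. By default PROC PRINCOMP analyzes the correlation matrix, so each score is a linear combination of the standardized variables: Prin1 = c1*(var1 - mean1)/std1 + ... + cN*(varN - meanN)/stdN. The means, standard deviations, and coefficients all come from the original data and are stored in the OUTSTAT= data set, which is why PROC SCORE can apply them to new data. As a small sketch (reusing the PCModel data set name from the example above), you can look at the rows PROC SCORE reads:

```sas
/* Inspect the rows of the OUTSTAT= data set that PROC SCORE uses.
   PCModel is the OUTSTAT= data set from the PROC PRINCOMP step above. */
proc print data=PCModel noobs;
   where _TYPE_ in ("MEAN", "STD", "SCORE");
run;
```

If PROC PRINCOMP is run with the COV option instead, the STD row is omitted and the variables are centered but not scaled before scoring.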

Reeza · Posted 10-03-2017 02:41 PM

Look at PROC SCORE and/or PROC PLM. PROC PLS does this in one step, so consider it as an alternative overall as well.

PGStats · Posted 10-03-2017 03:41 PM

Not sure there is any advantage in doing this. You end up with a model including all the variables anyway, with coefficients that do not necessarily make sense. But anyhow, here is how to do the scoring:

/* Example dataset from proc factor doc */
data SocioEconomics;
   input Population School Employment Services HouseValue;
   datalines;
5700     12.8      2500      270       25000
1000     10.9      600       10        10000
3400     8.8       1000      10        9000
3800     13.6      1700      140       25000
4000     12.8      1600      140       25000
8200     8.3       2600      60        12000
1200     11.4      400       10        16000
9100     11.5      3300      60        14000
9900     12.5      3400      180       18000
9600     13.7      3600      390       25000
9600     9.6       3300      80        12000
9400     11.4      4000      100       13000
;

/* Principal components from 4 variables to 2 components */
proc princomp 
    data=socioeconomics 
    n=2 
    outstat=sestats 
    out=sescores
    plots=none ;
var population -- services;
run;

proc print data=sestats; run;

/* How to compute the scores from the proc princomp statistics. For this test,
   use the original data again. You can use any dataset that contains the
   variables used to calculate the principal components. Here, for example,
   you need the variables Population, School, Employment, and Services */
proc score data=SocioEconomics score=sestats out=sescores2;
var population -- services;
run;

/* Check that the new scores match the original scores */
proc compare 
    base=sescores      /* from proc princomp */
    compare=sescores2  /* from proc score */
    method=relative(1e-7) 
    briefsummary; 
run;

PG
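
To see that PROC SCORE applies the formula from the original question (on standardized variables), the first component can be rebuilt by hand. This is only a sketch under some assumptions: it uses the default correlation-based PCA from the example above, and the data set name manual, the variable Prin1_manual, and the array names are made up for illustration.

```sas
/* Rebuild Prin1 by hand from the OUTSTAT= data set (sestats) */
data manual(keep=Population School Employment Services Prin1_manual);
   array m[4] _temporary_;   /* means from the original data */
   array s[4] _temporary_;   /* standard deviations from the original data */
   array c[4] _temporary_;   /* Prin1 coefficients (eigenvector) */
   array x[4] Population School Employment Services;

   /* On the first iteration, read the MEAN, STD, and SCORE rows */
   if _n_ = 1 then do until (done);
      set sestats end=done;
      do j = 1 to 4;
         if _TYPE_ = "MEAN" then m[j] = x[j];
         else if _TYPE_ = "STD" then s[j] = x[j];
         else if _TYPE_ = "SCORE" and upcase(_NAME_) = "PRIN1" then c[j] = x[j];
      end;
   end;

   /* Score each observation: sum of coefficient * standardized value */
   set SocioEconomics;
   Prin1_manual = 0;
   do j = 1 to 4;
      Prin1_manual = Prin1_manual + c[j] * (x[j] - m[j]) / s[j];
   end;
run;
```

Prin1_manual should agree with Prin1 in sescores and sescores2 up to rounding. In practice PROC SCORE is the simpler route; the DATA step is only meant to show what it computes.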

juanvg1972 · Posted 10-03-2017 05:33 PM

Thanks PGStats,

Why do you say 'Not sure there is any advantage in doing this'??

I don't understand... I have to do the PCA to reduce dimensions, and then I have to work with these new variables...

Reeza · Posted 10-03-2017 05:48 PM

@juanvg1972 wrote:

Thanks PGStats,

Why do you say 'Not sure there is any advantage in doing this'??

I don't understand... I have to do the PCA to reduce dimensions, and then I have to work with these new variables...

Since you need all the variables to calculate the principal components, are you really reducing the dimension? You're creating new variables/features, but what you need to run the model is the same.

PROC PLS does this in one step; you should really look into it a bit more.

PGStats · Posted 10-03-2017 06:04 PM

@Reeza, PLS would be a good alternative for linear models, but it doesn't do logistic regression.

PG

PGStats · Posted 10-03-2017 05:57 PM

It all depends on the purpose of your model and the number of observations that you have to build it on. By using PCA, your model becomes essentially a black box where the role of your original variables is very difficult to track.

PG

Ksharp · Posted 10-05-2017 03:59 AM

Why not use variable cluster analysis to reduce the dimension?

PROC VARCLUS;
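
As a rough sketch of that suggestion, here is what a call could look like on the Fitness example data from earlier in the thread. The MAXCLUSTERS=3 value is an arbitrary choice for illustration, not a recommendation:

```sas
/* Sketch only: cluster the predictors; MAXCLUSTERS=3 is arbitrary */
proc varclus data=Fitness maxclusters=3 short;
   var Age Weight RunTime RunPulse RestPulse;
run;
```

A common follow-up is to keep one representative variable per cluster (for example, the one with the lowest 1-R**2 ratio) and use those original variables directly in PROC LOGISTIC, so new data can be scored without a PROC SCORE step.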