Hi all,
Is it true that when we use proc princomp, the variable prin1 in the output data must lie between -1 and 1?
I used several survey items (H1DS5 H1DS8 H1DS1 H1DS2 H1DS10) with exactly the same scale and I used proc princomp. I got
Eigenvalues of the Correlation Matrix (first two components):
Eigenvalue | Difference | Proportion | Cumulative
2.05794307 | 1.16465860 | 0.4116 | 0.4116
0.89328447 | 0.07840683 | 0.1787 | 0.5902
So I will use the first component, since only the first eigenvalue is greater than 1.
Then I opened the output data set and found the variable prin1. It is the component score that I should use as my latent factor, right?
But its distribution is not good.
It ranged from -.8 to 11.3! I know the distribution is skewed, but I cannot log-transform it because of the negative values.
In this case, if I use it as the outcome variable in my model, can I use OLS?
This is my code:
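/* Principal components of the five items (correlation matrix by default);
   OUT= saves the input data plus the score variables Prin1-Prin5. */
proc princomp data=wave1.wave1 out=wave1.pcsat_cor;
var H1DS5 H1DS8 H1DS1 H1DS2 H1DS10;
run;
/* Summary statistics and histogram of the first component score. */
proc univariate data=wave1.pcsat_cor;
var prin1;
histogram prin1;
run;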
Thank you!
@Lindy wrote:
Hi all,
Is it true that when we use proc princomp, the variable prin1 in the output data must lie between -1 and 1?
No, the scores are not restricted to that range.
So I will use the first component, since only the first eigenvalue is greater than 1.
There are other ways of determining how many components to use, such as a scree plot or the cumulative proportion of variance explained.
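One such alternative is to look at a scree plot next to the eigenvalue table. A minimal sketch, reusing the data set and items from the original post and assuming ODS Graphics is available in your SAS session:

```sas
/* Sketch only: request a scree plot in addition to the eigenvalue table.
   No OUT= is given, so the previously saved scores are not overwritten. */
ods graphics on;
proc princomp data=wave1.wave1 plots=scree;
   var H1DS5 H1DS8 H1DS1 H1DS2 H1DS10;
run;
```

Look for the point where the curve levels off, and compare it with the cumulative proportion column before settling on a number of components.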
It ranged from -.8 to 11.3! I know the distribution is skewed, but I cannot log-transform it because of the negative values.
The PRIN1 scores do not have to have any particular distribution. Transforming PRIN1 isn't something that is usually done. In fact, it may be that you have an outlier in PRIN1 (and possibly elsewhere). Have you plotted the distribution of PRIN1 to check for outliers? If there is a serious outlier, and you decide you should remove the outlier, then you would want to re-run the PCA analysis.
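To make the point about range concrete, here is a short sketch of how the scores are formed. It assumes the default correlation-matrix analysis (no COV or STD option, as in the posted code). The symbols z_j, a_j and lambda_1 are notation introduced here for illustration and do not come from the SAS output.

```latex
\begin{align*}
z_j &= \frac{x_j - \bar{x}_j}{s_j}, \qquad j = 1, \dots, 5 \\
\mathrm{Prin1} &= \sum_{j=1}^{5} a_j z_j, \qquad \sum_{j=1}^{5} a_j^2 = 1 \\
\operatorname{Var}(\mathrm{Prin1}) &= \lambda_1 = 2.058, \qquad
\operatorname{SD}(\mathrm{Prin1}) = \sqrt{2.058} \approx 1.43
\end{align*}
```

Only the eigenvector coefficients a_j are confined to [-1, 1]. The scores themselves have mean 0 and a standard deviation of about 1.43, so a maximum of 11.3 sits nearly 8 standard deviations above the mean, which is possible when the standardized items are strongly right-skewed.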
Thank you, Paige! I checked the frequency and distribution of prin1, and I found no particular outlier. The range of prin1 is -.77 to 12.95. The sample has about 4000 cases, and the majority of the respondents fall between -.77 and about 1 on prin1 (the delinquency score), but some respondents are spread evenly over values from 1 to 12.95.
In this case, should I go ahead and use prin1 as my outcome variable in OLS?
Thank you!
--lindy
@Lindy wrote:
In this case, should I go ahead and use prin1 as my outcome variable in OLS?
Well now you have opened up a whole new issue. I am on record as opposing the use of PCA results as inputs to OLS, despite the fact that 90% of the rest of the world goes ahead and does this, ignoring the fatal flaw of using PCA as input to OLS.
Why? Because PCA does not include information about the Y variables when it determines the components and scores. You can easily get components that are not predictive of Y. This is a fatal flaw of using PCA to predict Y variables. So there is really no reason to use PCA here if the ultimate goal is some prediction of a Y variable. You want to use a method of determining components that will produce components that are predictive of Y ... that method is called Partial Least Squares regression, which is PROC PLS in SAS. PLS will give you better predictions of Y than PCA ever will.
Thank you so much for your insights, Paige! I am not using PCA to predict Y.
My plan is like this.
I want to have a latent variable called "delinquency propensity" as Y in my model, and I have several independent variables, such as the parenting styles applied to the children, the children's school scores, etc. The majority of the independent variables are scaled variables.
Because there is no item in my data called "delinquency propensity", I used several items from the data asking about the frequency of using a weapon, fighting, truancy, etc. Using proc princomp, I found these items fall under one latent factor, so I want to use prin1 in the output as "delinquency propensity", the Y in my model. As I posted before, this prin1 ranged from -.77 to more than 12.
Based on this info, do you think an OLS model is a good option?
Thank you very much!
Okay, I got it backwards, you are using PCA to create Y, not predict Y.
I still don't see any reason to do this. You are creating PCA scores that may not be well predictable by your X variables. It simply doesn't make sense to do this. PLS does not have this drawback. PLS will find components of your variables that are well predictable by your X variables (predicted as well as the data will allow).
PCA simply doesn't help here.
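For anyone following along, a minimal sketch of what a PROC PLS call could look like here. It is not code from the thread: the predictor names parent_style and school_score are hypothetical placeholders for the real X variables (which were never named), and the output data set name and score prefixes are likewise assumptions.

```sas
/* Sketch only. parent_style and school_score are HYPOTHETICAL names
   standing in for the actual predictors, which the thread does not list. */
proc pls data=wave1.wave1 method=pls cv=split;
   /* Responses: the five delinquency items. Predictors: the X variables. */
   model H1DS5 H1DS8 H1DS1 H1DS2 H1DS10 = parent_style school_score;
   /* Save the X- and Y-scores; each prefix is followed by the factor number. */
   output out=work.pls_scores xscore=xscr yscore=yscr;
run;
```

CV=SPLIT asks for split-sample cross validation to help choose the number of factors; other CV= settings or a fixed NFAC= may suit your data better.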
Thank you! Will check out PLS!
Good luck!