Underlying assumptions for PCA using proc factor method=prin

maya12 · Posted 07-21-2016 05:23 PM

Hi,

I was wondering what the underlying assumption is when using proc factor method=prin priors=1 (principal components analysis).

Is the analysis run on the raw data, a correlation matrix, or a covariance matrix? I'm almost positive it's not a covariance matrix, since that requires an extra data step (to confirm you want a covariance matrix). So if you don't specify anything, is the analysis run on the raw data or on a correlation matrix? And if these are different, how do you choose which one is better? If you specify corr in the data step, this only shows you the correlation matrix in the output; it doesn't actually base the PCA on a correlation matrix, so how do you do this?

If I have different Likert scales for each question (ranges 0-6, 0-12, etc.), what is the best matrix to use?

Rick_SAS · Posted 07-22-2016 08:33 AM

Hi,

Welcome to SAS and the Support Community.

I recommend looking at the PROC PRINCOMP documentation if you are running a PCA. The first sentence of the PROC FACTOR doc says "See Chapter ... for a discussion of principal component analysis." Consequently, most of the PCA doc is in the PRINCOMP chapter. In fact, unless you are doing factor rotations, I recommend using PROC PRINCOMP for PCA.

When you use the METHOD=PRIN option, PROC FACTOR computes the same principal component analysis as PROC PRINCOMP. So, yes, you are correct that the default analysis uses the correlation matrix.

I don't understand what you mean by "this requires an extra data step (to confirm you want a covariance matrix)". In most cases, analysts use the DATA= option to specify the raw data set. The procedure then automatically computes the correlation matrix and uses that matrix for the factor or PC analysis. However, this page of the doc explains that you can also specify pre-summarized data. There is no "data step" involved, but let me know if the doc does not adequately explain the difference between reading raw data and reading a TYPE=CORR data set. The important thing is that BOTH methods analyze the correlation matrix.

For variables that have different scales, I would suggest using the CORR matrix, because that is equivalent to standardizing the variables before using them. (The covariance matrix for standardized variables equals the correlation matrix of unstandardized variables.)
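The equivalence in the parenthetical can be checked with a small sketch. The four variables from sashelp.cars and the data set name CarsStd are assumptions chosen only for illustration; the two PROC FACTOR calls should report the same eigenvalues.

```sas
/* Standardize the variables to mean 0 and standard deviation 1. */
/* CarsStd is a hypothetical output data set name.               */
proc standard data=sashelp.cars mean=0 std=1 out=CarsStd;
   var MPG_City MPG_Highway Weight Horsepower;
run;

/* PCA of the COVARIANCE matrix of the standardized variables */
proc factor data=CarsStd method=prin cov;
   var MPG_City MPG_Highway Weight Horsepower;
run;

/* PCA of the CORRELATION matrix of the original variables (the default) */
proc factor data=sashelp.cars method=prin;
   var MPG_City MPG_Highway Weight Horsepower;
run;
```

Running the second call with the COV option on the unstandardized data instead would let the variables with the largest variances (here, Weight) dominate the first component.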

maya12 · Posted 07-22-2016 11:51 AM

Hi,

Thanks for answering my question. I will be doing a rotation (varimax), so I think I'll stay with proc factor instead of proc princomp. Also, I need to create factor scores after I get the results of the PCA. Can I get factor scores if I use the cov matrix?

As for the link to the document, thank you for showing that. So one reason to pre-specify type=corr is if you have a lot of observations and will be running factor analysis a lot. Otherwise, if you don't specify type, it just takes more time to run but it's still using a correlation matrix?

I'm asking this because I'm using "A step by step approach to using SAS for factor analysis and structural equation modeling" by Hatcher, and it says "data may be input in the form of raw data, a correlation matrix or a covariance matrix". It shows examples of how to run on raw data (no extra data step is required) and then on a correlation matrix (where you need to specify data d1 type=corr and then input the correlation matrix). Is the only difference here what info you start off with? E.g., if I input raw data, there is no extra data step, but if I start with input data that's not raw (that's a correlation matrix), this would need to be specified first as an extra step?

So really what I'm saying is that if I'm not worried about the time it takes to run, and I have raw data to input, there is no need to specify type=corr, and the underlying assumption is that I'm relying on a correlation matrix?

If the data is put in 'raw form', as Hatcher says, does this mean it's accounting for the difference in variance on each Likert scale question? E.g., suppose I believe it's important to take into account that some questions on the Likert scale go to 6 and another goes to 12, so the question that goes to 12 should have more weight than the question that goes to 6.

Thanks again for your help.

Rick_SAS · Posted 07-22-2016 02:04 PM

It has been many years since I thought carefully about this. However, I think the correct answers are:

1. Yes, you can use PROC SCORE to get the factor scores. It shouldn't matter whether you use CORR or COV as the starting point. The doc for PROC SCORE has an example.

2. Yes, if you have tens of millions of observations and intend to run PROC FACTOR many times (maybe trying to find good rotations), then starting with the COV or CORR matrix will be faster. However, computing these matrices is so fast that if you have a moderately sized data set, you probably won't notice the speed difference.

3. I don't have Hatcher's book, but it sounds like his example uses a DATA step to set the TYPE= attribute of the data set. This is not necessary. If the data set was created by PROC CORR, for example, then it already has the TYPE=CORR attribute. If it came from some other source (e.g., you imported an Excel file), you can just put the TYPE=CORR attribute on the DATA= option at run time. Study the following examples carefully. Both calls to PROC FACTOR are equivalent. Neither requires a DATA step.

/* create C, which is a TYPE=CORR data set */
proc corr data=sashelp.cars NOMISS outp=C;
run;

proc factor data=C;  /* the procedure knows this is TYPE=CORR */
run;

proc factor data=C(type=corr); /* or you can specify it here */
run;

The previous code is also equivalent to reading the raw data directly, in which case PROC FACTOR will create the correlation matrix on the fly:

proc factor data=sashelp.cars;
run;

4. (second-to-last paragraph) Yes, if you are not worried about the time it takes to run, and you have raw data to input, the default behavior (as above) is to use the correlation matrix.

5. If you want certain variables to have more influence than others, use the COV option to analyze the covariance matrix instead of the correlation matrix:

proc factor data=sashelp.cars cov;
run;
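For point 1, a minimal sketch of the usual two-step pattern for rotated factor scores: PROC FACTOR writes the scoring coefficients to an OUTSTAT= data set, and PROC SCORE applies them to the raw data. The variable list, NFACTORS=2, and the data set names FactStat and Scores are assumptions for illustration, not values from your analysis.

```sas
/* Step 1: principal component extraction with varimax rotation.      */
/* SCORE requests scoring coefficients; OUTSTAT= saves them together  */
/* with the means and standard deviations used for standardizing.     */
proc factor data=sashelp.cars method=prin priors=one
            nfactors=2 rotate=varimax
            score outstat=FactStat;
   var MPG_City MPG_Highway Weight Horsepower;
run;

/* Step 2: apply the scoring coefficients to the raw observations.    */
/* The output data set Scores contains the new variables Factor1 and  */
/* Factor2 alongside the original variables.                          */
proc score data=sashelp.cars score=FactStat out=Scores;
   var MPG_City MPG_Highway Weight Horsepower;
run;
```

The same VAR list appears in both steps so that PROC SCORE can match the raw variables to the coefficients.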

In the future, when you ask more than two questions in a single response, please number your questions so it is easier to associate answers to questions.

To learn more about the TYPE=CORR and TYPE=COV data sets, I highly recommend that you read the section "Special Data Sets" in the SAS/STAT documentation. Click on the two relevant links.
