Dear all,
I am dealing with the following problem:
My data, in counting-process format suitable for survival analysis, are high-dimensional (about 2000 variables). I would like to use principal component analysis to reduce the dimensionality. However, some variables are categorical.
My first attempt would be to convert my data into a design matrix (one-hot encoding of the categorical variables) and run PROC PRINCOMP on it.
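A rough sketch of that first approach, using PROC GLMMOD to build the design matrix. This is only an illustration under assumptions: the macro variables &cat_vars. and &num_vars. hold the categorical and numeric variable lists, status is numeric and serves purely as a placeholder response (GLMMOD requires one in the MODEL statement), and the data set names design, design_map and pc_scores are made up.

```sas
/* Sketch only: one-hot (GLM-style) design matrix, then PCA on it.     */
/* Assumed: &cat_vars. and &num_vars. hold the variable lists, and     */
/* status is numeric; it is only a placeholder response for MODEL.     */
proc glmmod data=full_data outdesign=design outparm=design_map noprint;
   class &cat_vars.;
   model status = &cat_vars. &num_vars. / noint;  /* no intercept column */
run;

/* Design columns are named Col1, Col2, ...; design_map (OUTPARM=)     */
/* records which original effect each column belongs to.               */
proc princomp data=design out=pc_scores;
   var col:;
run;
```

Because every categorical variable contributes a full set of indicator columns, those columns are linearly dependent, so the last few components will have eigenvalues of zero.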
I came across PROC PRINQUAL, whose documentation says that it "performs principal component analysis (PCA) of qualitative, quantitative, or mixed data". However, its main statement seems to be TRANSFORM, which can be used to pre-process the data for a PCA in PRINCOMP, rather than performing PCA directly in PRINQUAL. The choice of transformation seems arbitrary to me. Are there any guidelines available?
One more piece of information: I want to split my data into training and test samples. The principal components should be extracted from the training data only, so that the test data are not contaminated. I know this can be done either with PROC SCORE or by setting the FREQ variable to 0 for the test observations in PRINQUAL/PRINCOMP.
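For reference, the PROC SCORE route could look roughly like the following. It is a minimal sketch, not a tested solution: the data set names train, test, pc_stats and test_scores are hypothetical, &num_vars. is assumed to hold the list of numeric analysis variables, and N=10 is an arbitrary number of components.

```sas
/* Sketch only: fit the components on the training rows.               */
/* OUTSTAT= stores means, standard deviations and scoring coefficients.*/
proc princomp data=train outstat=pc_stats n=10 noprint;
   var &num_vars.;
run;

/* Apply the training-based coefficients to the held-out rows.         */
/* PROC SCORE standardizes with the means and standard deviations      */
/* stored in pc_stats, i.e. those of the training data.                */
proc score data=test score=pc_stats out=test_scores;
   var &num_vars.;
run;
```

The test observations never enter the eigen-decomposition here; they are only projected onto the components estimated from the training sample.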
Thank you for your thoughts
Have you considered using PROC VARCLUS?
If you just want to do Principal Components, use the IDENTITY transformation.
Thank you both!
@Ksharp I was thinking about this as well, and actually almost expected your suggestion.
However, PROC VARCLUS also requires numeric variables, which was the crux in the first place. Any suggestions on how to handle this?
@PaigeMiller Yes, I came across this non-transformation transformation. There are two main issues I currently have:
PROC PRINQUAL DATA=full_data NOMISS OUT=prinqual_results REPLACE;
   ID cust_num date status;
   FREQ freq;
   TRANSFORM
      IDENTITY(&num_vars.)
      OPSCORE(&cat_vars.);
RUN;
@mat_n wrote:
@PaigeMiller Yes, I came across this non-transformation transformation. There are two main issues I currently have:
- While IDENTITY(*) keeps the variables exactly as they are, the only (?) available transformation for categorical variables, OPSCORE(*), imputes missing values even when NOMISS is specified. Is there any way to suppress this, or do I have to exclude these observations in advance?
I have not actually used PROC PRINQUAL with categorical variables; however, the documentation for the IDENTITY transformation does not state that the variable must be numeric. So, have you actually tried using IDENTITY on the categorical variables?