mat_n
Obsidian | Level 7

Dear all,

I am dealing with the following problem:
My data, in counting-process style suitable for survival analysis, are high-dimensional (about 2000 variables). I would like to use principal component analysis to reduce the dimensionality. However, some variables are categorical.

My first shot would be to convert my data into a design matrix (one-hot encoding of the categorical variables) and run PROC PRINCOMP on it.
I also came across PROC PRINQUAL, whose documentation says it "performs principal component analysis (PCA) of qualitative, quantitative, or mixed data". However, its main statement seems to be TRANSFORM, which can be used to pre-process the data for a PCA in PRINCOMP, rather than performing the PCA directly in PRINQUAL. Which transformation to apply seems arbitrary to me. Is there any guideline available?
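
For illustration, the design-matrix route could be sketched as follows. This is only a sketch under assumptions: `&num_vars` and `&cat_vars` are taken to be macro variables listing the numeric and categorical predictors, `dummy_y` is a hypothetical constant response added only because the MODEL statement needs one, and the `work.*` data set names are made up. It relies on the OUTDESIGN= option of PROC GLMSELECT and on the `_GLSMOD` macro variable that the procedure sets to the names of the design columns.

```sas
/* Hypothetical helper step: add a constant response, because      */
/* PROC GLMSELECT requires a response variable in MODEL.           */
data work.for_design;
   set work.full_data;
   dummy_y = 0;
run;

/* Build the design matrix: one indicator column per level of each */
/* CLASS variable (GLM parameterization), numeric variables as is. */
/* ADDINPUTVARS copies the input variables (e.g. IDs) to output.   */
proc glmselect data=work.for_design
               outdesign(addinputvars)=work.design noprint;
   class &cat_vars.;
   model dummy_y = &num_vars. &cat_vars. / noint selection=none;
run;

/* Inspect the generated column names before using them.           */
%put Design columns: &_GLSMOD;

/* PCA on the design columns; OUTSTAT= keeps the scoring           */
/* coefficients, OUT= holds the component scores.                  */
proc princomp data=work.design out=work.pc_scores outstat=work.pc_stats;
   var &_GLSMOD;
run;
```

By default PROC PRINCOMP analyzes the correlation matrix, so every indicator column is standardized; the COV option switches to the covariance matrix. How rows with missing categorical values end up in the design data set should be checked on the actual data.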

One additional piece of information: I want to split my data into training and test samples. The principal components should be extracted from the training data only, so as not to contaminate my test data. I know this can be done either with PROC SCORE or by making use of FREQ 0 in PRINQUAL/PRINCOMP.
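
The PROC SCORE variant of this could look roughly like the sketch below. The data set names `work.design_train` and `work.design_test` and the macro variable `&design_cols` (the list of numeric columns to analyze) are hypothetical placeholders, not names from the actual project.

```sas
/* Fit the PCA on the training sample only. The OUTSTAT= data set  */
/* stores means, standard deviations and scoring coefficients.     */
proc princomp data=work.design_train outstat=work.pc_stats noprint;
   var &design_cols.;
run;

/* Apply the training-based coefficients to the test sample.       */
/* PROC SCORE standardizes with the MEAN and STD rows found in     */
/* the SCORE= data set, i.e. with the training statistics.         */
proc score data=work.design_test score=work.pc_stats out=work.test_scores;
   var &design_cols.;
run;
```

For this to work, the test data must contain exactly the same columns as the training data, which matters when a categorical level occurs in only one of the two samples.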

 

Thank you for your thoughts

 

6 REPLIES
Ksharp
Super User

Would you like to use PROC VARCLUS?

PaigeMiller
Diamond | Level 26

If you just want to do Principal Components, use the IDENTITY transformation.

--
Paige Miller
mat_n
Obsidian | Level 7

Thank you both!

@Ksharp I was thinking about this as well, and actually almost expected your suggestion :)
However, PROC VARCLUS also requires numeric variables, which was the crux in the first place. Any suggestions on how to handle this?

 

@PaigeMiller Yes, I came across this non-transformation transformation. I currently have two main issues:

  • While IDENTITY(*) keeps variables exactly as they are, the only (?) available transformation for categorical variables, OPSCORE(*), imputes missing values even when NOMISS is specified. Is there any way to suppress this behavior, or do I have to exclude these observations in advance?

 

PROC PRINQUAL DATA=full_data NOMISS OUT=prinqual_results REPLACE;
  ID cust_num date status;
  FREQ freq;
  TRANSFORM
    IDENTITY(&num_vars.)
    OPSCORE(&cat_vars.);
RUN;
  • I played around with some transformation methods and noticed that the choice fundamentally changes the number of principal components needed. MONOTONE yields a single PC that explains over 90% of the variance, whereas IDENTITY needs over 20 to explain just 85%. Of course, this is exactly the purpose of PRINQUAL, but I am lacking a theoretical explanation of when each transformation is justifiable, apart from the very general preconditions in the manual (numeric, continuous, etc.).
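
Regarding the first point, excluding incomplete observations in advance could be sketched as below. This is only one possible workaround, not a confirmed answer to whether PRINQUAL can suppress the imputation. The output name `work.complete_cases` is hypothetical; `&num_vars` and `&cat_vars` are the variable lists used in the call above.

```sas
/* Keep only observations without any missing analysis variable.   */
/* CMISS counts missing values across both numeric and character   */
/* arguments, so one call covers both variable lists.              */
data work.complete_cases;
   set work.full_data;
   if cmiss(of &num_vars. &cat_vars.) = 0;
run;
```

The filtered data set can then be passed to PROC PRINQUAL through DATA= in place of full_data.
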
PaigeMiller
Diamond | Level 26

@mat_n wrote:

 

@PaigeMiller Yes, I came across this non-transformation transformation. I currently have two main issues:

  • While IDENTITY(*) keeps variables exactly as they are, the only (?) available transformation for categorical variables, OPSCORE(*), imputes missing values even when NOMISS is specified. Is there any way to suppress this behavior, or do I have to exclude these observations in advance?

I have not actually used PROC PRINQUAL with categorical variables; however, the documentation for the IDENTITY transformation does not state that the variable must be numeric. So, have you actually tried using IDENTITY on categorical variables?

--
Paige Miller
mat_n
Obsidian | Level 7
Yes, they must be numeric:
ERROR: The IDENTITY variable var1 must be numeric.
Ksharp
Super User
Once you have the design matrix, feed it into PROC VARCLUS.
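
A minimal sketch of that step, assuming a design-matrix data set named `work.design` and a macro variable `&design_cols` listing its numeric columns (both names are hypothetical). The value 50 for MAXCLUSTERS= is an arbitrary placeholder, not a recommendation.

```sas
/* Cluster the design-matrix columns into groups of correlated     */
/* variables. MAXCLUSTERS= caps the number of clusters and SHORT   */
/* suppresses some of the longer printed tables.                   */
proc varclus data=work.design maxclusters=50 short;
   var &design_cols.;
run;
```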
