Dear Sir,
I have a few questions regarding principal component analysis in Enterprise Miner. Below is my data process flow:
The transformation node converts categorical data to dummy variables, since principal components only allow numerical values. I have tested 2 types of principal component nodes. The classification algorithms that I plan to use are decision tree and logistic regression.
The settings for the principal component nodes are below:
Principal component node setting
HP principal component node setting
The result for the nodes:
We select the components whose eigenvalue is more than 1. In this case, there are 42 such components, but the selected number of components is 20. My first question: does setting Apply maximum number to Yes under the Max Number cutoff section of the properties limit the number of components to 20 even though the actual number is 42?
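To make the first question concrete, here is a minimal Python sketch of how an eigenvalue-greater-than-1 rule could interact with a maximum-number cap. It is only an illustration of the logic being asked about, not Enterprise Miner's actual implementation; the function name `select_components`, its parameters, and the eigenvalues are all made up for the example.

```python
def select_components(eigenvalues, threshold=1.0, max_components=None):
    """Return indices of components kept by an eigenvalue rule plus an optional cap.

    Hypothetical helper for illustration; not an Enterprise Miner API.
    """
    # Order component indices by eigenvalue, largest first.
    order = sorted(range(len(eigenvalues)),
                   key=lambda i: eigenvalues[i], reverse=True)
    # Eigenvalue rule: keep components whose eigenvalue exceeds the threshold.
    kept = [i for i in order if eigenvalues[i] > threshold]
    # Optional cap: keep at most max_components of those.
    if max_components is not None:
        kept = kept[:max_components]
    return kept


# Made-up eigenvalues: 42 values above 1 followed by 8 values below 1.
eigenvalues = ([5.0 - 0.09 * i for i in range(42)]
               + [0.9 - 0.1 * i for i in range(8)])

without_cap = select_components(eigenvalues)
with_cap = select_components(eigenvalues, max_components=20)

print(len(without_cap))  # 42 components pass the eigenvalue rule
print(len(with_cap))     # 20 remain once the cap is applied
```

Under this assumed logic, the cap wins whenever more components pass the eigenvalue rule than the maximum allows, which would match the 42 versus 20 seen in the results.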
My second question is when the principal component node and when the HP principal component node should be used for dimension reduction.
My last question is whether the Variable selection node can be used to replace the principal component node for dimension reduction.
Can anyone explain more on this issue?
Thank you in advance.
Regards,
Potiu
I can't really explain the difference ... but you are doing a lot of work to FORCE your data into the form needed for principal (not "principle") components, specifically continuous variables, and my first thought is to not do this. The results of principal components could be highly dependent on how you perform this transformation from categorical variables to continuous variables. There may be some better way of handling the non-continuous variables. But since you didn't really say much about your data, it's hard to say.
Next, since you have a regression node, and I'm assuming that the output of principal components will be fed into the regression node ... DON'T DO THIS. Principal components does not look at whether the components it selects are actually good predictors in the regression; it only looks at the variation among the input variables. So principal components could miss the variables that are good predictors in the regression. What should you do? Partial least squares (or PLS) regression! This picks combinations of variables that are good predictors, and as an extra added bonus, it has no trouble at all handling categorical variables as categorical. So it's a lot simpler to do: there's no transformation of variables and no prior selection of variables needed. PLS handles all of this.
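A small simulation can illustrate the point about principal components missing good predictors. The sketch below is in Python with NumPy rather than SAS, uses made-up data, and computes only the first PLS weight vector (proportional to X'y on standardized inputs) instead of running a full PLS regression, so treat it as an illustration under those assumptions and not as what Enterprise Miner does internally.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Made-up data: x1 and x2 share a common factor (lots of joint variance),
# while x3 is independent of them and is the only variable that drives y.
f = rng.normal(size=n)
x1 = f + 0.3 * rng.normal(size=n)
x2 = f + 0.3 * rng.normal(size=n)
x3 = rng.normal(size=n)
y = 2.0 * x3 + 0.5 * rng.normal(size=n)

X = np.column_stack([x1, x2, x3])
Xs = (X - X.mean(axis=0)) / X.std(axis=0)  # standardize the inputs
yc = y - y.mean()                          # center the target

# PCA: first principal direction = top eigenvector of the correlation matrix.
corr = np.corrcoef(Xs, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(corr)
pca_dir = eigvecs[:, np.argmax(eigvals)]

# PLS: first weight vector is proportional to X'y, so it uses the target.
w = Xs.T @ yc
pls_dir = w / np.linalg.norm(w)


def abs_corr(a, b):
    """Absolute Pearson correlation between two 1-D arrays."""
    return abs(np.corrcoef(a, b)[0, 1])


pca_score_corr = abs_corr(Xs @ pca_dir, yc)
pls_score_corr = abs_corr(Xs @ pls_dir, yc)

print("first PCA component vs y:", round(pca_score_corr, 3))
print("first PLS component vs y:", round(pls_score_corr, 3))
```

In this setup the first principal component follows the shared variation in x1 and x2 and is close to uncorrelated with y, while the first PLS component loads mainly on x3 and tracks y closely. Real data will rarely be this clean, but the mechanism is the same.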