Golumn
Calcite | Level 5

Hi, 

 

I'm doing PCA/Factoring to reduce my variable count before modeling.

 

I've got quite a few character variables with differing values. Some of the values are directional, meaning if variable1 has values A-Z, A is the best and Z is the worst. I'm replacing these character values with numeric equivalents A-Z = 1-26, because based on my understanding, factoring works best with numeric values for each variable. 

 

First question:

Is this true?  

 

Second Question:

What happens when some variables are not directional in their values? For example, Gender = M, F, U, where recoding to 0, 1, 2 does not denote an increase in performance.

 

NS

1 ACCEPTED SOLUTION

PaigeMiller
Diamond | Level 26

@Golumn wrote:

Hi, 

 

I'm doing PCA/Factoring to reduce my variable count before modeling.

Don't do this. Use partial least squares (PROC PLS) instead of PCA or Factor analysis. Why? Because PCA and factor analysis don't use the y-variables, you could get factors which are not predictive of Y. PLS does produce factors that are predictive of Y.


I've got quite a few character variables with differing values. Some of the values are directional, meaning if variable1 has values A-Z, A is the best and Z is the worst. I'm replacing these character values with numeric equivalents A-Z = 1-26, because based on my understanding, factoring works best with numeric values for each variable. 

 

First question:

Is this true?  


If you believe that the difference between A through Z is linear, then numerical 1-26 works. If you don't believe it's linear, you might want some other levels.

 


Second Question:

What happens when some variables are not directional in their values? For example, Gender = M, F, U, where recoding to 0, 1, 2 does not denote an increase in performance.


As far as I know, you need to treat these as categories, either by using dummy variables or by using the CLASS statement in PROC PLS.
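To make the dummy-variable idea concrete, here is a minimal sketch in Python rather than SAS, purely as an illustration. The `dummy_code` helper and the sample data are hypothetical. Each level gets its own 0/1 indicator, so no ordering among M, F, and U is implied. The CLASS statement builds comparable indicator columns for you, though its exact internal coding may differ from this sketch.

```python
def dummy_code(values):
    """Return (levels, rows): one 0/1 indicator column per distinct level."""
    levels = sorted(set(values))
    rows = [[1 if v == level else 0 for level in levels] for v in values]
    return levels, rows

# Hypothetical sample of a non-directional variable.
gender = ["M", "F", "U", "F"]
levels, rows = dummy_code(gender)

print(levels)
for value, row in zip(gender, rows):
    print(value, row)
```

Every row contains exactly one 1, which is what distinguishes this from the 0, 1, 2 recoding that would impose an order.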

--
Paige Miller

Golumn
Calcite | Level 5

Hi, thanks for the input, for further insight, 

 

By "other levels" what do you mean? If the values are not linear in scale, then should I go the route below and create a 0/1 variable for each value?

 

By dummy variables do you mean splitting out Gender to three separate variables, each with 0/1? 

 

NS

PaigeMiller
Diamond | Level 26

By "other" I mean some other set of numbers, not the integers from 1 to 26. For example, if the effect were perfectly quadratic, you would use the numbers 1, 4, 9, ..., 676 (676 being the square of 26).
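As a small illustration of the two scorings (in Python rather than SAS, and with hypothetical variable names), the sketch below maps A-Z to linear scores 1-26 and to quadratic scores 1-676. Which scoring is appropriate depends on how you believe the effect changes across the levels.

```python
import string

letters = string.ascii_uppercase  # "A" through "Z"

# Linear scores: equal spacing between adjacent levels (A=1, ..., Z=26).
linear = {ch: i for i, ch in enumerate(letters, start=1)}

# Quadratic scores: the gaps widen toward Z (A=1, B=4, C=9, ..., Z=676).
quadratic = {ch: score * score for ch, score in linear.items()}

print(linear["A"], linear["B"], linear["Z"])
print(quadratic["A"], quadratic["B"], quadratic["Z"])
```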

 

If you use PROC PLS, you don't need to worry about creating dummy variables yourself; the procedure handles this for you via the CLASS statement.

--
Paige Miller
Golumn
Calcite | Level 5

Hi, initially I had 600+ variables. I looked at them all and dropped any with a large number of nulls. I wanted to use PCA to reduce the 600 and address multicollinearity.
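The null-screening step described above could be sketched as follows. This is Python rather than SAS, the data and column names are made up, and the 50% cutoff is an assumption, since the post only says "a large amount".

```python
# Hypothetical data: column name -> values, with None marking a missing value.
data = {
    "income": [52.0, None, 48.5, 61.0],
    "age": [34, 45, 29, 51],
    "legacy_flag": [None, None, None, 1],
}

MAX_MISSING = 0.5  # assumed cutoff, not taken from the post

def missing_fraction(values):
    """Share of entries in a column that are missing."""
    return sum(v is None for v in values) / len(values)

# Keep only the columns whose missing share is at or below the cutoff.
kept = {name: col for name, col in data.items()
        if missing_fraction(col) <= MAX_MISSING}

print(sorted(kept))
```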

 

I was under the assumption that, after general data cleaning, I'd look at bivariate analysis (cross-correlation matrix, PCA, addressing multicollinearity) and then univariate analysis (which I understand as optimizing by checking assumptions of linearity, transformation, and binning).

 

Is my intention to use PCA in this order before the modeling process proceeds an incorrect one? Forgive my ignorance. 

 

NS

PaigeMiller
Diamond | Level 26

Using PCA before modeling is a common strategy. Everyone does it, so it must be okay, except that it produces factors that do not have to be predictive. PLS does not have this drawback: its factors are chosen because they are predictive. Therefore, PLS will result in better-fitting models than PCA applied to the same data.
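A toy example may make the difference concrete. The sketch below is in Python rather than SAS and uses made-up, already-centered data: x1 has large variance but is unrelated to y, while x2 has small variance but determines y. The first principal component (found here by power iteration on X'X) follows x1 because PCA never looks at y. The first PLS weight vector for a single response is proportional to X'y, so it follows x2. This shows only the first factor of each method, not a full PCA or PLS fit.

```python
import math

# Hypothetical centered data: x1 is high-variance noise, x2 drives y.
x1 = [-10.0, 10.0, -10.0, 10.0]
x2 = [-1.0, -1.0, 1.0, 1.0]
y = [-1.0, -1.0, 1.0, 1.0]

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [c / norm for c in v]

def corr(a, b):
    # Pearson correlation; valid as written because inputs are centered.
    return dot(a, b) / math.sqrt(dot(a, a) * dot(b, b))

def scores(w):
    # Project each observation (x1, x2) onto the weight vector w.
    return [w[0] * a + w[1] * b for a, b in zip(x1, x2)]

# PCA: leading eigenvector of X'X via power iteration (y is never used).
xtx = [[dot(x1, x1), dot(x1, x2)], [dot(x2, x1), dot(x2, x2)]]
pc1 = normalize([1.0, 1.0])
for _ in range(100):
    pc1 = normalize([dot(row, pc1) for row in xtx])

# PLS: the first weight vector is proportional to X'y (y is used directly).
pls1 = normalize([dot(x1, y), dot(x2, y)])

print("PC1 weights:", [round(c, 3) for c in pc1],
      "corr with y:", round(corr(scores(pc1), y), 3))
print("PLS1 weights:", [round(c, 3) for c in pls1],
      "corr with y:", round(corr(scores(pls1), y), 3))
```

In this contrived case the first PCA score is uncorrelated with y and the first PLS score is perfectly correlated with it. Real data will sit somewhere in between.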

--
Paige Miller
PGStats
Opal | Level 21

If your intent is to develop a linear model with normal errors, you should try using proc glmselect to select the most promising variables.

 

Otherwise, if you suspect that some explanatory variables have a highly nonlinear effect (such as threshold or non-monotonic effects), you should start with regression tree analysis (proc split or proc hpsplit) to identify important regressors.
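To show why a tree picks up a threshold effect that a straight line would miss, here is a minimal single-split sketch in Python. The `best_split` helper and the data are hypothetical, and this is only the basic idea behind one split of a regression tree, not what proc split or hpsplit actually implement.

```python
def best_split(x, y):
    """Return (threshold, sse) for the single split on x that minimizes
    the total squared error of y around each side's mean."""
    def sse(vals):
        if not vals:
            return 0.0
        mean = sum(vals) / len(vals)
        return sum((v - mean) ** 2 for v in vals)

    best = None
    xs = sorted(set(x))
    for lo, hi in zip(xs, xs[1:]):
        t = (lo + hi) / 2  # candidate threshold midway between neighbors
        left = [yi for xi, yi in zip(x, y) if xi <= t]
        right = [yi for xi, yi in zip(x, y) if xi > t]
        total = sse(left) + sse(right)
        if best is None or total < best[1]:
            best = (t, total)
    return best

# Hypothetical threshold effect: y jumps once x exceeds 3.
x = [1, 2, 3, 4, 5, 6]
y = [0.0, 0.0, 0.0, 5.0, 5.0, 5.0]
print(best_split(x, y))
```

A variable whose best split removes a large share of the error is a candidate important regressor, even if its linear correlation with the response is modest.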

PG
Golumn
Calcite | Level 5

Again, forgive my ignorance. I was just reading up on proc glm and wanted to clarify: I've previously done modeling in SAS Miner, but have access to SAS EG at the moment.

 

It looks like proc glm immediately launches into model calculation, optimization, and comparison, based on any number of desired specifics. In Miner, I'd go through initial analysis before linear and non-linear model construction: partitioning into test/validation sets, looking at transformation and PCA, then using decision trees, neural networks, regressions, etc., depending on the level of non-linearity.

 

I had a certain order of operations in mind for the modeling process, and I'd like to know both whether it is incorrect (as I've basically learned in industry) and how it maps to using procedures in SAS EG.

 

If there is a more comprehensive source that walks through the general process in EG, that would also be great. Any feedback is appreciated.

 

NS
