k_shide
Obsidian | Level 7

Dear SAS Gurus,

 

 

I have a data set with 475 variables. I want to compress them into, say, 10-20 variables using principal components without losing too much information, and then use the resulting data in PROC LOGISTIC for modelling.

 

Could you please show me some sample code to do this easily? Any existing quick sample is fine as a starting point.

(If needed, I will look into the detailed options afterward!)

 

proc princomp data=model out=.......

then 

proc logistic data=model_after_princom........

 

 

I know I am being lazy, and I know that reading a couple of manuals plus some trial and error would lead me to the solution, but please, could some of you Gurus help me save some time?

Any samples will help!!!

 

Thanks Gurus all the time,

Kaz

 

1 ACCEPTED SOLUTION

Accepted Solutions
Ksharp
Super User
If you still stick with principal component analysis, try PROC VARCLUS


Doc_Duke
Rhodochrosite | Level 12

The examples in the SAS manuals cover the syntax; just google

princomp examples site:sas.com

.

 

You will need to study the PRINCOMP results to determine the number of eigenvalues (components) to use; it's not an automatic feed.

 

PRINCOMP discards records with missing data.  As a rule of thumb you need about 5 complete observations per variable, so roughly 2,400 here (5 x 475 = 2,375), to get components you can rely on.  You may be able to use PROC MI to get better estimates, but that increases the work significantly.  Google

princomp sample size site:sas.com

for a more complete discussion.
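
For reference, a minimal sketch of the two-step workflow the question asks about. The data set name MODEL comes from the question; the variable names X1-X475 and Y, the output data set MODEL_PC, and the Z prefix are assumptions for illustration, and Y is assumed to be coded 0/1. The number of components kept (10 here) is a placeholder that should be chosen from the eigenvalue table, not a recommendation.

```sas
/* Sketch only: X1-X475, Y, MODEL_PC and the Z prefix are assumed names. */

/* Step 1: compute the components and write scores Z1-Z20 to MODEL_PC.
   Inspect the Eigenvalues table (proportion and cumulative proportion)
   to decide how many components are worth keeping. */
ods select Eigenvalues;
proc princomp data=model out=model_pc n=20 prefix=Z;
   var x1-x475;
run;

/* Step 2: fit the logistic model on the first 10 components.
   Observations with any missing X get missing scores in MODEL_PC,
   so they drop out of this fit as well. */
proc logistic data=model_pc;
   model y(event='1') = Z1-Z10;
run;
```

Because PRINCOMP works on the correlation matrix by default, the X variables are standardized before the components are formed; add the COV option only if the original scales should drive the result.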

Ksharp
Super User

If you already have a Y variable, why not use PROC HPGENSELECT to pick out the significant variables?

PROC PRINCOMP is meant for scenarios with no Y variable.
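
A minimal sketch of what that suggestion could look like, assuming a 0/1 response named Y and predictors named X1-X475 in the data set MODEL (the variable names are assumptions, not from the original post). Stepwise selection is just one of the available methods.

```sas
/* Sketch only: Y and X1-X475 are assumed variable names. */
proc hpgenselect data=model;
   /* Binary response, modelling the probability that Y = 1 */
   model y(event='1') = x1-x475 / dist=binary;
   /* Let the procedure add and remove effects one at a time */
   selection method=stepwise;
run;
```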

PaigeMiller
Diamond | Level 26

@k_shide wrote:

Dear SAS Gurus,

 

 

I have a data with 475 variables. I want to compress the number of variables say 10-20 variables by 

principal component without losing too much information then the data is used to proc logistic for modelling.

 


 


First, I think the idea of picking 10-20 variables out of 475 is not the best thing to do. Furthermore, PROC PRINCOMP (Principal Components analysis) doesn't really let you get 10-20 original variables, it gives you a smaller number of new variables which are linear combinations of the original variables. In other words, all 475 of the original variables are still used. I don't know how to use Principal Components Analysis to reduce the original variables down to 10-20.

 

The alternative that I recommend to Principal Components in this situation is Partial Least Squares analysis (PROC PLS), which has a major advantage over Principal Components. PLS finds dimensions that are predictive of Y, whereas PCA does not. To me, that seems like such a major improvement over PCA that I would not use PCA here. All of your original variables remain in the PLS model, and those which have very little effect on the Y variable will have loadings close to zero.

 

If you want to do a logistic regression version of PLS, you would use PROC PLS with a binary or multinomial response variable (which would be represented by 0/1 dummy variables). No need for PROC LOGISTIC in this case. A more mathematically correct alternative is given here but I do not know of any SAS code that implements this.
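
A minimal sketch of the PROC PLS approach described above, assuming a numeric 0/1 response named Y and predictors X1-X475 in the data set MODEL. The output data set MODEL_PLS, the score prefix Z, the variable Y_HAT and the choice of 10 factors are all assumptions for illustration.

```sas
/* Sketch only: Y, X1-X475, MODEL_PLS, Z and Y_HAT are assumed names. */
proc pls data=model method=pls nfac=10;
   /* Y must be numeric here; a binary outcome is coded 0/1 */
   model y = x1-x475;
   /* XSCORE=Z writes the extracted factors as Z1-Z10;
      PREDICTED= writes the fitted value for Y */
   output out=model_pls xscore=Z predicted=y_hat;
run;
```

Note that the fitted values come from a linear model, so they are not restricted to the range 0 to 1. PROC PLS also has a CV= option for choosing the number of factors by cross-validation instead of fixing it in advance.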

 

I also would not use PROC HPGENSELECT because many statisticians have decided that they do not trust forward, backward and stepwise selection methods.

 

 

--
Paige Miller
Ksharp
Super User

@PaigeMiller

I think the OP may mean PROC VARCLUS.

 

"I also would not use PROC HPGENSELECT because many statisticians have decided that they do not trust forward, backward and stepwise selection methods"

Can you give me more detailed information? If those methods are not to be trusted, why would SAS not state this in the documentation?

Another alternative is to use PROC LOGISTIC with selection=stepwise.
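
A minimal sketch of that alternative, assuming a 0/1 response named Y and predictors X1-X475 in the data set MODEL (assumed names). The entry and stay significance levels shown are illustrative values only.

```sas
/* Sketch only: Y and X1-X475 are assumed variable names. */
proc logistic data=model;
   model y(event='1') = x1-x475
         / selection=stepwise
           slentry=0.05   /* significance level to enter the model */
           slstay=0.05;   /* significance level to stay in the model */
run;
```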

PaigeMiller
Diamond | Level 26

@Ksharp:  

You can Google "problems with stepwise regression" to find many articles on this matter.

 

Why would SAS do this? I don't know, other than to say there are many methods in SAS (and other statistical packages) that were developed a long time ago, newer procedures are now available that have superior properties, but these old procedures remain in SAS (and other statistical packages) ... possibly because some people still want these older methods and don't know about the newer methods. That is all pure speculation on my part.

--
Paige Miller
Ksharp
Super User

Sorry, Google is blocked by the Chinese government.

 

PROC HPGENSELECT is quite a new procedure, and the only one that can select variables under many different distributions, such as binomial, Poisson, ...

 

 

k_shide
Obsidian | Level 7

Everyone 

 

Thanks very much.

Maybe some of you misunderstood so let me clarify.

I am currently in the initial phase of developing models, and I do not need academic statistical proof at this stage.

I have tried tools ranging from TensorFlow in Python to MATLAB modules, including my old friend SAS, to see what the potential upside of the different methods is.

So what I needed was a quick and dirty way to check for any potential upside from using principal components to summarise the variables into fewer dimensions without losing too much information. I mainly used SAS in the early 2000s, when %treedesc was the cutting-edge CHAID modelling module, and I have lost all the macros I wrote to automate this type of thing. Moreover, some of you suggested newer procedures, which is good.

 

Some of you also pointed out that there is no easy automated way in PRINCOMP, which helps me avoid wasting time.

 

I just want to thank everyone for the valuable information, and I am determined to read through all the SAS documentation to achieve what I was intending to do.

 

Kaz

 

PaigeMiller
Diamond | Level 26

@k_shide wrote:

So what I needed was a quick dirty solution for any potential upside by using principal components to summarise the variables into less dimensions without losing too much information.


But this is different than your original question in which you said:

 

I want to compress the number of variables say 10-20 variables

These small differences in wording make a huge difference. Your original question seems to imply that you want to select 10-20 of your original 475 variables, which Principal Components will not do.

 

Of course, Principal Components will give you fewer DIMENSIONS than your original 475 variables. It will not give you 10-20 of your original variables.

 

So which is it? Do you want fewer of your original variables to work with? Or do you want 10-20 DIMENSIONS?

 

Anyway, as I explained above, PCA is not the right tool to select fewer DIMENSIONS. PLS is the tool, because it finds DIMENSIONS that are predictive of your Y variable, while PCA does not.

--
Paige Miller
Ksharp
Super User
If you still stick with principal component analysis, try PROC VARCLUS
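
A minimal sketch of that suggestion, assuming predictors named X1-X475 in the data set MODEL. The variable names, the output data set VC_RSQ and the limit of 10 clusters are assumptions for illustration.

```sas
/* Sketch only: X1-X475 and VC_RSQ are assumed names. */

/* Capture the R-square table so it can be inspected or filtered later */
ods output RSquare=vc_rsq;

proc varclus data=model maxclusters=10 short;
   var x1-x475;
run;

/* A common follow-up is to keep, from each cluster, the variable
   with the lowest 1-R**2 ratio as that cluster's representative. */
```

Unlike PRINCOMP, this groups the original variables into clusters, so it can be used to pick a subset of the original X variables.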


k_shide
Obsidian | Level 7

Thanks, Ksharp. I remember you have already helped me a couple of times. Everyone helped me, but your simple solution rings a bell as an old SAS user's method.

 

k_shide
Obsidian | Level 7

Hi Everyone,

 

Thanks again for the advice.

I am still making myself misunderstood, so let me clarify again.

 

What I want is to turn X1-X475 into, say, Z1-Z10, where each Z is a linear combination of the original 475 variables and Z1 to Z10 are all at right angles to each other in a 10-dimensional space.

 

Some of you have found that selection=stepwise takes a really long time to converge. It may be easier for SAS to use Z1-Z10, which are still linear combinations of the 475 variables, but there are only 10 variables in the calculations.
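
To illustrate the "right angles" property, here is a small self-contained sketch on simulated data. The data set DEMO, its five X variables and the seed are invented for the example and have nothing to do with the real 475-variable data.

```sas
/* Simulated data, for illustration only */
data demo;
   call streaminit(2024);
   do id = 1 to 500;
      f1 = rand('normal');
      f2 = rand('normal');
      /* Five correlated X variables built from two hidden factors */
      x1 = f1 + 0.3*rand('normal');
      x2 = f1 + 0.3*rand('normal');
      x3 = f2 + 0.3*rand('normal');
      x4 = f2 + 0.3*rand('normal');
      x5 = f1 - f2 + 0.3*rand('normal');
      output;
   end;
   drop f1 f2;
run;

/* Keep the first two principal component scores as Z1 and Z2 */
proc princomp data=demo out=demo_pc n=2 prefix=Z noprint;
   var x1-x5;
run;

/* The correlation between Z1 and Z2 should be essentially zero */
proc corr data=demo_pc nosimple;
   var Z1 Z2;
run;
```

Each Z is a linear combination of all five X variables, and the scores are uncorrelated with each other, which is the property asked for above.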

 

 

Again, I would appreciate any advice from everyone.

 

Kaz

 

 

 

PaigeMiller
Diamond | Level 26

 

What I want to have is X1-X475 into say Z1-Z10 which is linear combination of original 475 variables but Z1 to Z10 are all right angles each other in terms of 10 dimensional space.

 

Some of you found selection=stepwise takes really long time to converge and for SAS it may be easier by using Z1-Z10 which is still linear combination of 475 variables but as # of variables it's only 10 for the calculations.


@k_shide Thank you, this is certainly clear now.

 

I stand by my recommendation to perform Partial Least Squares (PROC PLS) regression on this data, and I stand by my recommendation to NOT use Principal Components Analysis (PROC PRINCOMP); and I add a recommendation to NOT use PROC VARCLUS.

 

PROC PLS will indeed give you new dimensions Z1-Z10 (or however many you want, it doesn't have to be 10) that are predictive of your response. That's what PLS does, it finds orthogonal dimensions that are linear combinations of your original 475 X variables, that have predictive power. Neither PRINCOMP nor VARCLUS tries to find results that are predictive of your response, the algorithms used do not care or know about the Y variables; and so PRINCOMP and VARCLUS can easily produce new dimensions Z1-Z10 that are not very predictive of your response.

--
Paige Miller
k_shide
Obsidian | Level 7
Dear PaigeMiller, thank you for the confirmation.
I will look at PLS for this purpose.

Kaz
