475 variables to 10-20 variables by proc princomp ...

04-13-2017 09:36 PM

Dear SAS Gurus,

I have a data set with 475 variables. I want to compress the number of variables to, say, 10-20 variables by principal components without losing too much information; the data will then be used in PROC LOGISTIC for modelling.

Could you please show me some sample code to do this easily? Any existing quick sample is fine to start with.

(If required, I will look into the more detailed options afterward!)

proc princomp data=model out=.......

then

proc logistic data=model_after_princom........

I know I am lazy, and I know that reading a couple of manuals plus some trial and error would lead me to the solution, but could some of you Gurus please help me save some time?

Any samples will help!

Thanks as always, Gurus,

Kaz

Accepted Solutions

Solution

04-17-2017
04:24 AM

Posted in reply to k_shide

04-16-2017 06:00 AM

If you still want to stick with principal component analysis, try PROC VARCLUS.
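
A minimal sketch of what this suggestion could look like, assuming the data set is called model (as in the question) and the inputs are numeric variables named x1-x475 (hypothetical names). PROC VARCLUS groups the variables into clusters, by default using principal components as the cluster components. One common follow-up is to keep, from each cluster, the variable with the lowest 1-R**2 ratio in the printed output.

```sas
/* Sketch only: the data set name and variable names are assumptions. */
proc varclus data=model maxclusters=20 short;
   var x1-x475;   /* the 475 candidate inputs */
run;
```

The value 20 is only the upper end of the 10-20 range mentioned in the question; SHORT just trims the printed output.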

All Replies

Posted in reply to k_shide

04-14-2017 09:44 AM

The examples in the SAS manuals cover the syntax; just google

princomp examples site:sas.com

You will need to study the PRINCOMP results to determine the number of eigenvalues (components) to keep; it's not an automatic feed.
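
One way to do that inspection, as a rough sketch: request a scree plot and capture the eigenvalue table so the cumulative proportion of variance can be read off. The data set name model comes from the question; the variable names x1-x475 and the output name eig are assumptions.

```sas
/* Sketch only: names other than "model" are hypothetical. */
ods graphics on;
ods output Eigenvalues=eig;        /* save the eigenvalue table */

proc princomp data=model plots=scree;
   var x1-x475;
run;

/* Look at the first 20 components: eigenvalue, proportion, cumulative */
proc print data=eig(obs=20);
run;
```

How many components to keep is still a judgment call based on this table and the plot.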

PRINCOMP discards records with missing data. You will need roughly 2,400 complete observations to get components you can rely on (5 x # variables = 5 x 475 = 2,375). You may be able to use PROC MI to get better estimates, but that increases the work significantly. Google

princomp sample size site:sas.com

for a more complete discussion.
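
A quick way to check how many complete observations are available is sketched below. It assumes all 475 inputs are numeric and named x1-x475 (hypothetical names) in the data set model from the question.

```sas
/* Sketch only: counts rows with no missing value in any of x1-x475. */
data _null_;
   set model end=last;
   if nmiss(of x1-x475) = 0 then n_complete + 1;   /* sum statement, starts at 0 */
   if last then put "Complete observations: " n_complete;
run;
```

The count is written to the SAS log and can be compared with the 5 x 475 rule of thumb above.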

Posted in reply to k_shide

04-14-2017 09:50 AM

If you already have a Y variable, why not use PROC HPGENSELECT to pick out the significant variables?

PROC PRINCOMP is meant for the scenario with no Y variable.
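
A minimal sketch of this suggestion, assuming a binary response named y coded 0/1 and inputs named x1-x475 (all hypothetical names) in the data set model:

```sas
/* Sketch only: y and x1-x475 are assumed names. */
proc hpgenselect data=model;
   model y(event='1') = x1-x475 / dist=binary;
   selection method=stepwise;
run;
```

The selection method shown is only one of the available choices.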

Posted in reply to k_shide

04-14-2017 10:09 AM - edited 04-14-2017 10:11 AM

First, I think the idea of picking 10-20 variables out of 475 is not the best thing to do. Furthermore, PROC PRINCOMP (Principal Components analysis) doesn't really let you get 10-20 original variables, it gives you a smaller number of new variables which are linear combinations of the original variables. In other words, all 475 of the original variables are still used. I don't know how to use Principal Components Analysis to reduce the original variables down to 10-20.

The alternative that I recommend to Principal Components in this situation is Partial Least Squares analysis (PROC PLS), which has a major advantage over Principal Components. PLS finds dimensions that are predictive of Y, whereas PCA does not. To me, that seems like such a major improvement over PCA that I would not use PCA here. All of your original variables remain in the PLS model, and those which have very little effect on the Y variable will have loadings close to zero.

If you want to do a logistic regression version of PLS, you would use PROC PLS with a binary or multinomial response variable (which would be represented by 0/1 dummy variables). No need for PROC LOGISTIC in this case. A more mathematically correct alternative is given here but I do not know of any SAS code that implements this.
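
As a rough sketch of the PROC PLS approach with a binary response, assuming y is a numeric 0/1 variable and the inputs are x1-x475 (hypothetical names) in the data set model. Cross validation is added here as one way to pick the number of factors; the split size and the seed are arbitrary choices.

```sas
/* Sketch only: names, split size and seed are assumptions. */
proc pls data=model method=pls nfac=20 cv=split(7) cvtest(seed=12345);
   model y = x1-x475;   /* y is a 0/1 dummy variable */
run;
```

NFAC=20 sets the maximum number of factors considered; with CV and CVTEST the procedure reports which smaller number of factors it judges adequate.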

I also would not use PROC HPGENSELECT because many statisticians have decided that they do not trust forward, backward and stepwise selection methods.

--

Paige Miller

Posted in reply to PaigeMiller

04-14-2017 10:16 AM

I think the OP may mean PROC VARCLUS.

"I also would not use PROC HPGENSELECT because many statisticians have decided that they do not trust forward, backward and stepwise selection methods"

Can you give me more detailed information? If those methods are not trusted, why would SAS not state this in the documentation?

An alternative is to use PROC LOGISTIC with selection=stepwise.
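
A minimal sketch of that alternative, assuming a binary response y coded 0/1 and inputs x1-x475 (hypothetical names). The entry and stay significance levels are illustrative values only.

```sas
/* Sketch only: names and significance levels are assumptions. */
proc logistic data=model;
   model y(event='1') = x1-x475 / selection=stepwise
                                  slentry=0.05 slstay=0.05;
run;
```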

Posted in reply to Ksharp

04-14-2017 10:20 AM - edited 04-14-2017 10:28 AM

You can Google "problems with stepwise regression" to find many articles on this matter.

Why would SAS do this? I don't know, other than to say there are many methods in SAS (and other statistical packages) that were developed a long time ago, newer procedures are now available that have superior properties, but these old procedures remain in SAS (and other statistical packages) ... possibly because some people still want these older methods and don't know about the newer methods. That is all pure speculation on my part.

--

Paige Miller

Posted in reply to PaigeMiller

04-14-2017 10:31 AM

Sorry, Google is blocked by the Chinese government.

PROC HPGENSELECT is quite a new procedure, and the only one that can select variables under many different distributions, such as binomial, Poisson, ...

Posted in reply to Ksharp

04-14-2017 10:38 AM

Other search engines should find the same.

But here are a few articles:

https://www.ma.utexas.edu/users/mks/statmistakes/stepwise.html

http://www.danielezrajohnson.com/stepwise.pdf

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.126.4133&rep=rep1&type=pdf

--

Paige Miller

Posted in reply to k_shide

04-16-2017 12:34 AM

Everyone,

Thanks very much.

Maybe some of you misunderstood, so let me clarify.

I am currently in the initial phase of developing models and do not need academic statistical proof for this.

I have tried everything from TensorFlow in Python to MATLAB modules, including my old friend SAS, to see what the potential upside of the different methods is.

So what I needed was a quick and dirty solution to check for any potential upside from using principal components to summarise the variables into fewer dimensions without losing too much information. I mainly used SAS in the early 2000s, when %treedesc was the cutting-edge CHAID modelling module, and I have lost all the programs and macros I had for automating this type of thing. Moreover, some of you suggested new procedures, which is good.

Some of you also pointed out that there is no easy automated way in PRINCOMP, and that helps me not to waste time.

I thank everyone for the valuable information, and I am determined to read through all the SAS documents to achieve what I intended to do.

Kaz

Posted in reply to k_shide

04-16-2017 07:55 AM - edited 04-16-2017 07:58 AM

k_shide wrote:

So what I needed was a quick and dirty solution to check for any potential upside from using principal components to summarise the variables into fewer dimensions without losing too much information.

But this is different than your original question in which you said:

I want to compress the number of variables say 10-20 variables

These small differences in wording make a huge difference. Your original question seems to imply that you want to select 10-20 of your original 475 variables, which Principal Components will not do.

Of course, Principal Components will give you fewer DIMENSIONS than your original 475 variables. It will not give you 10-20 of your original variables.

So which is it? Do you want fewer of your original variables to work with? Or do you want 10-20 DIMENSIONS?

Anyway, as I explained above, PCA is not the right tool to select fewer DIMENSIONS. PLS is the tool, because it finds DIMENSIONS that are predictive of your Y variable, while PCA does not.

--

Paige Miller

Posted in reply to Ksharp

04-17-2017 04:26 AM

Thanks Ksharp. I remember you have already helped me a couple of times. Everyone helped me, but your simple solution rings a bell with me as an old SAS user's method.

Posted in reply to k_shide

04-17-2017 04:24 AM

Hi Everyone,

Thanks again for the advice.

I am still making myself misunderstood, so let me clarify again.

What I want is to turn X1-X475 into, say, Z1-Z10, which are linear combinations of the original 475 variables, with Z1 to Z10 all at right angles to each other in the 10-dimensional space.

Some of you found that selection=stepwise takes a really long time to converge; it may be easier for SAS to use Z1-Z10, which are still linear combinations of the 475 variables, because the number of variables in the calculations is only 10.

Again, I just want everyone to give me their precious advice.

Kaz
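
The transformation described in this post, as originally asked for with PROC PRINCOMP followed by PROC LOGISTIC, might be sketched as follows. It assumes a binary response y coded 0/1 and inputs x1-x475 in the data set model; the names y, x1-x475 and model_pc are hypothetical, and 10 components is just the number used in the post.

```sas
/* Sketch only: y, x1-x475 and model_pc are assumed names. */
proc princomp data=model out=model_pc n=10 prefix=Z;
   var x1-x475;          /* scores Z1-Z10 are added to model_pc */
run;

proc logistic data=model_pc;
   model y(event='1') = Z1-Z10;
run;
```

The scores Z1-Z10 are uncorrelated linear combinations of the original variables, but they are computed without looking at y.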

Posted in reply to k_shide

04-17-2017 08:32 AM

k_shide wrote:

What I want is to turn X1-X475 into, say, Z1-Z10, which are linear combinations of the original 475 variables, with Z1 to Z10 all at right angles to each other in the 10-dimensional space.

Some of you found that selection=stepwise takes a really long time to converge; it may be easier for SAS to use Z1-Z10, which are still linear combinations of the 475 variables, because the number of variables in the calculations is only 10.

@k_shide Thank you, this is certainly clear now.

I stand by my recommendation to perform Partial Least Squares (PROC PLS) regression on this data, and I stand by my recommendation to NOT use Principal Components Analysis (PROC PRINCOMP); and I add a recommendation to NOT use PROC VARCLUS.

PROC PLS will indeed give you new dimensions Z1-Z10 (or however many you want, it doesn't have to be 10) that are *predictive* of your response. That's what PLS does, it finds orthogonal dimensions that are linear combinations of your original 475 X variables, that have predictive power. Neither PRINCOMP nor VARCLUS tries to find results that are predictive of your response, the algorithms used do not care or know about the Y variables; and so PRINCOMP and VARCLUS can easily produce new dimensions Z1-Z10 that are not very predictive of your response.
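
A rough sketch of getting such dimensions out of PROC PLS, assuming a numeric 0/1 response y and inputs x1-x475 in the data set model (y, x1-x475 and model_pls are hypothetical names). The XSCORE= keyword writes the extracted factors as Z1-Z10, and PROC CORR is used only to check that they come out close to uncorrelated.

```sas
/* Sketch only: names and the choice of 10 factors are assumptions. */
proc pls data=model method=pls nfac=10;
   model y = x1-x475;
   output out=model_pls xscore=Z;   /* X-scores saved as Z1-Z10 */
run;

/* Check the correlations among the extracted dimensions */
proc corr data=model_pls nosimple;
   var Z1-Z10;
run;
```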

--

Paige Miller

Posted in reply to PaigeMiller

04-18-2017 08:55 PM

Dear PaigeMiller, thank you for the confirmation.

I will look at PLS for this purpose.

Kaz
