BrianLoe
Fluorite | Level 6

I have a data set with a large number of input variables, many of which are highly correlated. The Variable Clustering node does a nice job of reducing the number of variables and selecting a representative for each cluster, but I have a question about the algorithm that the documentation doesn't seem to address.

 

What role does the target variable play in the Variable Clustering node? Are the variables in a cluster grouped just because they are similar to each other, or do they have to have a similar relationship to the target variable as well?

 

In contrast, the Variable Selection node takes into account the strength of association between an input and the target.

1 ACCEPTED SOLUTION

WendyCzika
SAS Employee

The target variable is not used in the Variable Clustering node.  It is an unsupervised method similar to Principal Component Analysis that only looks at the relationship among the input variables.  Hope that helps!
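To illustrate what "unsupervised" means here, the following is a minimal sketch in plain Python. It is not the algorithm used by the Variable Clustering node or PROC VARCLUS; it is a simplified greedy grouping based on pairwise Pearson correlation, and the names `pearson`, `group_inputs`, the 0.8 threshold, and the toy data are all assumptions made for illustration. The point is only that the grouping function never receives a target.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


def group_inputs(inputs, threshold=0.8):
    """Greedy grouping of input variables (hypothetical stand-in, not VARCLUS).

    An input joins the first existing group whose first member it is
    correlated with at |r| >= threshold; otherwise it starts a new group.
    Note that no target variable is passed in anywhere.
    """
    groups = []
    for name in inputs:
        for group in groups:
            if abs(pearson(inputs[name], inputs[group[0]])) >= threshold:
                group.append(name)
                break
        else:
            groups.append([name])
    return groups


# Toy inputs (made up for illustration): x2 is roughly 2 * x1, x3 is unrelated.
inputs = {
    "x1": [1, 2, 3, 4, 5, 6],
    "x2": [2, 4, 6, 8, 10, 13],
    "x3": [5, 1, 4, 2, 6, 3],
}

groups = group_inputs(inputs)
print(groups)
```

Because only the inputs are passed to `group_inputs`, changing the target values cannot change the groups in this sketch.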


4 REPLIES
JasonXin
SAS Employee
Hi,
If you believe the variance associated with each observation varies with the target variable, you may consider listing the target variable in the WEIGHT statement of PROC VARCLUS.
BrianLoe
Fluorite | Level 6

I should just withdraw the question. If two variables act alike, then they would be correlated with the target in the same way as well. Since my goal was to use clustering for variable selection before constructing a regression model, variables that were aligned enough to be in the same cluster would necessarily have similar relationships with the target for regression. There was no need to consider the target variable in the Variable Clustering node other than to withhold it from all of the clusters.

 

I actually got fairly strong results from regression using clustering as my method of variable selection, although a LARS node with the LASSO option proved to be the best model.

 

DougWielenga
SAS Employee

"variables that were aligned enough to be in the same cluster would necessarily have similar relationships with the target for regression"

 

I would disagree with this statement somewhat, since it really depends on the nature of the relationship. Correlation measures linear association, and it is possible for two variables to have the same correlation score with the target yet have very different relationships with it. A variable with a slight linear relationship that provides minor improvements in prediction could have the same correlation with the target as a variable that has a highly predictive quadratic relationship to the target, since correlation only measures linearity. Simpson's paradox assures us that things might not be simple even when the relationships are essentially linear. In data mining problems, the number of dimensions makes it very difficult to investigate the true nature of these relationships without spending an inordinate amount of time doing so.
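An extreme case of the point about linearity can be sketched in a few lines of plain Python. The data and the names `pearson`, `linear_target`, and `quadratic_target` are made up for illustration. When the input is symmetric around zero, a target that is exactly the square of the input is fully determined by it, yet the Pearson correlation between the two is zero.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


# Illustrative input, symmetric around zero.
x = [-3, -2, -1, 0, 1, 2, 3]

# Two hypothetical targets, both exact functions of x.
linear_target = [2 * v + 1 for v in x]   # perfectly linear in x
quadratic_target = [v * v for v in x]    # perfectly quadratic in x

r_linear = pearson(x, linear_target)
r_quadratic = pearson(x, quadratic_target)

print("linear target:   ", r_linear)
print("quadratic target:", r_quadratic)
```

Both targets can be predicted perfectly from `x`, but only the linear one shows up in the correlation, so a correlation score alone does not say how useful an input will be to a model that can capture curvature.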


Cordially,

Doug 
