Hi,
I would be grateful if anyone could tell me whether clustering of observations and clustering of variables improve the results of a regression.
Regards,
Shibbir
Hello @shibbir63 ,
I agree with this statement by @PaigeMiller:
> You could always try it both ways and see how good the regressions fit.
Clustering of observations:
I suppose you mean splitting your observations into a few segments and then fitting one model per segment, for example one model for smokers and another for non-smokers, or one model per gender.
This can be beneficial, provided the segments remain large enough.
A single overall model can achieve the same thing, but it can become quite complex and full of interactions if the explanatory variables relate to the response / target very differently across your segments.
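To make the "one model per segment versus one overall model" comparison concrete, here is a minimal sketch in plain Python rather than SAS. Everything in it is an assumption for illustration: the toy smoker / non-smoker data, the noise values, and the helper names `fit_line` and `sse` are all hypothetical.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def sse(xs, ys, a, b):
    """Sum of squared residuals for the line y = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical toy data: the slope of y on x differs between the two segments.
noise = [0.3, -0.2, 0.1, -0.4, 0.2, -0.1, 0.4, -0.3]
xs = list(range(1, 9))
segments = {
    "smoker": (xs, [2.0 + 3.0 * x + e for x, e in zip(xs, noise)]),
    "non_smoker": (xs, [10.0 - 1.0 * x + e for x, e in zip(xs, noise)]),
}

# One overall model fitted on all observations, ignoring the segment.
all_x = [x for sx, _ in segments.values() for x in sx]
all_y = [y for _, sy in segments.values() for y in sy]
a, b = fit_line(all_x, all_y)
sse_pooled = sse(all_x, all_y, a, b)

# One model per segment; add up the residual sums of squares.
sse_segmented = 0.0
for sx, sy in segments.values():
    a_seg, b_seg = fit_line(sx, sy)
    sse_segmented += sse(sx, sy, a_seg, b_seg)

print(f"pooled SSE:    {sse_pooled:.2f}")
print(f"segmented SSE: {sse_segmented:.2f}")
```

Keep in mind that the in-sample fit of the per-segment models can never be worse than that of the single pooled line, so a fair comparison should be made on hold-out data or with cross-validation.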
Clustering of variables:
That's a very good idea (for example, to combat multicollinearity and to reduce dimensionality).
I almost always do it.
A decade ago I always used PROC VARCLUS (SAS/STAT) for this, but nowadays there are multiple techniques you can use (feature construction from variable clusters). See the Model Studio (SAS Viya) documentation.
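To illustrate the general idea of variable clustering, here is a rough sketch in plain Python. It is not the PROC VARCLUS algorithm; it is a much simpler greedy grouping by absolute Pearson correlation, and the 0.8 threshold, the toy data, and the function names `pearson` and `cluster_variables` are all assumptions made up for this example.

```python
from statistics import mean

def pearson(u, v):
    """Pearson correlation coefficient between two equal-length lists."""
    mu, mv = mean(u), mean(v)
    num = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    den = (sum((a - mu) ** 2 for a in u) * sum((b - mv) ** 2 for b in v)) ** 0.5
    return num / den

def cluster_variables(data, threshold=0.8):
    """Greedy grouping: a variable joins the first cluster whose first
    member (the representative) it correlates with at |r| >= threshold;
    otherwise it starts a new cluster."""
    clusters = []
    for name in data:
        for cluster in clusters:
            if abs(pearson(data[name], data[cluster[0]])) >= threshold:
                cluster.append(name)
                break
        else:
            clusters.append([name])
    return clusters

# Hypothetical toy data with two pairs of nearly redundant variables.
data = {
    "height_cm": [150, 160, 165, 170, 180, 185],
    "height_in": [59.1, 63.0, 65.0, 66.9, 70.9, 72.8],
    "age": [34, 22, 51, 45, 29, 38],
    "tenure_years": [10, 2, 25, 20, 6, 14],
}

clusters = cluster_variables(data)
# Keep one representative per cluster as a candidate regression input.
representatives = [cluster[0] for cluster in clusters]
print(clusters)
print(representatives)
```

Keeping only one representative per cluster (or a cluster component instead) removes the near-duplicate inputs that cause multicollinearity, at the cost of discarding whatever little extra information the dropped variables carried.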
Good luck,
Koen
I think it depends on the data, and I'm not aware of any general advice here (maybe others have some). It would also help if you described your data and explained why you think clusters exist and why using them would help.
You could always try it both ways and see how good the regressions fit.