07-28-2015 01:49 PM
I'm fairly new to clustering, especially in SAS, and need some help with a cluster analysis.
I have a dataset of 4 variables: Game title, Genre, Platform and Average Sales. The dataset contains 6,740 cases. Game title, Genre and Platform are categorical variables, whereas Average Sales is numeric. Since the number of game titles is very large, I've only created dummy variables for Genre and Platform. Game title is the variable I want to cluster here.
Can someone help me understand how to go about clustering this dataset into meaningful clusters, and how to go about profiling the clusters afterwards? I've searched extensively for a straightforward step-by-step explanation (with SAS code) of how to do this, but most of what I found isn't very conclusive.
Would really appreciate some help here please.
07-28-2015 03:47 PM
I suspect you might get better results if you had price and total number of sales instead of average sales.
I would start with FASTCLUS:
proc fastclus data=have out=want cluster=ClusterId maxclusters = ? ; /* replace ? with your guess at the number of clusters */
var genre platform; /* FASTCLUS needs numeric variables, so use the dummy-coded versions here */
weight AverageSales; /* I think it makes more sense to use the average as a weighting variable if it is a mean currency value; otherwise put it on the VAR statement */
run;
Generally, start with a "guess" at how many clusters you want to deal with. This depends on how you want to use the clusters. Then generate different output sets with values around that initial guess to see how many products change clusters.
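A minimal sketch of that trial-and-error approach, not a definitive implementation. It assumes the dummy variables share the prefixes genre_ and platform_ (hypothetical names, adjust to your data), and the range 2 to 8 is only an illustrative starting guess.

```sas
/* Sketch only: run FASTCLUS once per candidate number of clusters.   */
/* ASSUMPTION: genre_: and platform_: are numeric 0/1 dummy variables */
/* (hypothetical names); kmin/kmax are arbitrary illustrative values. */
%macro try_k(kmin=2, kmax=8);
  %do k = &kmin %to &kmax;
    title "FASTCLUS with maxclusters=&k";
    proc fastclus data=have out=want_k&k cluster=ClusterId
                  maxclusters=&k maxiter=100;
      var genre_: platform_:;   /* numeric clustering variables       */
      weight AverageSales;      /* same weighting idea as above       */
    run;
  %end;
  title;                        /* clear the title when finished      */
%mend try_k;

%try_k(kmin=2, kmax=8)
```

Each run prints a pseudo F statistic and a cubic clustering criterion, and comparing them across runs is one common way to narrow down k. Treat them as rough guides only, since they are not designed with 0/1 dummy variables in mind.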
07-28-2015 04:34 PM
Thanks for the inputs here.
I agree that having price would be more helpful, but unfortunately the data source doesn't record price data. Sorry, I forgot to mention that the dataset with 6,740 cases was derived from a much larger original dataset that contains over a million cases (sales are weekly and in units). To compress the dataset, I used PROC REPORT to create a matrix of average unit sales for every game title on every platform it has been available on from Nov '04 to March '15. I understand that I can't use hierarchical clustering, since I read that it's only effective for datasets with up to about 100 cases, hence I would use PROC FASTCLUS to run a k-means clustering.
Is there a recommended way to identify the optimal number of clusters before I run the k-means clustering, or should I do it iteratively with different k = 1,2,3,4 ... and so on?
Thank you again.
07-28-2015 06:25 PM
As a crude starting point I would look at the sum of genre plus platform. The more interesting results would possibly be those identified clusters where platform mixes within genre, or genre within platform.
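One way to look for that kind of mixing, and to profile the clusters in general, is to cross-tabulate the cluster assignment against the original categories. This is a sketch that assumes want is the OUT= data set from PROC FASTCLUS and still carries the original genre, platform and AverageSales variables alongside ClusterId.

```sas
/* Sketch only: profile clusters from the FASTCLUS OUT= data set.     */
/* ASSUMPTION: want contains ClusterId plus the original variables.   */

/* How genres and platforms are distributed across clusters */
proc freq data=want;
  tables ClusterId*genre ClusterId*platform / nocol nopercent;
  /* One genre-by-platform table per cluster, to spot mixing */
  tables ClusterId*genre*platform / nocol nopercent;
run;

/* Sales profile of each cluster */
proc means data=want n mean median min max;
  class ClusterId;
  var AverageSales;
run;
```

A cluster whose rows are spread over several platforms within one genre (or the reverse) is the kind of mixed cluster described above.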
If you still have the raw data I would be tempted to put that through FASTCLUS. FASTCLUS is designed to give relatively quick results that you can then fine-tune with one of the other cluster procedures. Number of sales per week would probably fit well on the VAR statement.
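A rough sketch of that two-stage idea, under stated assumptions: FASTCLUS first reduces the data to a set of preliminary clusters, and a hierarchical procedure then groups the preliminary cluster means. The data set and variable names other than have, and the values 50 and 5, are illustrative guesses rather than recommendations.

```sas
/* Sketch only. ASSUMPTION: genre_: and platform_: are numeric dummy  */
/* variables (hypothetical names); 50 and 5 are arbitrary choices.    */

/* Stage 1: many preliminary clusters; MEAN= keeps one row per cluster */
proc fastclus data=have maxclusters=50 maxiter=100
              mean=prelim out=prelim_out cluster=PrelimId;
  var genre_: platform_:;
run;

/* Stage 2: Ward's method on the preliminary cluster means.           */
/* PROC CLUSTER should pick up _FREQ_ and _RMSSTD_ from the MEAN=     */
/* data set when they are present; check the log to confirm.          */
proc cluster data=prelim method=ward ccc pseudo outtree=tree;
  var genre_: platform_:;
  copy PrelimId;
run;

/* Cut the tree at a chosen number of final clusters */
proc tree data=tree nclusters=5 out=final noprint;
  copy PrelimId;
run;
```

The final data set maps each PrelimId to a final CLUSTER value, which can then be merged back onto prelim_out by PrelimId to label the individual observations.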
07-29-2015 08:25 AM
@ballardw: Hi. I'm not sure what you mean by 'sum of genre plus platform as a starting point', since both of these are categorical. For example, Genre = Action, Role-Playing, Shooter, while Platform = PS3, DS, WiiU, Wii and the like. In any case, I'll use your earlier inputs as a starting point and see what I can build from there.
@Reeza: Hi, thanks for replying to my problem. I have access to SAS 9.3 and Enterprise Miner. In fact, I'm looking at both platforms to try to solve this clustering problem. Besides this, I really don't know how to interpret the output either. Can you help?