05-16-2013 09:08 AM
I have a table with millions of product-level transactions and want to know whether there are clusters of products in my table that are often purchased together.
I'm using PROC DISTANCE (METHOD=EUCLID), PROC CLUSTER (METHOD=WARD) and PROC TREE, but I'm not entirely sure this is the best way of analysing the data.
I'd appreciate hearing from anyone who has done something similar or has ideas on how to develop this further.
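For context, the pipeline described above looks roughly like this; all data set and variable names (work.basket, prod1-prod450, prod_id, the cluster count of 10) are placeholders, not from the original post:

```sas
/* Sketch of the distance -> cluster -> tree pipeline described  */
/* above. work.basket, prod1-prod450 and prod_id are made-up     */
/* names for illustration only.                                  */

/* 1. Pairwise Euclidean distances between observations */
proc distance data=work.basket method=euclid out=work.dist;
   var interval(prod1-prod450);
   id prod_id;
run;

/* 2. Ward's hierarchical clustering on the distance matrix */
proc cluster data=work.dist(type=distance) method=ward
             outtree=work.tree noprint;
   id prod_id;
run;

/* 3. Cut the dendrogram into, say, 10 clusters */
proc tree data=work.tree ncl=10 out=work.clusters noprint;
   id prod_id;
run;
```

The TYPE=DISTANCE data set option tells PROC CLUSTER that the input is already a distance matrix rather than raw coordinates.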
05-17-2013 04:29 AM
Hope you haven't seen my first version of the answer; I overlooked that you were already using PROC DISTANCE.
So I assume you have used the Jaccard distance, as in the divorce example for PROC DISTANCE. For a pair of products, it is essentially based on the fraction of purchases containing both products among purchases containing at least one of them (the distance is 1 minus that fraction). I would do the same. A possible alternative is to divide the number of combined purchases by the number of single-product purchases (or take the reciprocal, for a distance), just to get another distance measure, and then see what PROC CLUSTER does differently. Comparing the two results might give you a reason to prefer one over the other.
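The Jaccard step from the divorce example translates to something like the following sketch; here rows are products and columns are 0/1 purchase flags, and work.prodflags, buyer1-buyer100 and product are placeholder names:

```sas
/* Jaccard dissimilarity between products, in the style of the  */
/* PROC DISTANCE divorce example. ABSENT=0 treats a missing     */
/* flag as "not purchased". All names here are placeholders.    */
proc distance data=work.prodflags method=djaccard absent=0
              out=work.distjacc;
   var anominal(buyer1-buyer100);
   id product;
run;
```

METHOD=DJACCARD gives the Jaccard *distance* directly (METHOD=JACCARD would give the similarity), so the output can feed straight into PROC CLUSTER.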
05-17-2013 07:44 PM
Thanks for both answers! I'm a newbie to this type of analysis and need to understand whether the functions I'm using are the right ones. I have the table in a format where each row is a customer and the columns represent products purchased, with each column summed by quantity.
I'm sure the answer you provided makes sense but to be honest I don't understand it.
Where can I find the divorce example, please?
05-22-2013 03:11 AM
It's in the SAS Online Doc, here: http://support.sas.com/documentation/cdl/en/statug/65328/HTML/default/viewer.htm#statug_distance_exa.... But I just realized that, applying this to your problem, you would need a rather unusual data structure: observations are products, and columns are purchases, with 1 indicating purchase and 0 indicating no purchase. Instead of having individual purchases as columns, you could of course have buyers observed over a longer time, i.e. count all purchases by one person in one column.
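Since the data currently has one row per customer and one column per product, a transpose along these lines produces the product-per-row layout described above; work.basket, cust_id and prod1-prod450 are assumed names, not from the thread:

```sas
/* Flip a customer-by-product table into a product-by-customer   */
/* table. _NAME_ holds the original product column name; ID     */
/* builds one output column per customer. Placeholder names.    */
proc transpose data=work.basket
               out=work.byprod(rename=(_name_=product))
               prefix=cust;
   id cust_id;
   var prod1-prod450;
run;
```

One caveat: with 1.6m customers this creates 1.6m columns, which is impractical, so in practice you would first aggregate (e.g. one column per buyer over a longer period, as suggested above) or work with a sample.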
06-27-2013 06:34 PM
The observations are customers and the columns are products; each column shows how many units of that product were bought. I then want to use those columns in a cluster analysis to show how much variance is accounted for by each column. So, using cereals as an example: 50% of the variation is accounted for by Corn Flakes, Frosties and Weetabix, or 60% is explained by the same three cereals plus Cheerios, etc.
This then allows me to determine key products, so I can decide which to remove, or even suggest increasing the variety.
05-17-2013 11:01 AM
One way is to do a standard association analysis, e.g. PROC ASSOC followed by PROC RULEGEN. You could also do a sparse singular value decomposition on this data with PROC SPSVD, if you have Text Miner licensed.
05-17-2013 07:22 PM
Thanks for sharing your ideas! I tried using associations (co-occurrences), which gave me useful insight, e.g. 10% of customers who bought single cream also bought double cream. At the same time I generated confidence levels, just as you would with PROC ASSOC. I did this in Base SAS, not in Miner; I do not have Text Miner.
I'm really pleased you mentioned PROC ASSOC and PROC RULEGEN, as I used a variation of that, but it didn't tell me which groups of products were being bought together, hence my preference for clustering.
Just to be clear, I changed my data set so that I have one record per customer and all the different products as columns, measuring total quantity purchased over a period. Can I use that to cluster the products?
05-18-2013 07:26 AM
Thanks Reeza! I think PROC VARCLUS could give me groupings of products purchased together? I have 1.6m customers and 450 products (columns). Can PROC VARCLUS handle data that large, or should I use PROC FASTCLUS first and then VARCLUS, please?
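For reference, a minimal PROC VARCLUS sketch of what is being asked about; the data set name, variable list and the 0.6 threshold are assumptions, not values from the thread:

```sas
/* Variable clustering of the 450 product columns. Splitting    */
/* stops once each cluster's first principal component explains */
/* at least the PROPORTION= share of its variables' variance.   */
/* work.basket and prod1-prod450 are placeholder names.         */
proc varclus data=work.basket proportion=0.6 outtree=work.vtree;
   var prod1-prod450;
run;
```

One practical note: PROC VARCLUS works from the correlations among the 450 columns, so the 450x450 correlation matrix, rather than the 1.6m rows as such, is the main determinant of feasibility.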
05-18-2013 03:44 PM
I don't think it would, based on how I understand cluster analysis. It will give you results, but I'm not sure your interpretation would be correct.
What a proc can handle depends on your computer: if you're on a server it may be fine; if you're on a desktop it may not be.