We’re smarter together. Learn from this collection of community knowledge and add your expertise.

Demand-Based Store Clustering by Example with SAS Enterprise Miner: Part 1

by SAS Employee ksouthall on ‎12-18-2015 11:19 AM - edited on ‎12-18-2015 11:20 AM by Community Manager (1,368 Views)

Along the same lines as the Learn by Example with SAS® Enterprise Miner series, this tip explains how to apply the cluster analysis technique to a real-world example using process flow diagrams. 


In general, the clustering technique is used to perform observation grouping, which can be used to segment databases. Clustering places objects into groups, or clusters, suggested by the data. The objects in each cluster tend to be similar to each other in some sense, and objects in different clusters tend to be dissimilar. If obvious clusters or groupings could be developed prior to the analysis, then the clustering analysis could be performed by simply sorting the data.


For more information on how clustering works, please watch this video:



Why demand mix store clustering? Here's a retail example.

Approximately 20 percent of a retailer’s categories generate 80 percent of its revenue. Historically, companies have relied on imperfect volume-based store clustering or geographically-based approaches which limit their ability to focus assortments for local customers. Moreover, research shows that stores with the same volume or geographic proximity may have very different sales patterns and customer profiles. Traditional clustering methods do not sufficiently identify localized trends. 


This suggested approach benefits retailers by using statistical analysis to identify all significant and distinct buying patterns in a market. This method is purposeful and based primarily on demand and secondarily (in a separate step) defines their demographic profile. It is important to note that by separating these steps, you ensure that clusters are created purely on customer demand of products. This concept will increase sales, margins and inventory productivity by aligning local assortments with local demand.


What data is required?

  • Historical sales data at store & sku level. You can use unit, margin or revenue dollar sales for a particular product category.
  • Product descriptive data (i.e. Color, Fabric, Collar, Silhouette, Price Range, etc.)
  • Store Attributes – Format, Climate, Size, Demographic (Population, Household Income, Average Age of Household)

How to prep the data

It is necessary to transpose the sales data by product ID and use the product names as labels. This allows you to better understand the product descriptions in charts within SAS Enterprise Miner. Also, for the clustering node to perform without errors the null values should be replaced with zeros. The following code demonstrates how this might work.  



 /*       Store Product Clustering Example                            */
/* Copyright(c) December 2015 */ /* by SAS Institute Inc., Cary, NC, USA */ /*-------------------------------------------------------------------*/ /* This material is provided "as is" by SAS Institute Inc. There */ /* are no warranties, expressed or implied, as to merchantability or */ /* fitness for a particular purpose regarding the materials or code */ /* contained herein. The Institute is not responsible for errors */ /* in this material as it now exists or will exist, nor does the */ /* Institute provide technical support for it. */ /*-------------------------------------------------------------------*/ /* First sort data by store number, note: the use of a TEST libref */ PROC SORT; DATA=TEST.SALES_DATA(KEEP=sales_units PROD_NAME PROD_ID store_number); OUT=TEST.SORTTempTableSorted; BY store_number; RUN; /* Next, transpose the data and apply product name as labels */ PROC TRANSPOSE DATA=TEST.SALES_DATA2; OUT=TEST.SALES_DATA; BY store_number; ID PROD_ID ; idlabel PROD_NAME; VAR sales_units; RUN; QUIT; /* Also, its important to Loop through data and replace null values with zeros */ data TEST.SALES_DATA2; set TEST.SALES_DATA2; array change _numeric_; do over change; if change=. then change=0; end; run;



Additionally, you might consider removing outliers as they could appear as individual clusters. Also, you might want to consider standardizing the data so that the distances are of equal practical importance. The Transform Variables node in SAS Enterprise Miner could be used to standardize the variables.


More on clustering options

In this example you see two clustering options. The first uses the default SAS Enterprise Miner settings and produces many different cluster segments. For a more practical approach, the second “user chosen” option is applied where four clusters are selected. The correct number is often ambiguous and can depend on many things including shape and scale of the distribution of points in a data set and the desired clustering resolution of the user. The best number of clusters depend on both your data and how you will use the results for your situation.



Additional clustering options are below. You are encouraged to try different methods to find the technique that work best with your data.

  • Average — the distance between two clusters is the average distance between pairs of observations, one in each cluster.
  • Centroid — the distance between two clusters is defined as the (squared) Euclidean distance between their centroids or means.
  • Ward — (default) the distance between two clusters is the ANOVA sum of squares between the two clusters summed over all the variables. At each generation, the within-cluster sum of squares is minimized over all partitions obtainable by merging two clusters from a previous generation.


Descriptive statistics

Now that we have created pure demand-based clusters, the next step is to quantitatively describe the main features of the collection of information on each cluster using new attributes. This example leverages both the StatExplore and Decision Tree nodes in SAS Enterprise Miner.  



In the above diagram, we join the new cluster label data from Enterprise Miner Diagram 1 and merge this with additional sales, product, store and demographic information. When incorporating additional data we can better understand each cluster. By using the SAS Code Node you can set each individual cluster label as the “Target” variable.That way we can use the Decision Tree and StatExplore node to find the variables that best describe each cluster.


This concludes Part 1 of this Tech Tip. In the second installment, we will discuss options regarding how to further investigate the product and store level attributes in a more balanced and useful way.

Your turn
Sign In!

Want to write an article? Sign in with your profile.