MiloKK
Calcite | Level 5

Hi there,

I’m writing this first post basically because I’m going crazy analyzing a set of quite complicated data ;)

I’ve been dealing with data mining problems for several years, but mostly focused on classification and association problems. In this case I have to develop a multivariate regression model to explain how investments in marketing, communication and HR training affect the commercial output (e.g. sales) of different kinds of products. The purpose of the model is of course explanatory, but also predictive, as the final result must be a sort of simulation dashboard (useful for planning the investments in future campaigns).

Structure of data: longitudinal, time series cross-sectional → weekly observations (3 years long) for 5 different product categories. So I have about 750 observations. Dependent variable: product category sales; independent variables: about 200 potential predictors.

Ok, nothing particularly strange.. but..

  • The sales differ a LOT between product categories: for example, for one category they could be thousands of units (on a weekly basis) for 40 weeks (and zero for the rest), while for another category I have hundreds of units weekly, but over 100 weeks. So the dependent variable is full of zeros across time, and how many depends on the product category. It’s very inhomogeneous!
  • The predictors are not homogeneous either: I have some quantities that differ from category to category (e.g. investments in Paid Search on the internet), while others vary across time but are equal across product categories (e.g. customer satisfaction index); there are also a lot of null values in the predictors.
  • Most of the predictors are strongly correlated with each other;
  • Within each category, the dependent variable is autocorrelated;

Ok, this is the framework. I tried a lot of models (ARIMAX, GLM, pooled regression..), but nothing worked well: poor fit and poor interpretability. I need some good tips.. any ideas?

Thanks for your help!

M

M_Maldonado
Barite | Level 11

Hey Milo,

If you plan to measure the effectiveness of marketing campaigns in the future, I would highly recommend investing resources in design of experiments so that you can use a control group and incremental response models. Take a look at this thread ( ) for a paper and a video. You can find more information in the Enterprise Miner Reference Help (press F1 in Enterprise Miner). This kind of model will shed more light on which factors impact your sales.

In the meantime I can offer these suggestions.

Time Series Analysis

It seems to me that you should have an ARIMA model for each product. You might also want to try using just the last year of data (perhaps the 52 weeks that best capture seasonal effects like Easter, Christmas, and other important holidays in your data?). If you think that the null values are tripping up your model, turn those variables into an accumulator (for example, dollars invested to date instead of dollars invested in a given week).
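To make the accumulator idea concrete, here is a minimal sketch in Python (not SAS, purely for illustration). The `weekly_spend` values and the `to_date` helper are made up, and treating a null as "nothing invested that week" is an assumption that may not fit every predictor.

```python
from itertools import accumulate

# Hypothetical weekly spend for one product category.
# None marks a null value; zeros are weeks with no investment.
weekly_spend = [0.0, 500.0, None, 0.0, 250.0, None]

def to_date(values):
    """Turn per-week amounts into a running total (amount invested to date).

    Assumption: a null (None) means nothing was invested that week,
    so it is counted as 0.0.
    """
    return list(accumulate(0.0 if v is None else v for v in values))

spend_to_date = to_date(weekly_spend)
print(spend_to_date)  # [0.0, 500.0, 500.0, 500.0, 750.0, 750.0]
```

The accumulated series has no gaps and never drops back to zero, which is the point of the suggestion. With several product categories, the running total would be computed separately within each category.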

I wouldn't think that you have to standardize your predictors (indexes, money counts, etc.). But give it a try with and without proc stdize before modeling and see how it goes.

I am not a time series expert, but I would think that model selection would take care of autocorrelation. I found this doc that might come in handy for selection (look for automatic model selection in the SAS/ETS(R) 9.2 User's Guide).

Regressions

If your audience is more familiar with regressions than with ETS analysis, try aggregating product sales by product category and training quarterly or half-yearly models. I would hope that your 200 predictors, summarized over Q1 and Q2, are good predictors of Q3 sales. Use variable selection so that you end up with only a handful of the most predictive variables.
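A rough sketch of the aggregation step, again in Python for illustration only. The row layout `(category, week_start, sales, spend)`, the sample numbers and the helper names are all hypothetical; a real table would have many more predictor columns, and some of them (an index, for example) would be averaged rather than summed.

```python
from collections import defaultdict
from datetime import date

# Hypothetical weekly rows: (category, week_start, sales, spend)
rows = [
    ("A", date(2023, 1, 2), 100.0, 50.0),
    ("A", date(2023, 1, 9), 120.0, 0.0),
    ("A", date(2023, 4, 3), 80.0, 30.0),
    ("B", date(2023, 1, 2), 10.0, 5.0),
]

def quarter_of(d):
    """Return (year, quarter) for a date, with quarters numbered 1 to 4."""
    return (d.year, (d.month - 1) // 3 + 1)

def aggregate_quarterly(rows):
    """Sum sales and spend per (category, year, quarter)."""
    totals = defaultdict(lambda: [0.0, 0.0])
    for category, week_start, sales, spend in rows:
        key = (category, *quarter_of(week_start))
        totals[key][0] += sales
        totals[key][1] += spend
    return {key: tuple(values) for key, values in totals.items()}

quarterly = aggregate_quarterly(rows)
for key in sorted(quarterly):
    print(key, quarterly[key])
```

Keep in mind that three years of weekly data for five categories collapses to roughly 60 quarterly rows, so variable selection over 200 predictors has to be very aggressive at this level.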

Personally I would prefer this second approach, as I find it easier to explain. Since you want to both explain and predict, you might have to take the best of each model. For example, you may want to use your regression model to find the most important factors that drive your product sales, but use your time series model to actually predict future sales.


It sounds like you are on the right track. Good luck with your models!


Take care,

M

MiloKK
Calcite | Level 5

Thanks Miguel, very useful suggestions!

The accumulator tip could be very interesting.. for now we are proceeding with ARIMAX models, with good results. We standardized the dependent variable (creating a more homogeneous index variable) and it seems to have worked out. Let's see what happens after the last fine tuning..
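The post does not say how the dependent variable was standardized. One common way to get a more homogeneous index across categories is a z-score computed within each category, sketched below in Python. The function name, the sample numbers and the choice of the population standard deviation (`pstdev`) are assumptions for illustration, not necessarily what was done here.

```python
from collections import defaultdict
from statistics import mean, pstdev

def standardize_by_category(rows):
    """rows: list of (category, sales) pairs.

    Returns (category, z) pairs in the same order, where z is the sales
    value minus the category mean, divided by the category standard
    deviation. A category with zero spread gets z = 0.0 for every row.
    """
    by_category = defaultdict(list)
    for category, sales in rows:
        by_category[category].append(sales)

    stats = {c: (mean(v), pstdev(v)) for c, v in by_category.items()}

    result = []
    for category, sales in rows:
        m, s = stats[category]
        result.append((category, (sales - m) / s if s > 0 else 0.0))
    return result

# Hypothetical data: category A sells thousands of units, B hundreds.
sample = [("A", 1000.0), ("A", 3000.0), ("B", 100.0), ("B", 300.0)]
print(standardize_by_category(sample))
```

After this step both categories are on the same scale (here -1 and +1 in each), even though the raw sales differ by an order of magnitude. Forecasts made on the index have to be converted back with the same per-category mean and standard deviation.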

Consider that we do not have information on the single customer, so it would be difficult to use uplift models in this case.

cheers

Milo

M_Maldonado
Barite | Level 11

Great to hear that Milo!

Just curious, what are you using to standardize? Just proc stdize? If you are using something else, can I borrow an example from you? Someone asked about this the other day...

Bummer that you don't have info per customer...

Good luck with your Arimax!

-Miguel
