Challenging regression

02-02-2015 06:21 PM

Hi there,

I’m writing this first post basically because I’m going crazy analyzing a set of quite complicated data.

I’ve been dealing with data mining problems for several years, but mostly focused on classification and association. In this case I have to develop a multivariate regression model to explain how investments in marketing, communication and HR training impact the commercial output (e.g. sales) of different kinds of products. The purpose of the model is of course explanatory, but also predictive, as the final result must be a sort of simulation dashboard (useful for planning the investments in future campaigns).

Structure of the data: longitudinal, time-series cross-sectional → weekly observations over 3 years for 5 different product categories, so I have about 750 observations. Dependent variable: product category sales. Independent variables: about 200 potential predictors.

OK, nothing particularly strange so far, but:

- Sales differ a LOT between product categories: for example, one category may sell thousands of units per week for 40 weeks (and zero for the rest), while another sells hundreds of units per week over 100 weeks. So the dependent variable is full of zeros across time, depending on the product category. It’s very inhomogeneous!
- The predictors are not homogeneous either: some quantities differ from category to category (e.g. investments in Paid Search on the internet), while others vary across time but are equal across product categories (e.g. customer satisfaction index). There are also a lot of null values in the predictors.
- Most of the predictors are strongly correlated with each other.
- Within each category, the dependent variable is autocorrelated.

OK, this is the framework. I tried a lot of models (ARIMAX, GLM, pooled regression...), but nothing worked well: poor fit and poor interpretability. I need some good tips. Any ideas?

Thanks for your help!

M

Posted in reply to MiloKK

02-08-2015 06:23 PM

Hey Milo,

If you plan to measure the effectiveness of marketing campaigns in the future, I would highly recommend investing resources in design of experiments so that you can use a control group and incremental response models. Take a look at this thread ( ) for a paper and a video. You can find more information in the Enterprise Miner Reference Help (press F1 in Enterprise Miner). This kind of model will shed better light on which factors impact your sales.

In the meantime I can offer these suggestions.

**Time Series Analysis**

It seems to me that you should have an ARIMA model for each product. You might also want to try using just the last year of information (perhaps the 52 weeks that best capture seasonal effects like Easter, Christmas, and other important holidays in your data?). If you think that the null values are tripping up your model, turn those variables into accumulators (for example, dollars invested to date instead of dollars invested in a given month).
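The accumulator idea can be sketched in a few lines. This is a Python illustration rather than SAS, the spend figures and the name `to_accumulator` are made up, and treating a missing week as zero spend is an assumption that may not fit every predictor:

```python
from itertools import accumulate

# Hypothetical weekly spend for one product category; None marks a missing week.
weekly_spend = [100.0, None, None, 250.0, None, 50.0]

def to_accumulator(values):
    """Return the running total, treating missing values as zero spend (assumption)."""
    return list(accumulate(0.0 if v is None else v for v in values))

spend_to_date = to_accumulator(weekly_spend)
print(spend_to_date)
```

The resulting series has no gaps and never decreases, so it would normally be computed separately for each product category and restarted wherever a new campaign begins.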

I wouldn't think that you have to standardize your predictors (indexes, money counts, etc.). But give it a try with and without proc stdize before modeling and see how it goes.

I am not a time series expert, but I would think that model selection would take care of autocorrelation. I found this doc that might come in handy for selection (look for automatic model selection in the SAS/ETS(R) 9.2 User's Guide).
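For reference, a plain z-score standardization (mean 0, sample standard deviation 1) looks roughly like this. It is a Python sketch, not the proc stdize call itself; the predictor values and the `zscore` helper are hypothetical, and leaving missing values untouched is an assumption:

```python
from statistics import mean, stdev

def zscore(values):
    """Standardize to mean 0 and sample std 1; None entries are left as None."""
    present = [v for v in values if v is not None]
    m, s = mean(present), stdev(present)  # needs at least two non-missing values
    if s == 0:
        # A constant series carries no scale information; map it to zeros.
        return [None if v is None else 0.0 for v in values]
    return [None if v is None else (v - m) / s for v in values]

# Hypothetical predictors on very different scales.
paid_search = [1000.0, 3000.0, None, 2000.0]
satisfaction_index = [7.1, 7.3, 7.2, 7.6]

paid_search_std = zscore(paid_search)
satisfaction_std = zscore(satisfaction_index)
```

Fitting the same model once on the raw columns and once on the standardized ones is a quick way to see whether the scaling matters for your data.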

**Regressions**

If your audience is more familiar with regressions than with ETS analysis, try aggregating product sales by product category and training quarterly or semester models. I would hope that your 200 predictors, summarized over Q1 and Q2, turn out to be great predictors of Q3 sales. Use variable selection so that you end up with only a handful of the most predictive variables.

Personally I would prefer this second approach, as I find it easier to explain. Since you want to both explain and predict, you might have to take the best of each model. For example, you may want to use your regression model to find the most important factors that drive your product sales, but use your time series model to actually predict future sales.
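Rolling the weekly rows up to quarters could look something like the sketch below. It is Python rather than SAS, the row layout `(category, week_start, value)` and the sample figures are hypothetical, and summing is only one possible summary (an index such as customer satisfaction would more sensibly be averaged):

```python
from collections import defaultdict
from datetime import date

def quarter_of(d):
    """Return (year, quarter) for a calendar date."""
    return d.year, (d.month - 1) // 3 + 1

def aggregate_by_quarter(rows):
    """Sum values per (category, year, quarter); missing values count as zero."""
    totals = defaultdict(float)
    for category, week_start, value in rows:
        totals[(category, *quarter_of(week_start))] += value or 0.0
    return dict(totals)

# Hypothetical weekly sales rows: (category, week start date, units sold).
rows = [
    ("A", date(2014, 1, 6), 10.0),
    ("A", date(2014, 3, 31), 5.0),
    ("A", date(2014, 4, 7), 7.0),
    ("B", date(2014, 1, 6), None),
]
quarterly = aggregate_by_quarter(rows)
```

A week is assigned to the quarter in which it starts, so weeks that straddle a quarter boundary are not split.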

It sounds like you are on the right track. Good luck with your models!

Take care,

M

Posted in reply to M_Maldonado

02-13-2015 10:19 AM

Thanks Miguel, very useful suggestions!

The accumulator tip could be very interesting. We are now proceeding with ARIMAX models, with good results. We standardized the dependent variable (creating a more homogeneous index variable) and it seems to have worked. Let's see what happens after the last fine-tuning.

Keep in mind that we do not have information on individual customers, so it would be difficult to use uplift models in this case.

cheers

Milo

Posted in reply to MiloKK

02-13-2015 10:24 AM

Great to hear that Milo!

Just curious, what are you using to standardize? Just proc stdize? If you are using something else, can I borrow an example from you? Someone asked about this the other day...

Bummer that you don't have info per customer...

Good luck with your Arimax!

-Miguel