07-24-2014 10:03 AM
I am a bit stuck here. I am tasked with comparing sales numbers. On the one hand, I have a single account where a specific project has been conducted (the affected account). It is to be compared to the other, non-affected accounts (the control accounts), to see whether there has been any statistically significant change in sales following the implementation of the project, and whether the account stands out from the others on average. I guess I am looking for something more 'powerful' than a boxplot and a definition of outliers/extreme values.
It was OK previously when a group of accounts were affected altogether by the project, so I could perform a t-test, but now, there is only one account in my first sample.
I take it that I cannot perform a Student's t-test, but an elementary search has returned some results about a Levene's t-test that might be able to do this (a slight variation on the test for homoscedasticity). However, I can't find any proc in SAS to do so.
Any ideas please ??
07-26-2014 02:28 AM
It may be more than 30 years since I did Stats 101, but I believe the rule still applies - you cannot say there is a statistical difference between a single observation and a population of observations. Even the most extreme value of some variable could be due to random variation.
However, based on some of the words you've used (account, sales), I can offer some advice. It sounds as though you're dealing with already summarised data - e.g. you have Total Sales $ and Total Sales volume, etcetera, for the accounts - affected and control. You need the raw data. Then you can test whether the differences between populations are statistically significant.
Secondly, you need better framed questions than "Is the affected account statistically different to the control accounts?" - e.g. "Is the rate of conversion from Prospect to Sale higher for the affected account?", or "Does the affected account achieve higher Unit Sales Prices?". So, you'll need to ensure that your raw data can support testing your hypothesis.
Hope this helps.
07-29-2014 08:17 AM
@ DaveBirch -
Thanks for your answer. Yes, I do use already summarised data, but there is a reason for that. I have accounts with sales, and the accounts are very heterogeneous (some have big sales whereas others have very small ones; the differences can be up to 10-fold). That's why, instead of comparing the monthly sales of the single affected account (which can be very big or very small) with the monthly national average of the control accounts (whose sales, thanks to the averaging, are more homogeneous), I decided to transform the data in a more meaningful way and compare the growth rates of the two types of accounts. Basically, the sales were averaged over the 6 months before (Xb) and the 6 months after (Xa) the implementation date of the project, both in the (single) affected account and in the (numerous, 150-200) control accounts, and the growth rate (GR) variable was defined as (Xa-Xb)/Xb.
That is why I have one single GR to compare to a sample of 150-200 GRs.
Dealing with the raw, absolute sales data wouldn't be as meaningful as using the defined, relative variation (GR) variable - I assume.
Any ideas / thoughts about that ?
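For readers following along, the growth-rate construction described in this post could be computed roughly as below. This is only a sketch: the input dataset `sales_long` and its columns `Sales` and `Period` ('Before'/'After') are hypothetical names, and `Name_Account`, `Type` and `Growth_Rate` simply mirror the table shown later in the thread.

```sas
/* Sketch only: sales_long is an assumed dataset with one row per       */
/* account and month; Period flags the 6 months before / after go-live. */
proc sql;
  create table gr as
  select Name_Account,
         Type,
         /* average monthly sales before the project */
         mean(case when Period = 'Before' then Sales else . end) as Xb,
         /* average monthly sales after the project */
         mean(case when Period = 'After'  then Sales else . end) as Xa,
         /* growth rate as defined in the post: (Xa - Xb) / Xb */
         (calculated Xa - calculated Xb) / calculated Xb as Growth_Rate
  from sales_long
  group by Name_Account, Type;
quit;
```

The result has one row per account, which is exactly why only a single GR is available for the affected account.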
07-29-2014 08:28 AM
Oh, in fact an idea just germinated in my mind while explaining my problem: I could basically model my data over the time period with a linear regression, and then test whether the slope of the equation for the affected account differs from the slope of the equation for the control accounts?
But that would assess if the variations between the two types of accounts are different, for the time period. Not if there is a difference between the sales (or their progression) before and after the project, compared to the control accounts (also before and after the project...)
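A minimal sketch of the slope comparison floated here, assuming a hypothetical monthly dataset `sales_long` with columns `Sales`, `Type` ('Affected'/'Control') and a numeric `Month_Index` (1 to 12). None of these names come from the thread. The `Month_Index*Type` term is what tests whether the two slopes differ.

```sas
/* Sketch only: dataset and variable names are assumptions.           */
/* Month_Index is left out of CLASS so it is treated as a continuous  */
/* regressor; Month_Index*Type then lets the slope vary by Type.      */
proc glm data=sales_long;
  class Type;
  model Sales = Month_Index Type Month_Index*Type / solution;
run;
quit;
```

Note that this treats every monthly observation as independent, which repeated months from the same account are unlikely to be. As the post says, it also compares trends over the whole period rather than before versus after.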
07-29-2014 09:24 AM
Seems like you're getting there. I'm glad to see you've actually got more than one observation in your 'affected group'. Not like how you introduced the thread :-).
Yes, the raw, absolute sales data (transaction level) probably would be overkill - since I presume the hypothesis is that the project has, in some way, improved the sales figures for the affected account. So, 12 'affected' obs vs. 150(+) x 12 control obs is the right level of detail I think.
I suggest that a linear regression - testing whether the slope of the equation for the affected account differs from the slope of the equations for the control accounts - is a very worthwhile approach. Remember the slope represents the time progression from before treatment to after treatment.
I would also adopt SteveDenham's approach with PROC UNIVARIATE or PROC MEANS to compare where the affected account ranks within the control group before and after treatment.
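One way to see where the affected account ranks among the controls, sketched under the assumption of a hypothetical dataset `gr` with one row per account and columns `Name_Account`, `Type` and `Growth_Rate`. PROC RANK is used here as a convenient alternative to reading the rank off PROC UNIVARIATE output.

```sas
/* Sketch only: gr is an assumed one-row-per-account dataset.          */
/* FRACTION divides each rank by the number of nonmissing values, so   */
/* GR_Fraction is the account's relative position among all accounts.  */
proc rank data=gr out=gr_ranked fraction;
  var Growth_Rate;
  ranks GR_Fraction;
run;

proc print data=gr_ranked;
  where Type = 'Affected';
run;
```

A value of `GR_Fraction` close to 1 means the affected account's growth rate is among the highest observed. This is a descriptive position, not a formal test.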
07-29-2014 11:49 AM
Having dug deeper, I don't think I can model my data with a regression without losing much of my granularity - I have roller-coaster sales!
Moreover, I don't have 12 figures as you're assuming, because they are growth rates (they are not monthly). I only have one growth rate per account for the time period (that covers, say, one year - 6 months before and 6 months after the project).
I end up with a table like that, if I use the GR:
Name_Account   Type       Growth_Rate%
AccountAA      Affected   x_AA           <-- compare this GR...
AccountAB      Control    x_AB           ... to the GR of all the other, control accounts below
...            ...        ...
AccountZZ      Control    x_ZZ
BUT ideally I would also take the time period into account (I can perhaps define a binary variable, 'Period': Before/After), which would group the sales depending on whether their month falls before or after the project's implementation.
So perhaps my question comes down to an ANOVA whose Y=sales and X=Type (assess if the presence of the project has had an effect), Period (assess if the sales before and after the project are different), Type*Period (impact of the interaction)
Does that make any sense at all ??... :smileyconfused::smileyconfused:
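The model described in this post (Type, Period and their interaction) could be written along the following lines. This is a hedged sketch, not a recommendation: `sales_long`, `Sales`, `Name_Account`, `Type` and `Period` are assumed names, and a random intercept per account has been added (an assumption beyond what the post proposes) because months from the same account are not independent.

```sas
/* Sketch only: dataset and variable names are assumptions.            */
/* Type*Period is the term of interest: it asks whether the            */
/* before/after change differs between affected and control accounts.  */
proc mixed data=sales_long;
  class Name_Account Type Period;
  model Sales = Type Period Type*Period / solution;
  random intercept / subject=Name_Account;  /* account-level baseline */
run;
```

With a single affected account, the interaction estimate rests on that one account's monthly values, so its standard error should be read with caution.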
07-30-2014 01:29 AM
ANOVA is another form of linear model, but its name should give you a clue about the level of data you need. ** ANalysis Of VAriance ** If you throw away that "roller-coaster sales" ride (because it seems confusing or inconvenient), you are throwing away the basic information that ANOVA (and linear regression for that matter) works with. If you have only one observation that is "affected", then you have no variation for "affected". (And, don't forget that the account is not "affected" in the months before the project.) Remember, average values inherently show less variance than the larger population of values from which they are derived. This can lead to false conclusions about whether an averaged value is significantly different to other averaged values.
So, are you doing the averaging - or, is that how the data is supplied?
07-30-2014 03:11 AM
I don't know what to do. I can't find a way to transform the data to assess them with an ANOVA (which could be a great solution to grasp all the variability of the sales).
I think that my analysis comes down to two questions, and I don't know in which order or how to handle them, namely "are the sales different before and after the project?" AND (at the same time) "are the sales different from the national sales?"
Perhaps, is it possible to do a t-test (comparing before and after sales) but adjusting for national (control) sales ?
I think I'm getting confused
07-30-2014 10:44 AM
Ooooh... Digging further I have discovered an entire subfield I was unaware of, time-series analysis. I think that what I have comes down to an intervention model, that I could handle with an ARIMA. What an illumination is that !!
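A rough sketch of what such an intervention model might look like in PROC ARIMA, assuming a hypothetical dataset `sales_long` with `Sales`, `Type` and a numeric `Month_Index` (1 to 12, with the project starting after month 6, following the 6-months-before/after window described earlier). The AR(1) order is an arbitrary placeholder, not something identified from data.

```sas
/* Sketch only: names and the AR order are assumptions.              */
/* Step is 0 before the project and 1 afterwards (a step dummy).     */
data affected_ts;
  set sales_long;
  where Type = 'Affected';
  Step = (Month_Index > 6);
run;

proc sort data=affected_ts;
  by Month_Index;
run;

proc arima data=affected_ts;
  identify var=Sales crosscorr=(Step);
  estimate p=1 input=(Step);   /* coefficient on Step = level shift */
run;
quit;
```

With only 12 monthly points, such a model would be very fragile; a longer pre-project history would be needed for the estimates to mean much.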
07-26-2014 11:02 AM
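proc ttest data=have h0=300; /* h0 = the single account's value (300 is a placeholder) */
var growth_rate; run; /* growth_rate is a hypothetical variable name for the control accounts' values */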
07-28-2014 10:51 AM
Use PROC UNIVARIATE or PROC MEANS to calculate confidence bounds on the non-affected accounts, and determine if they cover the single point of interest.
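A minimal sketch of this suggestion, assuming a hypothetical dataset `gr` with one row per account and columns `Type` and `Growth_Rate`:

```sas
/* Sketch only: gr, Type and Growth_Rate are assumed names.           */
/* LCLM / UCLM are the lower and upper confidence limits of the mean. */
proc means data=gr n mean std lclm uclm alpha=0.05;
  where Type = 'Control';
  var Growth_Rate;
run;

/* PROC UNIVARIATE equivalent: CIBASIC prints confidence limits for   */
/* the mean, standard deviation and variance, assuming normality.     */
proc univariate data=gr cibasic;
  where Type = 'Control';
  var Growth_Rate;
run;
```

The affected account's growth rate is then compared by eye against the printed limits.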
07-29-2014 04:17 AM
@ SteveDenham -
Thanks for your answer - yes I had thought of that too, but I cannot tell the confidence with which I have answered the question - there is no decision variable or probability associated to formally assess my answer to the problem... I don't know if this is clear enough?
Besides, for me, a confidence interval for the mean means something like "I can say with 95% confidence that the real (well, population's - as opposed to sample's) mean is contained in this interval". Stricto sensu, this tool doesn't tell us anything else; what meaning/conclusion could we significantly draw from a comparison of a single value, that doesn't even belong to the sample (or the population), with the CI ?
I am not at all rejecting your proposition - in fact I wanted to go with this type of assessment in the first place, but my questioning mind is looking for real answers...
There are so many things you can do with statistics, and if you're not precise and thorough enough, you end up manipulating the tests and the numbers to make them say anything you want. I see statistics as a precise and powerful science if correctly used, so I'm trying not to act the same way as those who run tests left, right and centre without understanding the underlying principles.
Thanks for your help - and moreover taking part in this intellectual challenge :-)
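The distinction drawn in this post can be made concrete with two standard formulas. They are offered as a sketch under the assumption that the $n$ control growth rates are roughly independent and normally distributed, with sample mean $\bar{x}$ and sample standard deviation $s$; the symbol $x_0$ for the affected account's growth rate is introduced here for illustration.

```latex
% Confidence interval for the control MEAN (narrows as n grows):
\bar{x} \;\pm\; t_{1-\alpha/2,\;n-1}\,\frac{s}{\sqrt{n}}

% Prediction interval for ONE further account drawn from the same population:
\bar{x} \;\pm\; t_{1-\alpha/2,\;n-1}\; s\,\sqrt{1+\frac{1}{n}}

% Equivalent test statistic for the single value x_0, with n-1 degrees of freedom:
t \;=\; \frac{x_0-\bar{x}}{s\,\sqrt{1+\dfrac{1}{n}}}
```

The second interval is the one that speaks to a single value: with 150-200 controls the confidence interval for the mean becomes very narrow, whereas the prediction interval stays about as wide as the spread of the control accounts themselves.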
07-29-2014 08:41 AM
Perhaps a Bayesian interval on the mean of the non-affected accounts or a kernel estimate that is relatively "distribution free" would be of interest for comparing the single affected value. You can always specify a "confidence" level when you do this.
07-29-2014 03:57 AM
Thanks for all the answers up here.
About stat@sas's answer - so if I understand correctly, you're using the t-test here to compare a sample mean to a theoretical mean? Doesn't that pose a problem for the definition of the hypothesis, because this is not 'exactly' a theoretical mean?
I don't know if my concerns are legitimate or not. It's just that I have never used the theoretical/observed t-test to answer such a question, so I want to be sure that this is the right thing to do.