08-08-2016 05:16 PM
I'm not looking for a lesson in stats, as I am perfectly capable of researching statistical methods and I have a background in econometrics and stats. I would, however, like to ask if anyone can point me in a good direction, or even just give me the name of the class of models I am describing.
I have data sets that typically range from 1,000 to 8,000 observations. There are typically 50-200 individuals (in my case they are marketing channels), and each has multiple observations. Because the individual effect of each observation is typically mixed with the effects of several other individuals, the effect of a single channel is not isolated. Instead, I have buckets (if you will) of effects that I know came from, say, Individuals A, B, G, E, and F.
For example, I have 8,000 unique effects (observations) caused by 100 individuals. On average, 5 random individuals have an effect at a time; thus, I have 8000/5 = 1,600 groups.
Group 1: Individual A, Individual B, Individual C, Individual D, Individual E
Effect of Group 1: $5,000
Group 2: Individual A, Individual C, Individual G, Individual H, Individual L
Effect of Group 2: $4,000
Group 3: Individual Z, Individual E, Individual Y, Individual D, Individual Q
Effect of Group 3: $8,000
So above I have 3 groups of 5 individuals whose aggregated effects are observed. My goal is to use the entire dataset to determine how much of the effect of each group was due to the individuals of the group.
For Group 1, the end goal would be something like:
Ind. A = 40% = $2,000
Ind. B = 10% = $500
Ind. C = 15% = $750
Ind. D = 20% = $1,000
Ind. E = 15% = $750
I am assuming this is going to be a probabilistic model, and possibly a panel data model.
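As a sketch of this setup (my illustration in Python with simulated numbers, not a claim about any particular procedure): each group is a 0/1 row of a design matrix, the group total is the response, and a no-intercept least-squares fit recovers the per-individual effects when the data are noise-free.

```python
# Sketch: the attribution problem as a no-intercept linear system.
# X[g, i] = 1 if individual i was active in group g; y[g] is the group total.
import numpy as np

rng = np.random.default_rng(0)
n_groups, n_indiv = 1600, 100
true_effects = rng.uniform(100, 1000, size=n_indiv)  # unknown per-individual effects

# Each observed group is 5 random individuals, as described above.
X = np.zeros((n_groups, n_indiv))
for g in range(n_groups):
    X[g, rng.choice(n_indiv, size=5, replace=False)] = 1.0

y = X @ true_effects  # observed group totals (noise-free for illustration)

est, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(est, true_effects))  # exact recovery in the noise-free case
```

With real, noisy data the recovered effects are only estimates, and nothing in plain least squares stops an estimate from coming out negative.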
08-11-2016 06:22 AM - edited 08-11-2016 07:15 AM
I hope I understand the problem correctly.
You have individuals that occasionally go to a shop in groups and buy something.
You want to estimate the average buying amount of an individual.
The problem is that you cannot observe the individuals; you always observe the sum for the group.
Another problem with a similar structure arises when we sell different products in a package to customers. We can only observe the total amount the customers pay for a package.
Linear regression with no intercept can do it:
data have;
input a b c d sum;
datalines;
1 1 0 0 100
1 0 1 1 400
1 1 0 1 900
1 1 0 1 370
1 0 1 1 500
0 1 0 1 250
0 1 1 1 600
;
run;

proc glm data=have;
model sum=a b c d / noint solution;
run;
Edit: depending on what types of "effects" you want to see, maybe you don't need the noint option. Or maybe you want to take the logarithm of the dependent variable (sum); then you will essentially analyze multiplicative effects.
08-11-2016 06:31 PM
Thanks for your response.
I thought linear regression would work as well, but it doesn't seem to be doing the trick. The problem is that the individual effects within a group have a variance of their own. This sometimes gives a particular individual a "negative" effect, i.e. a negative estimated coefficient, which does not make sense: in my case, the effect has to be 0 or greater. That's why I'm searching for something that will tell me the percentage of the effect an individual should get, based on all the combinations of effects from all observable groups.
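One standard way to get the sign constraint is non-negative least squares. Here is a small Python sketch (my illustration, not from the thread) using a hand-rolled projected-gradient solver; in practice a dedicated routine such as scipy.optimize.nnls is more robust.

```python
# Sketch: non-negative least squares keeps every individual's estimated
# effect >= 0, which unconstrained regression cannot guarantee.
import numpy as np

def nnls_pg(X, y, iters=5000):
    """Minimize ||X b - y||^2 subject to b >= 0 via projected gradient descent."""
    step = 1.0 / np.linalg.norm(X.T @ X, 2)  # safe step size from the spectral norm
    b = np.zeros(X.shape[1])
    for _ in range(iters):
        # gradient step on the least-squares objective, then clip to b >= 0
        b = np.maximum(0.0, b - step * (X.T @ (X @ b - y)))
    return b

rng = np.random.default_rng(1)
X = (rng.random((200, 10)) < 0.5).astype(float)  # random 0/1 group memberships
true_b = rng.uniform(0, 500, size=10)            # true non-negative effects
y = X @ true_b + rng.normal(0, 5, size=200)      # noisy observed group totals

b_hat = nnls_pg(X, y)
print((b_hat >= 0).all())  # True: every estimate respects the sign constraint
```

The projection step is what distinguishes this from ordinary least squares: any coefficient that wants to go negative is clamped to zero instead.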
08-12-2016 06:02 AM - edited 08-12-2016 06:07 AM
Yes. To be honest, I had to play around with my fabricated data to get all the estimates positive.
One way to work around this is the RESTRICT statement; you can force coefficients >= 0 with it.
The RESTRICT statement is not available in PROC GLM, but other regression procedures have it, PROC REG for example.
Though this does not solve the problem you describe.
If you have individual variances, it sounds more like a random effects model.
Still, depending on the data you can get negative parameter estimates. And even if all parameter estimates are positive, they have an associated variance, so their distribution includes negative values.
I would investigate whether the model is actually multiplicative (when effects are restricted to be positive, that is often a hint that the model is multiplicative). Then you can simply take the logarithm of the target variable first and model that. In that case the variances of the individuals may become similar, and then you can assume one common variance for the sum.
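The multiplicative idea can be sketched like this (a Python illustration with made-up numbers): if each active individual multiplies the group outcome by its own factor, then the log of the outcome is linear in the 0/1 membership indicators, so an ordinary no-intercept regression on log(y) recovers the log-factors.

```python
# Sketch: a multiplicative attribution model becomes linear after a log transform.
import numpy as np

rng = np.random.default_rng(2)
n_groups, n_indiv = 500, 8
log_factors = rng.uniform(0.1, 1.0, size=n_indiv)  # log of each individual's factor

X = (rng.random((n_groups, n_indiv)) < 0.5).astype(float)  # 0/1 memberships
y = np.exp(X @ log_factors)  # group outcome = product of active individuals' factors

# Regress log(y) on the indicators (no intercept); noise-free, so recovery is exact.
est, *_ = np.linalg.lstsq(X, np.log(y), rcond=None)
factors = np.exp(est)  # back-transform to multiplicative effects, all positive
print(np.allclose(est, log_factors))
```

A side benefit of the back-transform is that the fitted multiplicative effects are positive by construction.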
With 1,600 observations (in this context, 1 observation is one observed aggregated result) and 200 variables you will need to estimate 2x200 parameters (a mean and a variance for each individual). Sounds possible, but... good luck.
Instead of my fabricated data, I simulate data in the following example. It's an additive model. The task is to get back the original means (40, 100, 15, 300).
data have;
array indiv a b c d;
array means _temporary_ (40, 100, 15, 300);
array varia _temporary_ (5, 5, 10, 20);
do obs=1 to 1000;
  sum=0;
  do var=1 to 4;
    indiv[var]=ranbin(0,1,0.5); /*0.5 chance that effect is included*/
    ind_effect=means[var]+sqrt(varia[var])*rannor(0); /*draw the individual's effect*/
    sum+indiv[var]*ind_effect; /*add it to the group total when included*/
  end;
  output;
end;
drop obs var ind_effect;
run;

proc mixed data=have;
model sum=a b c d / noint solution;
random a b c d / type=vc;
run;