11-09-2012 04:58 AM
Hi, I have some interesting data to analyse. There are two datasets (only an example is shown):
1) -1 -1 -1 0 1 1 0 -1 1 1 0 -1 -1 0 -1 1 1 -1 0 0 1 -1 1 1............................
Basically, the datasets have 3 different types of values:
-1, 0, 1 in the first dataset
and -1, 0, or a positive value in the interval [0.5; 1.2] in the second dataset.
The dataset size is ~500 observations.
What I need to find out is the mean and its confidence limits.
Does anybody have an idea how to evaluate the distribution and parameters for such data? The main task is to get a lower confidence limit for the mean, and the minimum size of dataset that should be analysed to get a reliable result.
Please help, anyone. Thanks a lot!
11-09-2012 07:44 AM
What do the values represent? With some idea of the process that generated these values, it might be easier to come up with an answer. I think of the first as ordinal categories, while the second looks like a mixture of some sort. With a better understanding of how these values were generated, we might be able to give a better answer.
11-09-2012 08:26 AM
Thank you for your interest. The data represent the outcomes of a gambling game. Both datasets come from the same observed games.
1) Simplified dataset: if you lose, your result is -1; if it's a draw, the result is 0; if you win, the result is 1. So we take 500 games; we win more often than we lose, and we have some draws. I am trying to calculate the return on investment (average result).
2) The second dataset is based on the same games, but the difference is that when you enter the game you pay a price of 1 unit. If you lose the game, you lose 1 unit, so the profit is -1. If it's a draw, you get 1 unit refunded, so the profit is 0. If you win, you win 1 multiplied by some coefficient from the interval [0.5; 1.2], and the profit is equal to the coefficient.
For example, we know that from the first 500 observations we won 280, lost 170, and drew 50, and the final result is that the average profit per game is 0.05 (5 percent). Keep in mind that the profit from a win can be less than 1 while the amount lost is always 1, which makes the profit small even though we win much more often than we lose.
Does my explanation help?
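The arithmetic in the example above can be checked with a short sketch. It uses only the counts quoted in this post (280 wins, 170 losses, 50 draws) and the quoted average profit of 0.05. The variable names are illustrative, and the implied average coefficient is an inference from those numbers, not a figure given in the post.

```python
# Counts quoted in the post (first 500 games).
wins, losses, draws = 280, 170, 50
n = wins + losses + draws

# Dataset 1: outcomes coded -1 / 0 / 1, so the mean is (wins - losses) / n.
mean_simple = (wins - losses) / n

# Dataset 2: a win pays a coefficient c, a loss costs 1, a draw pays 0, so
# mean_profit = (wins * avg_coef - losses) / n. Solve that for avg_coef.
mean_profit = 0.05  # average profit per game quoted in the post
avg_coef = (mean_profit * n + losses) / wins

print(f"mean of the -1/0/1 data: {mean_simple:.3f}")
print(f"implied average winning coefficient: {avg_coef:.3f}")
```

With these counts the simplified mean is 0.22, and an average winning coefficient of about 0.70 would reproduce the 0.05 profit, which sits inside the stated [0.5; 1.2] interval.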
11-12-2012 09:12 AM
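The -1, 0, 1 data can be modeled with two processes. The first gives the probability of not tying, the second the probability of a win given that the game was not a tie. Based on your data, the probability of a win would be (number of non-zero trials/number of trials) * (number of wins/number of non-zero trials), and the expected value would be (number of non-zero trials/number of trials) * (2 * (number of wins/number of non-zero trials) - 1), which reduces to (wins - losses)/number of trials. The variance could be calculated using the delta method, applying it to the product of two binomials.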
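As a rough illustration of the two-process view, the sketch below computes the mean of the -1/0/1 data and a one-sided 95% lower limit from the counts quoted earlier in the thread (280 wins, 170 losses, 50 draws). It is a plain normal approximation that takes the variance directly from the outcome probabilities rather than working through the delta method, and the variable names are illustrative only.

```python
import math
from statistics import NormalDist

# Counts quoted earlier in the thread.
wins, losses, draws = 280, 170, 50
n = wins + losses + draws

# Process 1: probability that a game is not a tie.
p_nontie = (wins + losses) / n
# Process 2: probability of a win, given that the game is not a tie.
p_win = wins / (wins + losses)

# E[X] = P(non-tie) * (P(win | non-tie) * 1 + P(loss | non-tie) * (-1))
mean = p_nontie * (2 * p_win - 1)

# X**2 equals 1 for every non-tie and 0 for a tie, so E[X^2] = P(non-tie).
var = p_nontie - mean ** 2
se = math.sqrt(var / n)  # standard error of the sample mean

# One-sided 95% lower confidence limit under a normal approximation.
z = NormalDist().inv_cdf(0.95)
lower = mean - z * se

print(f"mean = {mean:.3f}, SE = {se:.4f}, lower 95% limit = {lower:.3f}")
```

The same approach does not carry over directly to the second dataset, because the variance there also depends on how the winning coefficients are spread over their interval.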
However, possibly the best way to estimate the mean and variance for these kinds of mixtures would be by bootstrapping. Randomly sample 100 observations, with replacement, from each distribution and calculate the raw mean. Repeat this about 1000 times, and calculate the overall mean and standard error based on the sample means.
I think the second distribution would be nearly intractable to any other analysis, as the fractional payoff depends on the distribution on the interval [0.5, 1.2]. I doubt very much that the distribution is uniform on that interval. It is probably a non-linear decreasing function like a gamma distribution, but truncated, so that moments would be almost impossible to calculate. A perfect place to use bootstrapping.
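A minimal sketch of the bootstrap described above, in Python rather than SAS. The data here are made up: the win/loss/draw counts are the ones quoted in the thread, but the winning coefficients are drawn uniformly from [0.5, 1.2] purely as a placeholder, which is an assumption and probably not the real shape. The function name `bootstrap_means` and its parameters are hypothetical.

```python
import random
from statistics import stdev

rng = random.Random(2012)  # fixed seed so the sketch is repeatable

# Made-up stand-in for the second dataset (ASSUMPTION: uniform coefficients).
profits = ([rng.uniform(0.5, 1.2) for _ in range(280)]  # wins
           + [-1.0] * 170                               # losses
           + [0.0] * 50)                                # draws

def bootstrap_means(data, n_resamples=1000, size=None, rng=rng):
    """Return the means of n_resamples resamples drawn with replacement."""
    size = len(data) if size is None else size
    means = []
    for _ in range(n_resamples):
        resample = rng.choices(data, k=size)  # sampling WITH replacement
        means.append(sum(resample) / size)
    return means

boot = sorted(bootstrap_means(profits))
boot_mean = sum(boot) / len(boot)
boot_se = stdev(boot)                   # bootstrap standard error of the mean
lower_95 = boot[int(0.05 * len(boot))]  # one-sided 95% percentile lower limit

print(f"bootstrap mean = {boot_mean:.3f}")
print(f"bootstrap SE   = {boot_se:.3f}")
print(f"lower 95% limit = {lower_95:.3f}")
```

By default each resample has the same size as the data, so the standard error refers to the mean of the full sample. Passing `size=100` would instead describe how much the mean of 100 games varies, which is a wider spread.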
11-19-2012 08:31 AM
Thank you so much. I found some ready-made macros for bootstrapping and I will test them with my data in a few days. I will get back with the results. Thanks again.
01-29-2013 02:48 AM
Hi once more,
I tried using the bootstrapping technique to find the distribution and I get X ~ N(0.05, 0.02). Everything is clear to me here. If I put the mean, std dev, and lower/upper bounds of the mean into PROC POWER to get N, I get ~300. That looks totally normal, as I was guessing that 300 would be enough before trying to calculate this. But if I choose 300 random observations from a dataset containing 1000 observations and calculate the mean, every time it is not even close to 0.05. Even when running the bootstrap to generate 300 sample means with N = 300, I get totally different results. So how do I get an N for which the mean would be similar in both cases, with all the data from the set and with N = 300?
Problem: for example, I want to collect another dataset with similar observations (it takes too long to get another 1000 observations), and I want to decide when the number of observations is enough to conclude that the mean is equal to some number and will stay the same in the long term, so that I can invest real money based on these observations.
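One possible reading of the puzzle above, offered only as a guess, is that the 0.02 from the bootstrap is the standard error of the mean for the full dataset rather than the standard deviation of a single game. A sample-size calculation needs the per-game standard deviation, which is much larger. The sketch below shows the usual normal-approximation formula n = (z * sd / margin)^2. The per-game SD of 0.78 is a hypothetical value chosen for illustration, not a number from the thread, and the function names are made up.

```python
import math
from statistics import NormalDist

def se_of_mean(sd_per_game, n):
    """Standard error of the mean of n independent games."""
    return sd_per_game / math.sqrt(n)

def required_n(sd_per_game, margin, confidence=0.95):
    """Games needed so a one-sided normal lower limit lies `margin` below the mean."""
    z = NormalDist().inv_cdf(confidence)
    return math.ceil((z * sd_per_game / margin) ** 2)

sd_guess = 0.78  # HYPOTHETICAL per-game SD; replace with the SD of the real data

# How much the sample mean wobbles at different sample sizes.
for n in (300, 1000):
    print(f"n = {n}: SE of mean = {se_of_mean(sd_guess, n):.3f}")

# Games needed for the lower limit to sit 0.05 below the estimated mean.
print("required n:", required_n(sd_guess, margin=0.05))
```

Under that hypothetical SD, the standard error at n = 300 is about as large as a mean of 0.05 itself, so means of 300-game subsamples would be expected to scatter widely around it.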