09-25-2017 05:16 AM
I have a question, and please forgive me if it is too basic.
I have a big data set that I have filtered and prepared in order to check whether one of my columns fits the normal distribution. I expected to see a good fit, but I do not, so I would really appreciate clarification on:
1. Why would someone set the mean and standard deviation parameters manually? They are calculated automatically, so...
2. What else should I be careful of when running the process? I mean, is there something in the code that might be tricky and lead me to wrong conclusions if I do not take it into consideration?
09-25-2017 08:07 AM
0. You say "I expected to see a fitting, while I am not." Could you say more about what you are doing, and maybe post the code and the image that you get?
1. Setting the values of the parameters is useful when you want to compare the data distribution to a population distribution. For example, you might have census data that gives the heights of American males. If you then sample 50 males from a Native American tribe in Arizona, you might be interested in overlaying the distribution of the population on the data. To do that, you would specify the mean and standard deviation of the population, not of the data sample.
2. For normal distributions, fitting the data is easy because the sample mean and sample standard deviation are used as the estimates for the normal parameters. For non-normal distributions, the software often performs maximum likelihood estimation, so there is a chance that the optimization might not converge.
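As a side note, the distinction between estimating the parameters from the data and specifying population values can be sketched outside of SAS. A quick Python/scipy illustration (the heights and parameter values here are made up for the example):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical sample: heights (cm) of 50 males from one subgroup
sample = rng.normal(loc=172.0, scale=6.0, size=50)

# (a) Fit a normal to the sample: for the normal distribution the
# estimates are simply the sample mean and the (MLE) standard deviation
mu_hat, sigma_hat = stats.norm.fit(sample)

# (b) Overlay a *population* curve instead: specify the parameters
# yourself (e.g., from census data) rather than estimating them
mu_pop, sigma_pop = 175.0, 7.0
grid = np.linspace(150, 200, 101)
pop_curve = stats.norm.pdf(grid, mu_pop, sigma_pop)

print(round(mu_hat, 1), round(sigma_hat, 1))
```

In case (b) the curve is deliberately not the best fit to the sample; the visual gap between the sample histogram and the population curve is exactly what you would be inspecting.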
09-25-2017 08:26 AM
First of all, thank you very much for your quick response.
I expect the train dwell times at the stations to fit a normal distribution, as this is what previous research has found. I just removed some outliers from my data, and the resulting histogram is attached in the photo. The intervals (x-axis) have now dropped a lot, which is not realistic for railways.
I am a novice user of SAS Enterprise Guide, but what I have done is:
1. Filter my data, keeping the values I need for my analysis (i.e., I have a column with the train dwell times).
2. Select the task "Distribution Analysis".
3. Choose the normal distribution, outputting a histogram and quantile plot for visual evaluation alongside the normality tests.
4. Visually evaluate, and check the p-values for rejecting/accepting normality.
P.S. Please note that due to a confidentiality agreement, I am not able to disclose much. I am attaching a screenshot of the output histogram.
09-25-2017 08:55 AM
First, I suggest you use ODS graphics.
Tasks --> Describe --> Distribution Analysis, then click on "Distributions: Summary" and click the ODS radio button under Graphics Style.
How many observations in the data?
It looks like you might have repeated values and rounded data. Rounded data can cause the histogram to look strange if the bin width is too small, as explained in this article.
The bin width can be set by using code. Go to the code window, and copy/paste/modify the code to add the MIDPOINTS= option to the HISTOGRAM statement. The code might look something like this:
proc univariate data=MYDATA;
   var Real_run_time;
   histogram Real_run_time / midpoints=(0.25 to 1.5 by 0.25);
run;
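For readers outside SAS, the idea of fixing the bin midpoints can be sketched in Python with numpy (a rough analogue only; the dwell times here are simulated). Note that SAS's MIDPOINTS= option fixes the bin centers, while numpy's histogram takes bin edges, so the midpoints must be shifted by half a bin width:

```python
import numpy as np

rng = np.random.default_rng(0)
dwell = rng.normal(0.9, 0.2, size=1000)  # simulated dwell times, in minutes

# Analogue of midpoints=(0.25 to 1.5 by 0.25): bin centers 0.25, 0.50, ..., 1.50
width = 0.25
midpoints = np.arange(0.25, 1.5 + width, width)

# Convert centers to edges: shift left by half a width, add the final right edge
edges = np.append(midpoints - width / 2, midpoints[-1] + width / 2)

counts, _ = np.histogram(dwell, bins=edges)
print(len(midpoints), len(counts))
```

Values outside the outermost edges are simply not counted, which matches how a fixed midpoint list clips the display range.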
09-25-2017 09:28 AM
First thing done, and thank you for the advice.
I have lots of observations (approximately 35,000).
Since in railways the best accuracy we can achieve is 10 seconds, and my dataset is in minutes, the data are in the form x.xx without any additional rounding.
I changed the bin width as explained, and the histogram now looks much better (thank you!). Is there also a way to define it in the Tasks --> Describe --> Distribution Analysis task? For now I created a new "program" in order to try it out.
From a statistical point of view, could the bin width affect the result?
09-25-2017 09:38 AM
I will let an EG expert suggest how to address this issue in a task.
From a statistical point of view, the bin width does not matter, but the fact that you have coarsely rounded data is a big problem.
Your data are rounded to the nearest minute, which results in only a few unique values out of the 35,000 observations. Normally distributed values are CONTINUOUS, but you have discrete measurements. Therefore statistical tests for normality will REJECT the hypothesis that the data are normal. See the example in the third section of the article "Goodness-of-fit tests: A cautionary tale for large and small samples."
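To illustrate this point away from SAS, here is a small Python/scipy simulation (the parameter values are arbitrary): data drawn from a genuinely normal population are rounded to the nearest minute, leaving only a handful of distinct values, and the Anderson-Darling statistic for the rounded sample blows up far past the rejection threshold.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Simulated dwell times drawn from a genuinely normal population
times = rng.normal(loc=2.0, scale=0.5, size=5000)

# The same values rounded to the nearest minute: only a few distinct
# values remain, so the sample is effectively discrete
rounded = np.round(times)

ad_cont = stats.anderson(times, dist='norm')
ad_round = stats.anderson(rounded, dist='norm')

print(np.unique(rounded).size)                       # a handful of unique values
print(ad_cont.statistic, ad_round.statistic)         # rounded statistic is huge
```

The empirical CDF of the rounded sample is a staircase with large jumps, so its distance from any smooth normal CDF is large no matter how the parameters are chosen, and the test rejects.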
09-25-2017 09:50 AM
I went through the article and it makes perfect sense; I see your point. Please have a look at a small sample of my data. Apparently, the provider of the data applied some corrections, and the values vary quite well, with up to two decimal places of accuracy (or at least that's what I can see).
09-25-2017 10:28 AM
Regarding the discrete-values problem, I thought that I could work with the data in intervals that cover my accuracy (i.e., the probability of having a value between 0.83 and 1 minute). Does this make sense, and if so, how can one define it in the code?
09-25-2017 10:57 AM
If you truncate the time to two decimal places (hundredths of a minute?), then the GOF tests should "work better" for small data sets, where "work better" means they will not reject the hypothesis of normality when the population is truly normal. As I explain in the "Goodness of Fit" article, however, for very large samples the tests will detect the rounded values or deviations in the tails and reject normality.
From what little you've told me, the underlying process is unlikely to be normal because a train will rarely arrive more than a few minutes early but it can arrive very late. Therefore the distribution of train times is likely to be skewed to the right.
In practice, normality tests are not very useful. Can you say why you are concerned about whether hypothesis tests on these data accept or reject normality? What is the practical question that you are trying to answer?
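Regarding the interval idea from the earlier post: once a distribution has been fitted, the probability of any interval is just a difference of CDF values. A quick Python/scipy sketch, using made-up parameter values (in practice the mean and standard deviation would come from your fitted distribution, e.g., the PROC UNIVARIATE estimates):

```python
from scipy import stats

# Hypothetical fitted parameters (stand-ins for illustration only)
mu, sigma = 1.0, 0.5

# P(0.83 <= X <= 1.00) under the fitted normal: difference of CDFs
p = stats.norm.cdf(1.00, loc=mu, scale=sigma) - stats.norm.cdf(0.83, loc=mu, scale=sigma)
print(round(p, 4))
```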
09-26-2017 03:30 AM
I see. I will try and repeat the process, isolating a sample of only a week or so.
You are right, and that will come afterwards; I expect the departure delays, for example, to fit an exponential distribution (based on existing research outcomes). Nevertheless, these data refer to the dwell times (the times that a train stops at a station).
It is because there is existing literature claiming so, and if that applies to my research too, for the railway line in the country I am looking into, I would be able to use the fitted distribution for simulation purposes afterwards.
The last question I would like to ask is: why does changing the bin width not affect the fit? Isn't it like a grouping of values? If the width is big enough to cover the gaps that the discrete data create, then we will have continuity between the groups. Is this thought wrong?
09-26-2017 05:18 AM
Yes, that thought is not correct. Goodness-of-fit tests do not use bins, only the underlying data. The bins are used to form a visual estimate of the density. Different bin widths bring out different features of the data, but they do not change the data or the hypothesis tests in any way.
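A tiny Python simulation of this point (the data are simulated): the histogram counts change with the bin width, but a goodness-of-fit statistic is computed from the raw values, so binning never enters the computation.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
data = rng.normal(1.0, 0.3, size=2000)   # simulated dwell times, in minutes

# Two histograms of the same data with very different bin widths
counts_wide, _ = np.histogram(data, bins=5)
counts_narrow, _ = np.histogram(data, bins=50)

# The GOF test sees only the raw data; no bin information is passed in
stat_1 = stats.anderson(data, dist='norm').statistic
stat_2 = stats.anderson(data, dist='norm').statistic

print(counts_wide.sum(), counts_narrow.sum(), stat_1 == stat_2)
```

Both histograms account for every observation, and the Anderson-Darling statistic is identical no matter how you choose to draw the picture.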