Hi all,

Currently at the company I work for, we use benchmarks of 'subjects' (I call them 'subjects' for anonymity reasons) to discover subjects with 'unusual behaviour'. Basically, we have a set of subjects, each with a 'score' on a 'risk', and we want to discover the 'highest scorers' (hence the benchmark) in order to further analyse them.

Until now, we have worked with percentiles to define the 'subjects to further analyse'. Say we have 150 subjects with a score between 0 and 1 (a percentage), ranked from high to low. What we did until now was: further analyse all the subjects above the 90th percentile, i.e. the 10% highest scorers. As you all know, this method doesn't account for the actual scores, their mean, or their spread. If I have 150 subjects that score about the same on a risk, except for the top 3, which score very high, I don't want to investigate the top 10% (15 subjects), but only the top 3, right? Therefore we are now looking for a method better than percentiles to determine the 'highest scorers'. FYI, we don't bother with the lowest scorers (yet).

One method we are considering is, of course, the mean + X*standard deviations. Subjects who fall above our predefined upper bound, with X = 1.28 (upper 10%), 1.64 (upper 5%), or 1.96 (upper 2.5%) standard deviations above the mean, would be flagged as 'to be further analysed'. So far, so good: this looks to me like a solid method to determine the highest scorers while taking into account the mean and the spread in the data.

As the title suggests, though, most of the 'risks' we analyse contain data that is not normally distributed. That still doesn't have to be a problem, since data can be transformed to become (more) normally distributed. I know of different kinds of transformation from Tukey's ladder of transformations. After transforming, I can calculate the mean and the upper bound, and then back-transform to the original scale to determine my subjects of interest. This sounds like a solid method to me, right? Transformation --> calculation --> back-transformation is, as far as I know, standard practice in science.

Now we come to the main questions of this discussion: how do I determine whether my data is normally distributed after transformation? And what if my data is still not (perfectly) normally distributed even after transformation? Can I still use the mean + X*standard deviation method to determine which subjects to further analyse?

I have attached 3 files to show how I determine whether my data is (somewhat) normally distributed. The data used to generate the 3 files is already ln-transformed (log(x) in SAS, not log10(x)). The first picture shows the output of the UNIVARIATE procedure, in which I look at several things:

- Skewness and kurtosis: if skewness is between -1 and +1, it suggests to me a normal distribution; if kurtosis is < 1, it suggests to me a normal distribution.
- Mean and median: if the mean is approximately the same as the median, it suggests to me a normal distribution.
- Tests for normality: if the tests are NOT significant, it suggests to me a normal distribution.

Then I look at the histogram: if it 'looks like' a normal distribution, it suggests to me a normal distribution. Finally I look at the QQ plot: if it is almost a straight line, it suggests to me a normal distribution.
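For reference, here is a minimal SAS sketch of these checks, assuming a dataset WORK.SCORES with one row per subject and a raw score variable SCORE between 0 and 1 (the dataset and variable names are placeholders, not our real data):

    /* ln-transform the raw scores; log() in SAS is the
       natural logarithm, not log10() */
    data scores_ln;
       set scores;
       score_ln = log(score);
    run;

    /* The NORMAL option prints skewness, kurtosis, mean vs. median,
       and the formal tests of normality (Shapiro-Wilk etc.);
       HISTOGRAM and QQPLOT overlay a fitted normal for the
       visual checks. */
    proc univariate data=scores_ln normal;
       var score_ln;
       histogram score_ln / normal;
       qqplot score_ln / normal(mu=est sigma=est);
    run;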
Based on the files I attached, I would decide that my ln-transformed data is distributed normally enough to do my mean + X*sd calculations (a sketch of that calculation is at the end of this post). I realize that most of the judgements I make are 'arbitrary', except for the tests of normality. The only non-arbitrary measures (the tests for normality) reject the hypothesis of normally distributed data, and still I would conclude that my data is normally distributed enough, based on what I 'see'. Hence I am here asking for help on the matter.

1) Is my method of determining normality of the transformed data appropriate? If not, how can I best judge whether my data is normally distributed?
2) What other transformations or tricks are there to get normally distributed data, if the 'regular' methods don't work?
3) If I still do mean + X*sd calculations on data that is not perfectly normal, what are the consequences? And specifically, what are the consequences for my initial goal, i.e. determining the high scorers / subjects of interest?

Finally, do I REALLY require (perfectly) normally distributed data in order to select my 'subjects of interest' with the mean + X*sd method?
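For concreteness, here is a minimal SAS sketch of the mean + X*sd selection I have in mind, continuing from the SCORES_LN dataset above. X = 1.645 (upper 5%) is just an example value, and again all names are placeholders:

    /* compute mean and standard deviation on the ln scale */
    proc means data=scores_ln noprint;
       var score_ln;
       output out=stats(drop=_type_ _freq_) mean=mu std=sd;
    run;

    /* flag subjects above mean + X*sd; the cutoff is also
       back-transformed to the original 0-1 scale */
    data flagged;
       if _n_ = 1 then set stats;          /* make mu and sd available on every row */
       set scores_ln;
       cutoff_ln = mu + 1.645*sd;          /* upper bound on the ln scale */
       cutoff    = exp(cutoff_ln);         /* back-transformed cutoff, original scale */
       flag      = (score_ln > cutoff_ln); /* 1 = subject of interest */
    run;

    /* list the flagged subjects of interest */
    proc print data=flagged;
       where flag = 1;
       var score score_ln cutoff;
    run;

Since exp() is monotone, flagging on the ln scale against CUTOFF_LN is equivalent to comparing the raw score with the back-transformed CUTOFF.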