I'm writing a program where I'm attempting to compute several descriptive statistics with only a single pass through the data. Example statistics are mean, stddev, frequency counts, etc. I am, of course, using DATA Step for this program.
With most stats, I can use sums or at worst a hash table that stores each unique value along with a count. Is there a calculation formula I can use to incrementally calculate the median (or for that matter, any percentile) without having every unique value available in memory or in a table?
Sorry if this is a VERY naive question - I'm more of a programmer than a statistician.
I, too, am not sure what you mean by incrementally. The calculation of percentiles is dependent upon whether you have an odd or even number of cases. In both cases, the calculation requires that your data be sorted by the values of the variable you are trying to calculate.
In the case of the median, or 50th percentile, for an even number of cases it is simply the value of the variable at the n*.5 th record. For an odd number of cases, it is the average of the int(n*.5) and int(n*.5)+1 records.
Thanks for the responses. It appears I confused people by asking about "incrementally" calculating the median. The idea was to avoid the use of PROCs since I need to generate metrics that either would require multiple SAS PROCs or DATA Step code. Instead of doing 3 or more passes on the data, I'd like to try to do it in a single pass using one DATA Step.
I can certainly hold all (or perhaps 1/2) of the unique values in memory and do a calculation on those values at the end of the DATA Step but was wondering if there was another option.
If you are already calculating frequency counts, why not find the median from those counts? In one pass you can compute the frequency counts, sum, n and sum of squared values. From those you should be able to obtain everything you want in one pass of your data, and a second pass of justs the frequency counts.
> Hi All,
> Thanks for the responses. It appears I confused
> people by asking about "incrementally" calculating
> the median. The idea was to avoid the use of PROCs
> since I need to generate metrics that either would
> require multiple SAS PROCs or DATA Step code.
> Instead of doing 3 or more passes on the data, I'd
> like to try to do it in a single pass using one DATA
> I can certainly hold all (or perhaps 1/2) of the
> unique values in memory and do a calculation on those
> values at the end of the DATA Step but was wondering
> if there was another option.
> Tim Stearn
By any chance did you mean:
generate metrics that either would NOT require multiple SAS PROCs or DATA Step code
As mentioned above, a single pass through PROC MEANS or SUMMARY would probably be the most efficient both in time of code development and execution time.
Tim .I am too,Even though I know some very basic statistical knowledge.
For you question.You can use proc mean to calculated all the statistical estimator you need, include median.You can create a macro to dynamically calculate the incremental data 's median.Hash table is not very flexibility to calculate these estimator exclude sum.
So for your situation,I recommend highly to use proc rank, which flag the data from min to max with rank. EX: You have one hundred number, using proc rank to flag them, the fiftieth rank will be the number you look for,namely median or in other words fifty percentile.
Hope these can help you some.