DATA Step, Macro, Functions and more

Incremental Median Calculation

Reply
SAS Employee
Posts: 51

Incremental Median Calculation

All,

I'm writing a program where I'm attempting to compute several descriptive statistics with only a single pass through the data. Example statistics are mean, stddev, frequency counts, etc. I am, of course, using DATA Step for this programSmiley Wink.

With most stats, I can use sums or at worst a hash table that stores each unique value along with a count. Is there a calculation formula I can use to incrementally calculate the median (or for that matter, any percentile) without having every unique value available in memory or in a table?

Sorry if this is a VERY naive question - I'm more of a programmer than a statisticianSmiley Happy.

Thanks,

Tim Stearn
Super Contributor
Super Contributor
Posts: 365

Re: Incremental Median Calculation

Posted in reply to TimStearn_SASProductManagement_
Hello Tim,

A standard way of calculating descriptive statistics (including percentiles) is proc UNIVARIATE. What do you mean saying "to incrementally calculate the median"?

Sincerely,
SPR
PROC Star
Posts: 7,467

Re: Incremental Median Calculation

Posted in reply to TimStearn_SASProductManagement_
Tim,

I, too, am not sure what you mean by incrementally. The calculation of percentiles is dependent upon whether you have an odd or even number of cases. In both cases, the calculation requires that your data be sorted by the values of the variable you are trying to calculate.

In the case of the median, or 50th percentile, for an even number of cases it is simply the value of the variable at the n*.5 th record. For an odd number of cases, it is the average of the int(n*.5) and int(n*.5)+1 records.

HTH,
Art
Regular Contributor
Posts: 241

Re: Incremental Median Calculation

Posted in reply to TimStearn_SASProductManagement_
> Is there a calculation formula I can use to

> incrementally calculate the median (or for that

> matter, any percentile) without having every unique

> value available in memory or in a table?



A straight forward way to calculate the median in this way seems to be: to maintain an ordered data store (i.e., an array or a hash) of half the total number of observations.



A StackOverflow article concurs.
SAS Employee
Posts: 51

Re: Incremental Median Calculation

Posted in reply to chang_y_chung_hotmail_com
Hi All,

Thanks for the responses. It appears I confused people by asking about "incrementally" calculating the median. The idea was to avoid the use of PROCs since I need to generate metrics that either would require multiple SAS PROCs or DATA Step code. Instead of doing 3 or more passes on the data, I'd like to try to do it in a single pass using one DATA Step.

I can certainly hold all (or perhaps 1/2) of the unique values in memory and do a calculation on those values at the end of the DATA Step but was wondering if there was another option.

Thanks,

Tim Stearn
Regular Contributor
Posts: 241

Re: Incremental Median Calculation

Posted in reply to TimStearn_SASProductManagement_
@Tim: You did not confuse me. The answer is: no.
PROC Star
Posts: 7,467

Re: Incremental Median Calculation

Posted in reply to chang_y_chung_hotmail_com
Tim,

If you are already calculating frequency counts, why not find the median from those counts? In one pass you can compute the frequency counts, sum, n and sum of squared values. From those you should be able to obtain everything you want in one pass of your data, and a second pass of justs the frequency counts.

Art
Super User
Posts: 11,343

Re: Incremental Median Calculation

Posted in reply to TimStearn_SASProductManagement_
> Hi All,
>
> Thanks for the responses. It appears I confused
> people by asking about "incrementally" calculating
> the median. The idea was to avoid the use of PROCs
> since I need to generate metrics that either would
> require multiple SAS PROCs or DATA Step code.
> Instead of doing 3 or more passes on the data, I'd
> like to try to do it in a single pass using one DATA
> Step.
>
> I can certainly hold all (or perhaps 1/2) of the
> unique values in memory and do a calculation on those
> values at the end of the DATA Step but was wondering
> if there was another option.
>
> Thanks,
>
> Tim Stearn

By any chance did you mean:
generate metrics that either would NOT require multiple SAS PROCs or DATA Step code

As mentioned above, a single pass through PROC MEANS or SUMMARY would probably be the most efficient both in time of code development and execution time.
Super User
Posts: 10,018

Re: Incremental Median Calculation

Posted in reply to TimStearn_SASProductManagement_
Hi.
Tim .I am too,Even though I know some very basic statistical knowledge.
For you question.You can use proc mean to calculated all the statistical estimator you need, include median.You can create a macro to dynamically calculate the incremental data 's median.Hash table is not very flexibility to calculate these estimator exclude sum.

So for your situation,I recommend highly to use proc rank, which flag the data from min to max with rank. EX: You have one hundred number, using proc rank to flag them, the fiftieth rank will be the number you look for,namely median or in other words fifty percentile.
Hope these can help you some.


Ksharp
Ask a Question
Discussion stats
  • 8 replies
  • 1067 views
  • 0 likes
  • 6 in conversation