BookmarkSubscribeRSS Feed
All,

I'm writing a program where I'm attempting to compute several descriptive statistics with only a single pass through the data. Example statistics are mean, stddev, frequency counts, etc. I am, of course, using DATA Step for this program;).

With most stats, I can use sums or at worst a hash table that stores each unique value along with a count. Is there a calculation formula I can use to incrementally calculate the median (or for that matter, any percentile) without having every unique value available in memory or in a table?

Sorry if this is a VERY naive question - I'm more of a programmer than a statistician:).

Thanks,

Tim Stearn
8 REPLIES 8
SPR
Quartz | Level 8 SPR
Quartz | Level 8
Hello Tim,

A standard way of calculating descriptive statistics (including percentiles) is proc UNIVARIATE. What do you mean saying "to incrementally calculate the median"?

Sincerely,
SPR
art297
Opal | Level 21
Tim,

I, too, am not sure what you mean by incrementally. The calculation of percentiles is dependent upon whether you have an odd or even number of cases. In both cases, the calculation requires that your data be sorted by the values of the variable you are trying to calculate.

In the case of the median, or 50th percentile, for an even number of cases it is simply the value of the variable at the n*.5 th record. For an odd number of cases, it is the average of the int(n*.5) and int(n*.5)+1 records.

HTH,
Art
chang_y_chung_hotmail_com
Obsidian | Level 7
> Is there a calculation formula I can use to

> incrementally calculate the median (or for that

> matter, any percentile) without having every unique

> value available in memory or in a table?



A straight forward way to calculate the median in this way seems to be: to maintain an ordered data store (i.e., an array or a hash) of half the total number of observations.



A StackOverflow article concurs.
Hi All,

Thanks for the responses. It appears I confused people by asking about "incrementally" calculating the median. The idea was to avoid the use of PROCs since I need to generate metrics that either would require multiple SAS PROCs or DATA Step code. Instead of doing 3 or more passes on the data, I'd like to try to do it in a single pass using one DATA Step.

I can certainly hold all (or perhaps 1/2) of the unique values in memory and do a calculation on those values at the end of the DATA Step but was wondering if there was another option.

Thanks,

Tim Stearn
chang_y_chung_hotmail_com
Obsidian | Level 7
@Tim: You did not confuse me. The answer is: no.
art297
Opal | Level 21
Tim,

If you are already calculating frequency counts, why not find the median from those counts? In one pass you can compute the frequency counts, sum, n and sum of squared values. From those you should be able to obtain everything you want in one pass of your data, and a second pass of justs the frequency counts.

Art
ballardw
Super User
> Hi All,
>
> Thanks for the responses. It appears I confused
> people by asking about "incrementally" calculating
> the median. The idea was to avoid the use of PROCs
> since I need to generate metrics that either would
> require multiple SAS PROCs or DATA Step code.
> Instead of doing 3 or more passes on the data, I'd
> like to try to do it in a single pass using one DATA
> Step.
>
> I can certainly hold all (or perhaps 1/2) of the
> unique values in memory and do a calculation on those
> values at the end of the DATA Step but was wondering
> if there was another option.
>
> Thanks,
>
> Tim Stearn

By any chance did you mean:
generate metrics that either would NOT require multiple SAS PROCs or DATA Step code

As mentioned above, a single pass through PROC MEANS or SUMMARY would probably be the most efficient both in time of code development and execution time.
Ksharp
Super User
Hi.
Tim .I am too,Even though I know some very basic statistical knowledge.
For you question.You can use proc mean to calculated all the statistical estimator you need, include median.You can create a macro to dynamically calculate the incremental data 's median.Hash table is not very flexibility to calculate these estimator exclude sum.

So for your situation,I recommend highly to use proc rank, which flag the data from min to max with rank. EX: You have one hundred number, using proc rank to flag them, the fiftieth rank will be the number you look for,namely median or in other words fifty percentile.
Hope these can help you some.


Ksharp

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 8 replies
  • 3912 views
  • 0 likes
  • 6 in conversation