Incremental Median Calculation

04-25-2011 01:46 PM

All,

I'm writing a program where I'm attempting to compute several descriptive statistics with only a single pass through the data. Example statistics are mean, stddev, frequency counts, etc. I am, of course, using DATA Step for this program.

With most stats, I can use sums or at worst a hash table that stores each unique value along with a count. Is there a calculation formula I can use to incrementally calculate the median (or for that matter, any percentile) without having every unique value available in memory or in a table?

Sorry if this is a VERY naive question - I'm more of a programmer than a statistician.

Thanks,

Tim Stearn

04-25-2011 02:12 PM

Hello Tim,

A standard way of calculating descriptive statistics (including percentiles) is PROC UNIVARIATE. What do you mean by "to incrementally calculate the median"?

Sincerely,

SPR

04-25-2011 02:37 PM

Tim,

I, too, am not sure what you mean by incrementally. The calculation of percentiles is dependent upon whether you have an odd or even number of cases. In both cases, the calculation requires that your data be sorted by the values of the variable you are trying to calculate.

In the case of the median, or 50th percentile, for an odd number of cases it is simply the value of the variable at the int(n*.5)+1 th record. For an even number of cases, it is the average of the n*.5 th and n*.5+1 th records.

HTH,

Art
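
As a minimal sketch of the position-based median on already-sorted data (written in Python rather than SAS purely for illustration; the function name `median_of_sorted` is hypothetical): an odd count takes the single middle value, and an even count averages the two middle values.

```python
def median_of_sorted(sorted_values):
    """Median of a list that is already sorted in ascending order."""
    n = len(sorted_values)
    if n == 0:
        raise ValueError("median of an empty list is undefined")
    mid = n // 2  # 0-based index of the record at 1-based position n//2 + 1
    if n % 2 == 1:
        # Odd n: the single middle record.
        return sorted_values[mid]
    # Even n: average of the records at 1-based positions n/2 and n/2 + 1.
    return (sorted_values[mid - 1] + sorted_values[mid]) / 2
```

For example, `median_of_sorted([1, 3, 5])` picks the middle record, while `median_of_sorted([1, 3, 5, 7])` averages 3 and 5.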

04-25-2011 02:50 PM

> Is there a calculation formula I can use to
> incrementally calculate the median (or for that
> matter, any percentile) without having every unique
> value available in memory or in a table?

A straightforward way to calculate the median in this way seems to be to maintain an ordered data store (e.g., an array or a hash) of half the total number of observations.

A StackOverflow article concurs.
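
One way to read the "half the observations" idea, sketched in Python rather than SAS for illustration: if the total count `n` is known before the pass starts (an assumption; the thread does not say so), a bounded max-heap can keep only the `n // 2 + 1` smallest values seen so far. The function name `median_half_store` is hypothetical, and this is one possible interpretation, not necessarily the approach in the linked article.

```python
import heapq


def median_half_store(values, n):
    """Median of an iterable of n numbers, storing only n // 2 + 1 of them.

    Assumes n (the number of observations) is known in advance.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    k = n // 2 + 1
    heap = []  # max-heap via negation: the k smallest values seen so far
    for v in values:
        if len(heap) < k:
            heapq.heappush(heap, -v)
        elif v < -heap[0]:
            # v is smaller than the largest stored value, so it replaces it.
            heapq.heapreplace(heap, -v)
    upper = -heapq.heappop(heap)  # value at sorted position n // 2 + 1
    if n % 2 == 1:
        return upper
    lower = -heap[0]  # value at sorted position n // 2
    return (lower + upper) / 2
```

The pass is still a single read of the data, but memory stays proportional to half of `n`, so this trades storage for exactness rather than avoiding storage altogether.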

04-25-2011 03:43 PM

Hi All,

Thanks for the responses. It appears I confused people by asking about "incrementally" calculating the median. The idea was to avoid the use of PROCs since I need to generate metrics that either would require multiple SAS PROCs or DATA Step code. Instead of doing 3 or more passes on the data, I'd like to try to do it in a single pass using one DATA Step.

I can certainly hold all (or perhaps 1/2) of the unique values in memory and do a calculation on those values at the end of the DATA Step but was wondering if there was another option.

Thanks,

Tim Stearn

04-25-2011 04:26 PM

@Tim: You did not confuse me. The answer is: no.

04-25-2011 06:05 PM

Tim,

If you are already calculating frequency counts, why not find the median from those counts? In one pass you can compute the frequency counts, sum, n and sum of squared values. From those you should be able to obtain everything you want in one pass of your data, and a second pass of just the frequency counts.

Art
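
The suggestion above can be sketched as follows, in Python rather than SAS for illustration. One pass over the data accumulates the frequency counts, `n`, the sum and the sum of squares; the median then comes from a second pass over the (usually much smaller) table of counts. The names `one_pass_stats` and `value_at` are hypothetical, and the sketch assumes the distinct values fit in memory.

```python
import math
from collections import Counter


def value_at(counts, position):
    """Value at a 1-based sorted position, found from cumulative counts."""
    cumulative = 0
    for value in sorted(counts):
        cumulative += counts[value]
        if cumulative >= position:
            return value
    raise ValueError("position exceeds the number of observations")


def one_pass_stats(values):
    counts = Counter()
    n = 0
    total = 0.0
    total_sq = 0.0
    for v in values:  # the single pass over the raw data
        counts[v] += 1
        n += 1
        total += v
        total_sq += v * v
    if n == 0:
        raise ValueError("no observations")
    mean = total / n
    if n > 1:
        # Sum-of-squares shortcut for the sample variance; it can lose
        # precision when the mean is large relative to the spread.
        variance = (total_sq - n * mean * mean) / (n - 1)
        std = math.sqrt(max(variance, 0.0))
    else:
        std = float("nan")
    # Second pass, over the frequency table only.
    lower = value_at(counts, (n + 1) // 2)
    upper = value_at(counts, n // 2 + 1)
    return {"n": n, "mean": mean, "std": std,
            "median": (lower + upper) / 2, "counts": counts}
```

Any other percentile can be read from the same cumulative counts by changing the target position, which is why the frequency table is enough once the raw data has been read.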

04-26-2011 02:05 PM

> Hi All,
>
> Thanks for the responses. It appears I confused
> people by asking about "incrementally" calculating
> the median. The idea was to avoid the use of PROCs
> since I need to generate metrics that either would
> require multiple SAS PROCs or DATA Step code.
> Instead of doing 3 or more passes on the data, I'd
> like to try to do it in a single pass using one DATA
> Step.
>
> I can certainly hold all (or perhaps 1/2) of the
> unique values in memory and do a calculation on those
> values at the end of the DATA Step but was wondering
> if there was another option.
>
> Thanks,
>
> Tim Stearn

By any chance did you mean:

generate metrics that either would NOT require multiple SAS PROCs or DATA Step code

As mentioned above, a single pass through PROC MEANS or SUMMARY would probably be the most efficient both in time of code development and execution time.

04-25-2011 09:16 PM

Hi.

Tim, I am too, even though I know some very basic statistics.

For your question: you can use PROC MEANS to calculate all the statistics you need, including the median. You can create a macro to dynamically calculate the median of the incremental data. A hash table is not very flexible for calculating these statistics, other than sums.

So for your situation, I highly recommend PROC RANK, which ranks the data from min to max. For example, if you have one hundred numbers, PROC RANK assigns them ranks 1 to 100, and the median (in other words, the 50th percentile) is the average of the values ranked 50th and 51st.

Hope this helps.

Ksharp
