AndersS
Pyrite | Level 9

Hi! I am working on a paper about calculating quantiles in a fast way. It is easy to calculate different sets of quantiles, like q25, q50, q75 or the 9 quantiles (q1, q5, q10, q25, q50, q75, q90, q95, q99), and also the full set of 99 quantiles (q1, q2, q3, ..., q99).


My first question: How many quantiles, and which ones, are of interest to the user?
How are quantile values like q1, q5, q10 (and q90, q95, q99) used?

(Note: the step from 3 to 99 quantiles is almost without any cost.)

My second question: What accuracy is needed and wanted, and for which quantiles?

Is the estimated Xp value (the quantile on the x-axis) important, or the p-value (on the CCDF, the percentage value on the y-axis) of the estimated Xp value?
  

 

 

Background: The calculated quantiles are usually more exact around the median, while q1, q5 and q95, q99 are less exact.

The median q50 and similar are more exact, since most values fall between q25 and q75, with a peak around q50. A small change (error) delta in q50 corresponds to a rather big change epsilon in the Calculated Cumulative Distribution Function (CCDF), since the CCDF curve is very steep around q50.

The calculations of q1, q5, q10 and q90, q95, q99 are less exact, since there are fewer data values at the extreme ends. A large change (error) delta in q1, q5, q10 (or q90, q95, q99) corresponds to a rather small change epsilon in the CCDF, since the CCDF curve is very flat at the extreme ends. All calculations are made on the Calculated Probability Distribution Function (CPDF) and the CCDF, when trying to get information to describe the Experimental Distribution Function (ECDF).
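
The delta/epsilon relationship described above can be sketched numerically. This is only an illustration: it assumes a standard normal distribution (an assumption made for the example, since no distribution is named at this point), and the function names are hypothetical. To first order, a vertical error epsilon in the CDF corresponds to a horizontal error delta of roughly epsilon / f(x_p), where f is the density at the quantile.

```python
from statistics import NormalDist

# Assumption for illustration only: a standard normal distribution.
STD_NORMAL = NormalDist()


def quantile_shift_for_cdf_error(p, epsilon, dist=STD_NORMAL):
    """Horizontal shift delta such that F(x_p + delta) = p + epsilon."""
    return dist.inv_cdf(p + epsilon) - dist.inv_cdf(p)


def linearized_shift(p, epsilon, dist=STD_NORMAL):
    """First-order approximation: delta is about epsilon / f(x_p)."""
    x_p = dist.inv_cdf(p)
    return epsilon / dist.pdf(x_p)


if __name__ == "__main__":
    eps = 0.001  # the same vertical (CDF) error at every quantile
    for p in (0.01, 0.05, 0.25, 0.50, 0.75, 0.95, 0.99):
        exact = quantile_shift_for_cdf_error(p, eps)
        approx = linearized_shift(p, eps)
        print(f"p={p:.2f}  delta exact={exact:.5f}  delta approx={approx:.5f}")
```

For this assumed distribution, the same epsilon gives a much larger delta at q99 than at q50, which is the flat-versus-steep effect described above. Other distributions can behave differently.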

 

I hope that you understand my questions!

    /Br Anders

Anders Sköllermo (Skollermo in English)
8 REPLIES
ballardw
Super User

I suggest getting a good reference and reading. Then work out some problems by hand.

Whether any specific value is of interest depends on why you need it, and that varies by research question.

 

I am not sure what you mean by "The calculated quantiles are usually more exact around the median" as quantiles are order statistics. Place the values in order and the appropriate nth value (or a tie breaker rule applied) gets the value. Which quantile is of interest has no real effect on how "exact" the quantile may be.
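
The order-statistic view can be sketched in a few lines. This is a minimal illustration using a nearest-rank rule (one of several possible tie-breaker rules, chosen here as an assumption); the function name is hypothetical.

```python
import math


def order_statistic_quantile(values, p):
    """Smallest data value x such that at least a fraction p of the data is <= x."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    ordered = sorted(values)
    # 1-based rank; the small tolerance guards against floating-point
    # noise when p * n should be an exact integer.
    rank = max(1, math.ceil(p * len(ordered) - 1e-9))
    return ordered[rank - 1]


if __name__ == "__main__":
    data = [7, 1, 5, 3, 9]
    for p in (0.2, 0.5, 0.99):
        print(p, order_statistic_quantile(data, p))
```

Given the data, the result is fully determined by the sort order and the rank rule, whichever quantile is requested.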

 

Your statement about where values lie only has application to some distributions of data. A basic exponential distribution may have "most" of the values at one tail or another. A uniform distribution has all values equally likely so no clustering around the median at all.

 

 

AndersS
Pyrite | Level 9

Hi! Note that my questions are not easy questions. I have made lots and lots of calculations on "experimental" values.

Anders Sköllermo (Skollermo in English)
WeiChen
Obsidian | Level 7

@ballardw I think the OP meant that the CI for a quantile is often wider for extreme quantiles than for quantiles near the median. But as you and others have mentioned, it depends on the distribution.
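
To illustrate how this depends on the distribution, here is a rough sketch based on the standard large-sample approximation for the standard error of a sample quantile, sqrt(p(1-p)/n) / f(x_p), where f is the density at the true quantile. The three distributions and all function names are assumptions chosen for the example.

```python
import math
from statistics import NormalDist


def asymptotic_se(p, n, density_at_quantile):
    """Large-sample standard error of the sample p-quantile."""
    return math.sqrt(p * (1 - p) / n) / density_at_quantile


def density_normal(p):
    """Standard normal density evaluated at its own p-quantile."""
    nd = NormalDist()
    return nd.pdf(nd.inv_cdf(p))


def density_uniform(p):
    """Uniform(0, 1): the density is 1 everywhere."""
    return 1.0


def density_exponential(p):
    """Exponential(1): x_p = -ln(1 - p), so f(x_p) = exp(-x_p) = 1 - p."""
    return 1.0 - p


if __name__ == "__main__":
    n = 10_000
    for name, dens in (("normal", density_normal),
                       ("uniform", density_uniform),
                       ("exponential", density_exponential)):
        row = "  ".join(f"q{round(p * 100):02d}={asymptotic_se(p, n, dens(p)):.4f}"
                        for p in (0.01, 0.50, 0.99))
        print(f"{name:12s} {row}")
```

Under this approximation the normal case is most precise at the median, the uniform case is least precise at the median, and the exponential case is most precise in its lower tail.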

 

@AndersS The book by Hahn and Meeker (1991) has lots of information about confidence intervals, including for quantiles, I think. The SAS doc for PROC UNIVARIATE discusses CIs for quantiles. If you assume the data are normal, you can use the CIPCTLNORMAL option to estimate CIs. In that situation (normal data), it is true that the CIs are narrower for the median than for extreme quantiles. See the example https://go.documentation.sas.com/doc/en/pgmsascdc/9.4_3.5/procstat/procstat_univariate_examples10.ht...

When you don't know the distribution of the data, you can use the CIPCTLDF option to get distribution-free intervals. The formulas are at https://go.documentation.sas.com/doc/en/pgmsascdc/9.4_3.5/procstat/procstat_univariate_details14.htm...
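
The general idea behind distribution-free intervals can be sketched with order statistics and the binomial distribution. This is not claimed to be the exact formula SAS uses (see the linked documentation for that); it is a simple search under the assumption of a continuous distribution, with hypothetical function names.

```python
import math


def binom_pmf(k, n, p):
    """P(B = k) for B ~ Binomial(n, p)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def coverage(n, p, lower_rank, upper_rank):
    """P(X_(lower) <= x_p < X_(upper)) = P(lower <= B <= upper - 1)."""
    return sum(binom_pmf(k, n, p) for k in range(lower_rank, upper_rank))


def distribution_free_ranks(n, p, alpha=0.05):
    """Widen a pair of 1-based ranks around n*p until coverage >= 1 - alpha.

    Returns (lower_rank, upper_rank), or None when even the full range of
    the sample cannot reach the requested coverage.
    """
    center = math.floor(n * p)
    lower = min(max(center, 1), n)
    upper = min(max(center + 1, 1), n)
    while coverage(n, p, lower, upper) < 1 - alpha:
        if lower == 1 and upper == n:
            return None  # not enough data for this quantile
        lower = max(lower - 1, 1)
        upper = min(upper + 1, n)
    return lower, upper


if __name__ == "__main__":
    for n, p in ((10, 0.5), (10, 0.99), (1000, 0.99)):
        print(n, p, distribution_free_ranks(n, p))
```

The interval is then formed from the order statistics at the returned ranks. For extreme quantiles with small n, no pair of ranks reaches the target coverage, which is one concrete sense in which tail quantiles are harder to pin down.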

 

Regarding "which quantiles are useful," it is true that many works report 5%, 10%, 25%, 50%, 75%, 90%, and 95%. However, this blog shows that for bootstrap computations you may need 2.5% and 97.5%, too: https://blogs.sas.com/content/iml/2016/08/10/bootstrap-confidence-interval-sas.html
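
As a rough sketch of why the 2.5% and 97.5% quantiles come up, here is a percentile bootstrap interval for the mean. It is not the code from the linked blog; the data, the seed, and the function name are made up for illustration.

```python
import random
import statistics


def bootstrap_percentile_ci(data, stat=statistics.mean, n_boot=2000, seed=12345):
    """95% percentile bootstrap interval: the 2.5% and 97.5% quantiles
    of the statistic computed on resamples drawn with replacement."""
    rng = random.Random(seed)
    n = len(data)
    boot_stats = [stat(rng.choices(data, k=n)) for _ in range(n_boot)]
    # n=40 gives cut points at 2.5%, 5%, ..., 97.5%.
    cuts = statistics.quantiles(boot_stats, n=40)
    return cuts[0], cuts[-1]


if __name__ == "__main__":
    sample = [float(x) for x in range(1, 101)]  # hypothetical data
    low, high = bootstrap_percentile_ci(sample)
    print(f"mean={statistics.mean(sample):.2f}  95% CI=({low:.2f}, {high:.2f})")
```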

 

The blog at https://blogs.sas.com/content/iml/2017/05/24/definitions-sample-quantiles.html describes many ways to compute quantiles. They are all similar: form the ECDF, use it to estimate the CDF (stepwise, piecewise, cubic spline, etc.), and then use inverse interpolation to estimate the quantiles. As you say, the slope of the ECDF determines the accuracy of the estimate, but a big slope can be anywhere, not just at the median.
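
A minimal sketch of that recipe, using linear interpolation between order statistics at position (n - 1) * p. This is just one of the many definitions (not necessarily the SAS default), and the function names are hypothetical.

```python
def ecdf(values, x):
    """Empirical CDF: fraction of the data that is <= x."""
    return sum(1 for v in values if v <= x) / len(values)


def interpolated_quantile(values, p):
    """Estimate the p-quantile by linear interpolation between the two
    order statistics that bracket the 0-based position (n - 1) * p."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 <= p <= 1:
        raise ValueError("p must be in [0, 1]")
    ordered = sorted(values)
    n = len(ordered)
    h = (n - 1) * p
    lo = int(h)                # floor, since h >= 0
    hi = min(lo + 1, n - 1)    # clamp at the largest order statistic
    frac = h - lo
    return ordered[lo] + frac * (ordered[hi] - ordered[lo])


if __name__ == "__main__":
    data = [4, 1, 3, 2]
    print(ecdf(data, 2))
    print(interpolated_quantile(data, 0.5))
```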

 

 

AndersS
Pyrite | Level 9

Hi! Very good answer. I will read it carefully. /Br Anders

Anders Sköllermo (Skollermo in English)
PaigeMiller
Diamond | Level 26

"The calculated quantiles are usually more exact around the median, while q1, q5 and q95, q99 are less exact."


I don't know what "exact" means in any statistical sense. Are you talking about accuracy or precision or both or neither?

 

"The median q50 and similar are more exact, since most values fall between q25 and q75, with a peak around q50."


Same comment about "exact". "Peak" of what? Also, there are all sorts of distributions, not all have the same properties around q50. I would imagine a normal distribution and a bi-modal distribution have different properties around q50.

--
Paige Miller
Reeza
Super User

@AndersS wrote:

 

 

Background: The calculated quantiles are usually more exact around the median, while q1, q5 and q95, q99 are less exact.

The median q50 and similar are more exact, since most values fall between q25 and q75, with a peak around q50. A small change (error) delta in q50 corresponds to a rather big change epsilon in the Calculated Cumulative Distribution Function (CCDF), since the CCDF curve is very steep around q50.

The calculations of q1, q5, q10 and q90, q95, q99 are less exact, since there are fewer data values at the extreme ends. A large change (error) delta in q1, q5, q10 (or q90, q95, q99) corresponds to a rather small change epsilon in the CCDF, since the CCDF curve is very flat at the extreme ends. All calculations are made on the Calculated Probability Distribution Function (CPDF) and the CCDF, when trying to get information to describe the Experimental Distribution Function (ECDF).

 

I hope that you understand my questions!

    /Br Anders


What are the assumptions of this statement? A normal distribution?

I would assume it's distribution dependent and a statement like that would depend highly on the Number of Observations and distribution. As N gets large, much of this isn't true. 

AndersS
Pyrite | Level 9

Hi! I have used the Gumbel distribution (the normal was "too easy"). About 5 million records.
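
For readers who want to try a similar experiment, here is a small sketch. It is not the calculation described in this thread: the location mu = 0, scale beta = 1, sample size, seed, and function names are all assumptions for illustration. It uses the Gumbel quantile function x_p = mu - beta * ln(-ln p) both to simulate data (inverse transform) and to compare against sample quantiles.

```python
import math
import random


def gumbel_quantile(p, mu=0.0, beta=1.0):
    """True p-quantile of a Gumbel(mu, beta) distribution, for 0 < p < 1."""
    return mu - beta * math.log(-math.log(p))


def simulate_gumbel(n, mu=0.0, beta=1.0, seed=2024):
    """Draw n Gumbel values by inverse-transform sampling."""
    rng = random.Random(seed)
    out = []
    while len(out) < n:
        u = rng.random()  # in [0, 1); skip 0 to avoid log(0)
        if u > 0.0:
            out.append(gumbel_quantile(u, mu, beta))
    return out


def sample_quantile(ordered, p):
    """Nearest-rank sample quantile of an already sorted list."""
    rank = max(1, math.ceil(p * len(ordered)))
    return ordered[rank - 1]


if __name__ == "__main__":
    xs = sorted(simulate_gumbel(200_000))  # far fewer than 5 million records
    for p in (0.01, 0.05, 0.10, 0.25, 0.50, 0.75, 0.90, 0.95, 0.99):
        est, true = sample_quantile(xs, p), gumbel_quantile(p)
        print(f"q{round(p * 100):02d}  estimate={est:8.4f}  true={true:8.4f}  "
              f"error={est - true:+.4f}")
```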

Anders Sköllermo (Skollermo in English)
Reeza
Super User

Ok, that's a pretty important point to include. This is a distribution of extreme values which would be expected to be unstable.

This is a user-based question though, not a statistical question. So who is likely to be using these distributions, and where can you find them to ask the question?
