
Fast and Exact Computation of All Quantiles for Very Large Amounts of Data

Started ‎09-13-2023 by
Modified ‎06-02-2025 by
Views 6,894


Area: Quantiles, Exact values, Computational and Descriptive Statistics, Large Data.
Author: Anders Sköllermo Ph.D., Email: anders.skollermo@one.se   Date: 2025-June-02 (2025-06-02)

 

The present paper (dated 2025 June 02) is an extension of the paper submitted in 2023 September (with 5,528 views).

We hope that the current paper will better help users solve their problems with (count) quantiles, outliers, and other types of quantiles, such as sum quantiles.

 

Abstract: We present a “toolbox” with fast and robust methods for computing all specified quantiles exactly, for very large amounts of data values. We have computed 15 quantiles on up to 2’000 million data values, on a Windows PC where “Word is a bit slow”. Only a little computer memory is used. The basic idea of the method is simple: “Sort as few data values as possible.”
        We divide the region of interest into many intervals and call them “slots”. Each interval has a number, a “slotindex”, and a “counter”. In a loop we read a data value, compute its slotindex, increase the counter for that slot by one, and continue with reading the next data value.

        After reading all data values we have the information in all the counters. This gives a good overall picture of the distribution of the data values. For each quantile p-value, it is easy to find the slot which contains that quantile. We save this slotindex as the “quantile slotnumber”.
       In a second loop we again read each data value and recompute the slotindex. If this equals one of the quantile slotnumbers, we save the data value in an external file, one file per quantile. After reading all data, we sort the small external files and compute the exact quantiles.

      When the region of interest is infinite, we divide it into three main regions: a Mid region, which we split into many slots of equal length, and a Low and a High region, where we use slots of unequal length. We arrive at a general solution, which we call the ES-method (“Exact Slot”), for computing exact quantiles and all wanted outliers. The method is fast and stable, even for very large amounts of data. The user can easily modify it to include post-processing analysis.
      We believe that slots and the ES-method are useful tools for several complex questions, related to the distribution of the data values.
     We also discuss a “Hybrid-method”: We compute all quantiles in the Mid region to a user specified accuracy with only one read of all data, using many slots. In the infinite Low and High regions, we compute all wanted quantiles exactly, and all wanted outliers, using the ES-method. We only sort very small amounts of data. The method uses half the cpu-time of the ES-method.

 

The presentation consists of three parts:
  * The ES-method is described with a basic example and many figures.
  * Discussion in more detail with formulae and choice of parameters.
  * Computations and programming aspects.

 

 

The ES-method:  EXACT SLOT
Distribution used: To demonstrate the method we use data from a known theoretical distribution, the Gumbel distribution, with the parameters μ (mu)= 2 and β (beta)= 1. It is highly right-skewed, with positive and negative values. See Ref. (1). Any other distribution can be used.

      We use the expression Xi= mu - beta * ln( -ln(Rand(‘Uniform’)) ) to generate the random value Xi, where Rand(‘Uniform’) generates a uniform random value for each call and ln is the natural logarithm. The expression Xq(p)= mu - beta * ln( -ln(p) ), where p is a value between 0 and 1, can be used to compute the exact quantile Xq(p).
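The paper's programs are written in SAS; purely as an illustration, the two expressions can be sketched in Python (with the parameter values mu= 2 and beta= 1 used above):

```python
import math
import random

MU, BETA = 2.0, 1.0  # Gumbel parameters used in the paper

def gumbel_random(rng=random.random):
    """One Gumbel-distributed value: Xi = mu - beta*ln( -ln(U) )."""
    return MU - BETA * math.log(-math.log(rng()))

def gumbel_quantile(p):
    """Exact quantile: Xq(p) = mu - beta*ln( -ln(p) ), 0 < p < 1."""
    return MU - BETA * math.log(-math.log(p))
```

For example, the exact median is Xq(0.5) = 2 - ln(ln 2) ≈ 2.37.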
      In Fig. 1 we show the Gumbel probability density function (pdf) in blue, the cumulative distribution function (cdf) in red, and the 15 exact quantiles for the p-values 0.001, 0.010, 0.100, 1, 5, 10, 25, 50, 75, 90, 95, 99, 99.900, 99.990 and 99.999 percent. The mean is also shown in red.

 

[Fig. 1: Gumbel pdf (blue), cdf (red) and the 15 exact quantiles]

 

Computation of quantiles for discrete data values: Let us consider a file of N data values, sorted in increasing order. We call them Z1, Z2, …, ZN. The quantile Xq(p) for the value p, in the discrete set of data, is defined as the data value Zj, where j is the first integer such that j >= p*N. We call p*N the “quantile X-Index”. If p= 0.25 we use the first Zj-value with j >= 0.25*N.  25% of the data values are smaller than or equal to this Zj-value.
       Other rules may be implemented. See Ref. (2), (3) and (4) about rules for exact quantile computation. In Ref. (5), (6) and in the Slot-method below, we use the “mid-point slot rule” for approximate quantile computation.
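As a Python sketch (an illustration only; the paper's code is SAS), the index rule above reads:

```python
import math

def exact_quantile(sorted_values, p):
    """Quantile by the rule in the text: the first Zj (1-based j)
    with j >= p*N, i.e. j = ceil(p*N), clamped to at least 1."""
    n = len(sorted_values)
    j = max(1, math.ceil(p * n))
    return sorted_values[j - 1]
```

With N= 100 and p= 0.25 this picks the 25th smallest value, as in the text.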

 

Basic example on a finite interval: We use a basic example, with many restrictions, to explain our ideas. Gradually we remove the restrictions. We start with a file of random data values Xi, i= 1,2, …, NXTot, where NXTot denotes the total number of values. We want to find the Xi-values of the quantiles for this set of random values, for all the p-values, (0.001, 0.010, …, 99.999 percent), mentioned above. We use the Gumbel distribution to generate values and only keep values between 0 and 8. The data can be interpreted as “Yearly income of a person”, measured in MSEK (millions of Swedish crowns).

The “slot” concept: We divide the interval 0 to 8, into 50 equidistant intervals, here called slots. The first slot has the left and right boundaries 0.000 and 0.160 (8/50). The second slot has the left and right boundaries 0.160 and 0.320, etc. The last slot has the boundaries 7.840 and 8.000. The slots are numbered 1, 2, …, 50, which we call Slotindex.

 

      Let NMid denote the number of slots, and Mw the slotwidth, the width of each slot. In the basic example NMid= 50 and Mw= (8 - 0)/50= 0.160. For each slot we count the number of data values that fall in it, from zero up to several thousand. We store the counts in an array NX, with NMid elements. No Xi data values are stored in the array. A data value that falls on the boundary of a slot is always considered to fall in the next upper slot.
     We store the successive accumulated sum of count of values in a second array NAX, with NMid elements. The last element contains the total count of values, NXTot.
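The counting pass can be sketched in Python (illustration only; the names lo, hi and count_pass are ours, not the paper's):

```python
def count_pass(values, lo=0.0, hi=8.0, nmid=50):
    """First pass of the basic example: count the values per slot.
    A value on a slot boundary goes to the next upper slot.
    Returns the count array NX and the accumulated array NAX."""
    mw = (hi - lo) / nmid
    nx = [0] * nmid
    for x in values:
        i = int((x - lo) / mw) + 1   # 1-based slotindex
        i = min(i, nmid)             # the value hi itself stays in the last slot
        nx[i - 1] += 1
    nax, total = [], 0
    for count in nx:
        total += count
        nax.append(total)
    return nx, nax
```

Only the two small arrays are kept in memory; the data values themselves are not stored.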

 

How to compute the exact quantiles: We compute the quantile X-Index, QCXInd, as p*NXTot, for each quantile p-value, 0 < p < 1. To identify the slot, where the data value with sorted number p*NXTot falls, we compare the accumulated sum of counts of values in the slots to QCXInd,  i.e. we compare the successive values in NAX to QCXInd. The first slot, where NAX(i) >= QCXInd, is the quantile slot for the p-quantile. The index of this slot is the “quantile slotnumber”.
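In Python (an illustration, not the paper's SAS code) this lookup is a one-liner with the bisect module:

```python
import bisect
import math

def quantile_slotnumber(nax, p):
    """First slot i (1-based) with NAX(i) >= QCXInd, QCXInd = ceil(p*NXTot)."""
    qcxind = max(1, math.ceil(p * nax[-1]))
    return bisect.bisect_left(nax, qcxind) + 1
```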

 

[Fig. 2]

 

Quantile concepts:
   
quantile p-value           = Value between 0 and 1 (0 and 100 percent). 0.50 = Median.
quantile X-index, QCXInd   = Quantile p-value * NXTot. Integer between 1 and NXTot.
quantile slot              = The slot, the X-interval, where we find the quantile.
quantile slotnumber        = The index of the quantile slot. Integer between 1 and NMid.

     We read all the data values a second time and save the Xi-values that fall in the quantile slots in small “quantile data files”. The accumulated sum of counts before each quantile slot and the additional number needed to reach each quantile X-index, are known after the first read. We only need to sort the small quantile data files, to compute the exact values of all the quantiles. See “Discussion in more Detail” below.
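The second pass can be sketched in Python; small in-memory lists stand in for the external quantile data files (an illustration under our own naming, not the paper's SAS program):

```python
import bisect
import math

def es_exact_quantiles(values, nax, pvals, lo=0.0, hi=8.0):
    """Second pass of the ES-method on a finite interval (sketch).
    nax is the accumulated count per slot from the first pass."""
    nmid = len(nax)
    mw = (hi - lo) / nmid
    nxtot = nax[-1]
    # quantile X-index and quantile slotnumber for every p-value
    qcx = {p: max(1, math.ceil(p * nxtot)) for p in pvals}
    qslot = {p: bisect.bisect_left(nax, qcx[p]) + 1 for p in pvals}
    files = {s: [] for s in set(qslot.values())}  # one "file" per quantile slot
    for x in values:                              # second read of all data
        i = min(int((x - lo) / mw) + 1, nmid)     # recompute the slotindex
        if i in files:
            files[i].append(x)
    for s in files:
        files[s].sort()                           # only the small files are sorted
    result = {}
    for p in pvals:
        s = qslot[p]
        before = nax[s - 2] if s > 1 else 0       # accumulated count before the slot
        result[p] = files[s][qcx[p] - before - 1]
    return result
```

The exact quantile is the (QCXInd minus accumulated-count-before)-th smallest value inside its quantile slot, so only the small files need sorting.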

 

 

The basic example with 50’000 values, NMid (number of slots) = 50 and some quantiles.

quantile p-value in %   0.10   1.00   5       50       95       99       99.90    99.99
quantile X-index        50     500    2’500   25’000   47’500   49’500   49’950   49’995
quantile slotnumber     1      4      6       15       31       41       49       50
Xi - basic example      0.13   0.49   0.91    2.36     4.93     6.44     7.70     7.95

Slot examples: In the histograms in Fig. 2, 3, 4 and 5 we use a file with 50’000 random data values. We use the NMid values 50, 100, 200 and 3’200 and compute the 15 quantile slots.
       The height of each bar shows the count of values within that slot. The red bars are the "Quantile slots", the slots where the selected quantiles are found. The total sum of the counts of Xi-values in the Quantile slots, is shown in the title line. This is the total number of values that we need to sort, to find the exact value of all quantiles.
       When we double the value of NMid, both the "Sum of Count of values in the Quantile Slots" and the “Height of the highest slot” are approximately halved. If we study a data file with the double amount of data, and the original value of NMid, all these values are instead doubled.

 

[Fig. 3]

 

      When NMid increases through 50, 100, 200, 400, 800, 1’600 and 3’200, the "Sum of Counts of values in the Quantile Slots" decreases through 25, 12, 6, 3, 1.50, 0.75 and 0.41 % of NXTot.
      In Fig. 5 we have used NMid= 3’200 and NXTot= 50’000. This gives only 204 values in the sum of counts, which is a very small amount to sort. (With e.g. 5 million values and 3’200 slots we would need to sort around 20’000 values in total.)

     For NMid= 50 we can distinguish only 12 quantile slots, since the first slot contains three quantiles (0.001%, 0.010% and 0.100%) and the last slot contains two quantiles (99.990% and 99.999%). For NMid= 100, 200 and 3’200 we can distinguish 13 of the 15 quantiles.
      The quantiles are separated better with larger values of NMid. In Fig. 4, with 200 slots and 50’000 data values, we also begin to see how the count of values per slot fluctuates.

 

[Fig. 4]

 

       When the slotwidth is very small, we may use the mid-point of each quantile slot as a reasonable approximation to the quantile value. The absolute error is at most half the slotwidth. We call this the “mid-point slot rule”. It is applicable on a finite interval for computing approximate quantile values. The accuracy depends only on the length of the interval and the number of slots, i.e. on the slotwidth. The Slot-method requires only one read of all the data values. It cannot handle infinite intervals and it will not give us the exact quantile values. See “Slot-method” and “Hybrid-Method” below.
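As a Python sketch of the mid-point slot rule (illustration only; the paper's code is SAS):

```python
import bisect
import math

def midpoint_quantiles(nax, pvals, lo=0.0, hi=8.0):
    """Slot-method sketch: approximate each quantile by the mid-point of
    its quantile slot; the absolute error is at most half the slotwidth."""
    nmid = len(nax)
    mw = (hi - lo) / nmid
    nxtot = nax[-1]
    out = {}
    for p in pvals:
        qcxind = max(1, math.ceil(p * nxtot))
        s = bisect.bisect_left(nax, qcxind) + 1   # quantile slotnumber
        out[p] = lo + (s - 0.5) * mw              # mid-point slot rule
    return out
```

Note that no data values are read a second time and nothing is sorted; the price is the fixed error bound of half a slotwidth.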

       The accumulated sum of counts of values in the slots is an increasing step function. When we plot this function, we get a smooth and regular curve in Fig. 2 to Fig. 5.
       It is tempting to use linear interpolation of the Xi-value within a slot to compute the quantile that falls in that slot. We discussed computing quantiles using this linear interpolation in Ref. (6). However, the fluctuations in the count of values per slot mean that, in general, we cannot obtain better accuracy than half the slotwidth, which the simple slot-method already gives. Linear interpolation may even give worse accuracy for small slotwidths.

      In Fig. 5 below, with 3’200 slots, the “ruggedness”, i.e. the fluctuation of the count of values per slot, is much more pronounced than in Fig. 4. We also show the Gumbel frequency curve in blue.

 

[Fig. 5: histogram with NMid= 3’200 and the Gumbel frequency curve]

 

Infinite data region: Using the slot-method, see Ref. (6), we are restricted to a predefined region of interest and a pre-selected, often low, level of accuracy of the computed quantiles. In general, we cannot expect to use fixed lower and upper boundaries like 0 and 8 above.
     We now consider a situation where the data values range from minus infinity to plus infinity. We split the infinite x-axis into three regions: called “Low”, “Mid” and “High”. We use XLlim to denote the Lower limit of X in the Mid region and XUlim to denote the Upper limit.

      We split the Mid region (XLlim to XUlim) into equidistant slots as discussed above. The Low (-∞ to XLlim), and High (XUlim to +∞) regions are also divided into slots, which increase in width as we move further away from the Mid region. The first (leftmost) slot in Low, and the last (rightmost) slot in High are both of infinite width. We have defined the rightmost slot in Low and the leftmost slot in High to have the same width as the slots in the Mid region.
      We map the Xi-values in Mid, “XLlim to XUlim”, into the integers “1 to NMid”. In Low we use a mapping of the type 1/x, to map the values “-∞ to XLlim” into the integers “1 to NLow”, with all slots of different length. In High we map “XUlim to +∞” into “1 to NHigh”. The mapping formulae and the derivation of reasonable values for XLlim and XUlim are described in the section “Discussion in more Detail”.

      In Fig. 6 we show, in blue, the theoretical Gumbel distribution curve and all the 15 quantiles. We also show, in grey, a histogram for a small, random sample of 5’000 values (“the small sample”) of the basic example, without the limitation 0 to 8. We use an infinite region with NLow= 5, NMid= 50 and NHigh= 10. The computed quantile slots (shown in red) and the blue theoretical quantiles overlap or are adjacent, except for the 99% quantile.

      We have chosen to define the Low region to contain the 2% lowest data values (100 values) and the High region the 2% highest data values. This gives XLlim= 0.62 and XUlim= 5.84.
      The lowest data values fall into 5 slots of decreasing width (left to right). Slots 2 to 5 are shown in Fig. 6. We use a Mid region with 50 slots of equal width. The highest data values fall into 10 slots of increasing width (left to right). Slots 1 to 9 are shown in Fig. 6.
      We include XLlim, XUlim, (indicated by two broken vertical lines) and 12 red quantile slots in Fig. 6. The height of the bars equals the number of values in the slot.

 

[Fig. 6: small sample histogram with the Low, Mid and High regions and the quantile slots]

 

Some results of testing the ES-method:

In Fig. 7, 8 and 9 below we show how the cpu-time compares between different methods and increasing amount of data values, on an infinite region and 15 quantiles. The classical method to compute quantiles is to sort all the data values. We refer to this as the OS-method.
       In Fig. 7 we show the cpu-time for some methods: The OS-method (blue) can only be used for amounts of data up to 500 million values. No outliers or post processing is possible.
      When running the test programs several times, we get cpu-times that vary ± 10% or more for all the methods. The blue bands in Fig. 7 and 8 indicate this.

       The ES-method (red), and the Hybrid-method (red dashed) have been tested for amounts up to 2’000 million values. All wanted outliers are computed. See “Hybrid-method”.
       The ES- and Hybrid-methods are faster than the traditional OS-method for 0.5 million values or more. They both have a “start-up” cost (0.2 cpu sec).
       In Fig. 7 the NMid value varies with NXTot, for the ES-method. See “NMid criteria C1-C3”.
For the Hybrid-method, we use a user-defined, fixed value of the wanted absolute error, in the quantiles (AbsErr= 0.00005) to compute the number of slots in the Mid-region.

 

 

[Fig. 7: cpu-time for the OS-, ES- and Hybrid-methods]

 

       In Fig. 8 below, we show the cpu-time for the ES-method, for different values of NMid. The amount of data values, NXTot, has a fixed value (600, 1’200 and 1’800 million) for each curve, and NMid is preselected (not determined by criteria C1-C3).
      The cpu-time for the ES-method depends very little on NMid. It increases linearly with the amount of data. Very small values of NMid will make the quantile slots large and perhaps increase the total cpu-time.
       In Fig. 9 we show the cpu-times for the ES-method and the one-pass Slot-method discussed in Ref. (6). We use the ES-method on the infinite interval. It uses 60’000 slots in the Mid region when NXTot= 2’000 million.

       We use the Slot-method on the limited interval 0-8, with the “mid-point slot rule” to compute the approximate quantiles. No outliers are computed. We show four choices of NMid. The absolute error is half the slotwidth. With NMid= 20k we get AbsErr= 0.00020. See “Accuracy” below.

We list NMid vs AbsErr as: 

     20k -- 0.00020,   100k -- 0.00004,   1’000k -- 4*10**(-6),   5’000k -- 0.8*10**(-6).

 

 

[Fig. 8: cpu-time for the ES-method for different values of NMid]
 
[Fig. 9: cpu-times for the ES-method and the one-pass Slot-method]

 

      The cpu-time for each of the methods OS-, ES-, Slot- and Hybrid increases like Const *NXTot, where Const has very different values for the different methods. The cpu-time used is only one of several aspects to consider, when comparing different methods.
     We have used the ES- and Hybrid-methods on very small amounts of data (5 values) up to 2’000 million data values, to compute 15 quantiles and all wanted outliers.

 

Discussion in more detail:
  * Limits of the Mid region and the number of slots. Compute slotindex.
  * All regions (Low, Mid, and High) together.
  * Estimate NMid.
  * The ES-, Slot- and Hybrid-methods: Comparison and Extensions.

Limits of the Mid region and number of slots:
We need to determine values for the lower and upper limits of the Mid region and the number of slots (NLow, NMid and NHigh) in the Low, Mid and High regions. The exact values of XLlim, XUlim, NMid, etc. are not critical, but we want decent estimates. We use a heuristic technique.

       Boundaries of the Mid region: We assume that the data file is random. To compute estimates of the lower and upper limits of the Mid region we use a small random sample of 5’000 values (parameter RandIPop). We sort these data and compute the 2% and 98% (parameters pLow and pHigh) quantiles of this sample.
       These values are not good estimates of the 2% and 98% quantiles for the entire user data file, but we may use them as reasonable lower and upper limits of the Mid region. We denote them XLlim and XUlim (X-value Lower limit and X-value Upper limit). Roughly 100-2-2= 96% of all data values fall in the Mid region. These computations require only a little cpu-time.
       The choice of NMid determines the width of the slots, Mw= (XUlim-XLlim)/NMid, in the Mid region. Very wide slots will give us a large amount of data to sort for some quantile slots.

       Values for NLow and NHigh: The fractions of data values in the Low and High regions are pLow and 1-pHigh (2% in each region when the default values are used). Reasonable values of NLow and NHigh are therefore pLow*NSlot and (1-pHigh)*NSlot.
       In practice, however, we give NLow and NHigh the values MNLow * NMid and MNHigh * NMid, using the parameters MNLow and MNHigh, with the default values 0.10 and 0.20.

Compute slotindex - Mid region: We define the Mid region as the interval XLlim to XUlim. We use NMid uniform slots, each of length Mw. We map the Xi-values in “XLlim to XUlim” into the integers “1 to NMid”: we compute IM= Floor(AMid + BMid*(Xi - XLlim)) for each data value Xi. See Ref. (7) for the Floor function. IM is a positive integer; we call it the local slotindex of the Mid region.
      We use AMid= 1 and BMid= NMid/(XUlim - XLlim). An Xi-value between XLlim and XLlim + Mw gives IM= 1. A value between XLlim + Mw and XLlim + 2*Mw gives IM= 2, etc. Finally, a value between XUlim - Mw and XUlim gives IM= NMid.
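A Python sketch of this uniform mapping (illustrative; the paper's code is SAS):

```python
import math

def slotindex_mid(xi, xllim, xulim, nmid):
    """IM = Floor(AMid + BMid*(Xi - XLlim)),
    with AMid = 1 and BMid = NMid/(XUlim - XLlim)."""
    bmid = nmid / (xulim - xllim)
    return math.floor(1 + bmid * (xi - xllim))
```

With XLlim= 0, XUlim= 8 and NMid= 50, the value XLlim gives IM= 1 and a boundary value such as 0.16 goes to the next upper slot.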

IM              1             2              ...   NMid - 1       NMid
Width           Mw            Mw             ...   Mw             Mw
Left boundary   XLlim         XLlim + Mw     ...   XUlim - 2*Mw   XUlim - Mw
Right boundary  XLlim + Mw    XLlim + 2*Mw   ...   XUlim - Mw     XUlim

 

 

Compute slotindex - Low region: We also need to handle data values that range from large negative values (i.e. large absolute value) to “normal” data values. We use NLow slots in this region to handle all Xi-values smaller than XLlim. We use a mapping of the type 1/x, to map the values “minus infinity to XLlim” into the integers “1 to NLow”, with all slots of different length.


     We compute the local slotindex as IL= Floor(ALow + BLow/(CLow + Xi)), for all Xi-values smaller than XLlim. IL takes the values 1 to NLow. We require the last/rightmost slot NLow to have the same length, Mw, as the slots in the Mid region. The first slot in the Low region has infinite length. We arrive at the following values:
ALow= 1,   BLow= -NLow*(NLow - 1)*Mw   and   CLow= -(NLow - 1)*Mw - XLlim


        The value XLlim - NLow*Mw falls in a slot in the Low region with the local slotindex NLow/2 and the approximate width 4*Mw. The slots with the numbers NLow/2, …, NLow decrease in slotwidth from 4*Mw to Mw.
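A Python sketch with the constants just derived (illustration only; the values XLlim= 0, Mw= 1, NLow= 5 in the checks below are arbitrary):

```python
import math

def slotindex_low(xi, xllim, mw, nlow):
    """IL = Floor(ALow + BLow/(CLow + Xi)) with ALow = 1,
    BLow = -NLow*(NLow - 1)*Mw and CLow = -(NLow - 1)*Mw - XLlim."""
    blow = -nlow * (nlow - 1) * mw
    clow = -(nlow - 1) * mw - xllim
    return math.floor(1 + blow / (clow + xi))
```

With XLlim= 0, Mw= 1 and NLow= 5, the rightmost slot (-1 to 0) gets IL= 5, the boundary XLlim - (NLow-1)**2*Mw = -16 falls in slot 2, and everything below it in the infinite slot 1.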

 

IL              1                         ...   NLow/2                          ...   NLow - 1     NLow
Width           infinite                  ...   4*Mw (approx.)                  ...                Mw
Left boundary   -∞                        ...   XLlim - (NLow+3)*Mw (approx.)   ...                XLlim - Mw
Right boundary  XLlim - (NLow-1)**2*Mw    ...   XLlim - (NLow-1)*Mw             ...   XLlim - Mw   XLlim

 

Compute slotindex - High region: We use NHigh slots in the “High region” for all Xi-values greater than or equal to XUlim. We use a mapping of the type 1/x, to map the region “XUlim to plus infinity” into the integers “1 to NHigh”. This mapping is similar to the mapping in Low.
       The High region has slots of non-uniform width. We require that the first/leftmost slot has the same length, Mw, as the slots in the Mid region.  The last/rightmost slot has infinite length.

 

 We compute the local slotindex as IH= Floor(AHigh + BHigh/(CHigh + Xi)), where
       AHigh= NHigh + 1,   BHigh= -NHigh*(NHigh - 1)*Mw   and   CHigh= (NHigh - 1)*Mw - XUlim.


       The value XUlim + NHigh*Mw falls in a slot in the High region with the local slotindex NHigh/2 + 1 and the approximate width 4*Mw. The slots with the numbers 1, 2, …, NHigh/2 increase in slotwidth from Mw to 4*Mw.
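A Python sketch of the High mapping (illustration only; XUlim= 8, Mw= 1, NHigh= 10 in the checks are arbitrary):

```python
import math

def slotindex_high(xi, xulim, mw, nhigh):
    """IH = Floor(AHigh + BHigh/(CHigh + Xi)) with AHigh = NHigh + 1,
    BHigh = -NHigh*(NHigh - 1)*Mw and CHigh = (NHigh - 1)*Mw - XUlim."""
    bhigh = -nhigh * (nhigh - 1) * mw
    chigh = (nhigh - 1) * mw - xulim
    return math.floor(nhigh + 1 + bhigh / (chigh + xi))
```

With XUlim= 8, Mw= 1 and NHigh= 10, the first slot (8 to 9) has width Mw, the boundary XUlim + (NHigh-1)**2*Mw = 89 opens the infinite last slot, and very large values get IH= NHigh.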

IH              1            2              ...   NHigh/2 + 1                     ...   NHigh
Width           Mw                          ...   4*Mw (approx.)                  ...   infinite
Left boundary   XUlim        XUlim + Mw     ...   XUlim + (NHigh-1)*Mw            ...   XUlim + (NHigh-1)**2*Mw
Right boundary  XUlim + Mw   XUlim + 2*Mw   ...   XUlim + (NHigh+3)*Mw (approx.)  ...   +∞

 

     Fig. 10 shows the width of the slots in the High region. The black line, with an open circle at the right end, extends to plus infinity. The values on the X-axis refer to the basic “0 to 8” example. For data without restrictions, “8” corresponds to XUlim, and “16” to XUlim + (XUlim - XLlim).

 

[Fig. 10: width of the slots in the High region]

 

 

All regions (Low, Mid, and High) together: The total number of slots is NLow + NMid + NHigh = NSlot. Each data value Xi falls in a specific slot. Using the formulae above, we find the global slotindex for Xi. The slotindex is always a positive integer, like 1, 2, …, 4’160. The Xi-values are negative or positive, perhaps large.
      The Xi-value XLlim falls in slot NLow+1, the first slot in the Mid region. The Xi-value XUlim falls in slot NLow + NMid +1, the first slot in the High region.
      The mixture of “Almost Mid slot width, close to the Mid region” and “Very wide slots far away” is useful. The slotwidth in Low and High increases very slowly close to the Mid region. Outliers will show up in the rightmost slots in High and in the leftmost slots in Low.
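Putting the three mappings together (Python illustration; Mw is derived from the Mid region):

```python
import math

def global_slotindex(xi, xllim, xulim, nlow, nmid, nhigh):
    """Global slotindex 1..NSlot: the Low slots come first, then Mid,
    then High. XLlim falls in slot NLow+1, XUlim in slot NLow+NMid+1."""
    mw = (xulim - xllim) / nmid
    if xi < xllim:   # Low region, 1/x type mapping
        return math.floor(1 - nlow * (nlow - 1) * mw
                          / (-(nlow - 1) * mw - xllim + xi))
    if xi < xulim:   # Mid region, uniform mapping
        return nlow + math.floor(1 + nmid * (xi - xllim) / (xulim - xllim))
    # High region, 1/x type mapping
    return nlow + nmid + math.floor(nhigh + 1 - nhigh * (nhigh - 1) * mw
                                    / ((nhigh - 1) * mw - xulim + xi))
```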


Region                   Low                     Mid                            High
X-values                 -∞ to XLlim             XLlim to XUlim                 XUlim to +∞
Width of region          infinite                XUlim - XLlim                  infinite
Number of slots          NLow                    NMid                           NHigh
Slotwidth                ∞ to Mw                 Mw                             Mw to ∞
Data values (of NXTot)   pLow*NXTot (approx.)    (pHigh-pLow)*NXTot (approx.)   (1-pHigh)*NXTot (approx.)
Local slotindex value    IL= 1, 2, …, NLow       IM= 1, 2, …, NMid              IH= 1, 2, …, NHigh
Global slotindex I       1, 2, …, NLow           NLow+1, …, NLow+NMid           NLow+NMid+1, …, NSlot
 

Estimate NMid: We want a decent estimate of NMid. If the value is small the quantile data files to sort will be large and require more cpu-time. If NMid is very large the cpu-time to handle the slot arrays will increase. The Figs. 7, 8 and 9 show that the choice of value for NMid is not critical.

 

      Here we discuss some criteria for estimating useful values for NMid, for the ES-method. These shall only be regarded as guidelines. The user may want to use other criteria and other limits in the criteria, depending on the user requirements, computer and computer language used.

 

      Our idea to estimate NMid is simple. The data file is more dense in certain parts, i.e. the number of data values per slot is larger there. We use the small sample with RandIPop values and compute all 5, 10, …,90, 95% quantiles and also the length of each interval between these quantiles (Xq10 - Xq5, Xq15 - Xq10, …, Xq95 - Xq90). Each interval contains 0.05*RandIPop values.

 

      Let XShort denote both the shortest interval and its length. XShort will contain several slots. The average number of data values per slot in XShort is larger than in all other 5% intervals. We assume that the small sample is big enough to make the density of the XShort interval representative for the most dense interval of the user file, with NXTot data values.
     Using 5’000 values, out of NXTot values, we get Xq (.02)= 0.62, Xq (.98)= 5.84, XShort= 0.131. We introduce LaSlot,  “Total number of values in XShort”  divided by  “Number of slots in XShort”. Thus LaSlot= 0.05 * NXTot * (XUlim – XLlim)/ (XShort*NMid) is the Average number of values in a slot in the XShort interval.
      Introduce CX= 0.05 * (XUlim – XLlim)/XShort. Then LaSlot= CX * NXTot/NMid. This gives NMid= CX*NXTot/LaSlot. We keep LaSlot small (compared to NXTot) to avoid having very large quantile data files to sort. Computer runs show that LaSlot= 20’000 may be suitable.
      The CX-value is independent of parameters for the Normal and Gumbel distributions. For Uniform CX= 0.96, for Normal CX= 1.64 and for Gumbel CX= 1.94.

 

NMid criteria C1-C3:
       C1: The average number of values per slot in the dense region: NMid= CX * NXTot/LaSlot.
                CX= 0.05*(XUlim-XLlim)/XShort, where XShort is the length of the most dense 5%
                    interval of the RandIPop population.
       C2: NMid shall not be extremely large. NMid <= MaxNMid. MaxNMid= 60’000 is suitable.
       C3: NMid shall not be too small. NMid >= MinNMid. MinNMid= 100 is suitable.
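The three criteria combine into a simple clamp (Python illustration; CX= 2.0 in the checks is a round stand-in close to the Gumbel value 1.94 given above):

```python
def estimate_nmid(nxtot, cx, laslot=20_000, min_nmid=100, max_nmid=60_000):
    """C1 proposes NMid = CX*NXTot/LaSlot; C2 caps it at MaxNMid;
    C3 raises it to at least MinNMid."""
    c1 = cx * nxtot / laslot
    return int(min(max_nmid, max(min_nmid, c1)))
```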


Values of NMid for each criterion. NXTot in millions. LaSlot= 20’000. The NMid row shows the final “default” value for NMid used in the ES-method. In the last row we show the result of some computer experiments:  QNX_QMax, the number of values in the largest quantile data file.

NXTot (millions)   1        5        10       50       100      500      1’000     2’000
C1:                100      500      1’000    5’000    10’000   50’000   100’000   200’000
C2:                60’000   60’000   60’000   60’000   60’000   60’000   60’000    60’000
C3:                100      100      100      100      100      100      100       100
NMid               100      500      1’000    5’000    10’000   50’000   60’000    60’000
QNX_QMax           18’102   18’103   18’303   18’338   18’339   18’197   30’112    60’672

 

The dependence of the cpu-time on NXTot and NMid (NSlot) – the ES-, HYBRID-methods:
The results presented here are only guidelines. The values are difficult to predict and may be different on another PC. The cpu-time increases linearly with NXTot.

      The cpu-time does not increase when using a larger number of slots NMid (NSlot). The reason is that the “slot administration cost” increases with an increasing number of slots while, at the same time, the cost of sorting the quantile data files decreases. See Fig. 7 and 8.

 

The Slot-, ES- and Hybrid-methods:  Comparison and Extensions

Slot-method                                        ES-method
Finite predetermined region.                       Infinite region - left and/or right.
Quantiles computed to a user selected              Exact values of all quantiles, and all wanted
(often low) accuracy. No outliers.                 outliers. Values can be investigated in post
                                                   processing.
Many slots needed for good accuracy.               Only a small number of slots is needed.
A lot of memory is needed.                         Only little memory is used.
Only one pass thru all data.                       Two passes thru all data + some sorting.
No sorting.                                        Sort only a small number of data values.
Good for low accuracy computing.                   A very general method that can be modified for
Cheap, but gives a good overview.                  specific user needs and/or extreme volumes of
                                                   data. See also “Hybrid-method” below.
Described in Ref (6).                              Described in this paper.

 

Slot-method: We compute approximate values for quantiles, but only in a predetermined limited region. No outliers or post processing is supported.
Accuracy:
AbsErr= (Region size for the Slot-method) / (2 * Number of slots in the Slot region) is the error of the quantiles using the “mid-point slot rule”. See also “Accuracy” in Ref. (6).
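Inverting this formula gives the number of slots needed for a wanted accuracy (Python illustration):

```python
def slots_for_abserr(region_size, abs_err):
    """Number of slots so that the mid-point slot rule error,
    region/(2*NSlot), matches abs_err (rounded to the nearest integer)."""
    return round(region_size / (2 * abs_err))
```

With the 0-8 region and AbsErr= 0.00020 this gives the 20k slots quoted above.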

 

ES-method: With the ES-method, we always compute exact values for the quantiles. We first read all data values once, using the uniform mapping in the Mid region and the 1/x mapping in the Low and High regions. In a second read we save data to several quantile data files.
      We only sort small files to compute the exact values of the wanted quantiles. Only a low number of slots is needed. The ES-method requires 2 passes thru all data + 15 small data files to sort. We can compute all outliers, according to user requirements.

 

Hybrid-method: The Hybrid-method allows a very flexible mix of the Slot- and ES-concepts. It combines the Slot-method in the Mid region, with the ES-method in the infinite Low and High regions. The Hybrid-method computes quantiles in Mid, to a user-defined absolute accuracy, and computes exact quantiles and outliers in Low and High, “at half the cost for the ES-method”.
       We first read all the data values once and find all the quantile slots. We save all data in the Low and High regions in two separate files.
       We compute the quantiles in the Mid region approximately, to a user specified absolute error, using the “mid-point slot rule”.  We avoid sorting but need many slots to get a good accuracy.

 

       To compute the quantiles in the Low and High regions, we read the Low and High data files and save all values from each quantile slot in separate quantile data files. We sort these small files, to compute the exact value of each quantile. Fig. 7 shows the gain in efficiency.

 

The Computations:

At the top of the program we specify the value of some parameters using Macro variables.
We can run the program in ES- or HYBRID-mode.  E.g.  %LET MODE= HYBRID;

 

External parameters - given values as Macro variables:

Name        Default value and use of the parameter.
EDSID *     Necessary. Name of the data file to analyse. Stored in the MYLIB directory.
            The number of values, NXTot, is computed by the SAS System.
MODE *      ES or HYBRID - ES is the default.
VER *       Useful when examining the same table in several ways.
AbsErr *    0.00005 - ONLY used when MODE= HYBRID, to define the Mid slotwidth.
OutLieN *   20 - Display the OutLieN lowest and highest data values.
NMid *      Use a non-zero value. The value computed using C1-C3 is then not used.

Define Mid region - Use the RandIPop data sample.
RandIPop    5’000 - Used for computing the pLow, pHigh quantiles and XShort.
pLow        0.02 - Compute the 2% quantile to estimate XLlim.
pHigh       0.98 - Compute the 98% quantile to estimate XUlim.

Compute NMid
LaSlot      20’000 - Criterion C1. Average number of values in a slot in the most dense area.
MaxNMid     60’000 - Criterion C2.
MinNMid     100 - Criterion C3.
MNLow       0.10 - NLow= MNLow*NMid. Number of slots in the Low region.
MNHigh      0.20 - NHigh= MNHigh*NMid. Number of slots in the High region.


       To use another value for AbsErr than the default, specify e.g. %LET AbsErr= 0.000001;. When we specify NMid= 50000, this value is used in the ES-method and in the computations in the Low and High regions in the HYBRID-method; the criteria C1-C3 are then not used. In Fig. 8 we used preselected NMid values.
       The red star * denotes the external parameter EDSID that the user must provide, and also parameters that the user often wants to specify (MODE, VER, AbsErr, OutLieN, NMid).
       We have tried to give the parameters “reasonable” default values, see the table above. “Bad values” of these parameters may increase the cpu-time and/or the memory used. Our experience is that the ES-method is not sensitive to the choices of the parameters.


      “Crazy parameters” - During program testing, the size 1'000 million was once interpreted as 1'000 values. This resulted in HUGE areas to sort, but the run was still rather fast and robust, and the quantiles were exact.
      We have considered including an “INITIAL” mode, which only reads the RandIPop data. This would validate the user parameters, compute an overview, and calculate the estimates of XLlim, XUlim and NMid for ES. This would be fast and very useful for an “unknown” data material. It is easy to implement as a separate extra computer program.

 

Initialize the ES-method: First compute a small random sample with RandIPop values from the large user data file. Compute the pLow, pHigh quantiles of the sample file. Use these values as XLlim, XUlim. Compute XShort, the 5% interval with the highest number of values, in the sample.
     Use these values together with the values for LaSlot, MinNMid and MaxNMid in the computation of NMid. Finally use MNLow and MNHigh to compute NLow and NHigh.



Initialize the computation of quantiles: We define an array NX with NSlot elements. We use the element with number j as the “counter” for the number of Xi-values that fall in the slot with global slotindex j. We define another array NAX (optional, but useful), the accumulated sum of counts of data values, with NSlot elements. The element NAX(j) is the sum of NX(1) to NX(j).  Another optional and useful array is XLeft, the left slot boundary, with NSlot elements.
       We define an array pVal, where we store the p-values of the wanted quantiles. We use the 15 p-values 0.001, 0.010, 0.100, …, 99.999 percent. We use QNX for the number of values in the quantile slot, QCXInd for the quantile (integer) X-index ROUND(p*NXTot), one for each p-value.
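As an illustration (a Python sketch, not the paper's SAS code), the Floor-based mapping from a data value to its global slotindex in the Mid region can look as follows. The function and parameter names (mid_slot_index, NLow, etc.) are our own:

```python
import math

def mid_slot_index(x, XLlim, XUlim, NMid, NLow):
    # Global slotindex for a value in the Mid region [XLlim, XUlim).
    # Slots 1..NLow cover the Low region; the next NMid slots cover Mid.
    width = (XUlim - XLlim) / NMid          # equal Mid slotwidth
    return NLow + math.floor((x - XLlim) / width) + 1

# Basic example: XLlim= 0.62, XUlim= 5.84, NMid= 50, with 5 Low slots.
# A value at the left edge of Mid falls in the first Mid slot:
print(mid_slot_index(0.62, 0.62, 5.84, 50, 5))
```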

 

Quantile concepts:
quantile p-value, pVal           = Value between 0 and 1 (0 and 100 percent). PVal100= pVal*100.
quantile X-index, QCXInd     = Round (pVal * NXTot). Integer between 1 and NXTot.

quantile slot,                            = The slot, the X-interval, where we find the quantile.
quantile SlotNo, QSlotNo      = The slotindex for the quantile. Integer between 1 and NSlot.
quantile data file                     = An external file with all data values that belong to the quantile slot.

 

The basic example with 50’000 values in an infinite region:  XLlim= 0.62,  XUlim= 5.84. 
Low:  -∞  to XLlim.  Mid: XLlim to XUlim.  High: XUlim to +∞.   We specify NMid= 50.

 PVal100    QCXInd    QSlotNo    QNX       Xi
 0.010           5        2        28     -0.265
 0.100          50        3       246      0.054
 25         12'500       15     1'864      1.675
 50         25'000       22     1'786      2.362
 75         37'500       30     1'168      3.234
 99.900     49'950       63        86      8.953
 99.990     49'995       64        30     11.138
 99.999     50'000       64        30     12.838


Compute quantiles:
First step: In a loop we read an Xi-value, compute the slotindex and update the counter for that slot by one. The slotindex definition uses the Floor function, which gives the closest lower integer. We do not store any Xi-value, only the value of the counter in the array element NX(slotindex). We repeat the process until all data values have been read.

       Using the information in the array NX we know how many data values fall in each slot. We compute the accumulated sum of counts and save these values in the array NAX. For each quantile we compute the quantile X-index QCXInd and compare it to the values in NAX. The quantile slotnumber, QSlotNo, is the index of the first element in NAX whose value >= QCXInd.

 

After the first read of all the data, we know for each quantile:
   * PVal100, the p-value for the quantile (0 to 100) in percent.
   * The quantile X-index, QCXInd= Round(pVal*NXTot). Integer between 1 and NXTot.
   * The quantile slotnumber, QSlotNo, the slotindex for the quantile slot. Integer 1 to NSlot.
   * The accumulated sum of counts before this slot, i.e. NAX(QSlotNo - 1).
   * The number of values, QRemVals, inside this slot, needed to reach the X-value of the quantile.

     Second step: In a loop we read all the data values a second time. For each data value, we again compute the slotindex. If it equals one of the quantile slotnumbers, we have found a value in a “quantile slot”. We save the data value in an external file, a “quantile data file”, one file per quantile. The if-statements that compare the slotindex to the quantile slotnumbers are executed in “most-probable-first” order, to increase the speed.
      After reading all data values a second time, we sort all the quantile data files. In each of these data files, the element number QRemVals is the X-value of the quantile.
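The two read steps above can be sketched as follows, in Python rather than the paper's SAS Data-step. This is a minimal sketch that assumes all values fall in the Mid region [XLlim, XUlim); in-memory lists stand in for the external quantile data files, and the function name and structure are ours:

```python
import math

def es_quantiles(data, XLlim, XUlim, NMid, pvals):
    width = (XUlim - XLlim) / NMid
    NX = [0] * (NMid + 1)                        # slot counters, 1-based
    for x in data:                               # first read: count per slot
        j = min(NMid, math.floor((x - XLlim) / width) + 1)
        NX[j] += 1
    NXTot = len(data)
    NAX, s = [0] * (NMid + 1), 0                 # accumulated sum of counts
    for j in range(1, NMid + 1):
        s += NX[j]
        NAX[j] = s
    qslots = {}                                  # QSlotNo -> its quantiles
    for p in pvals:
        QCXInd = round(p * NXTot)                # quantile X-index
        QSlotNo = next(j for j in range(1, NMid + 1) if NAX[j] >= QCXInd)
        qslots.setdefault(QSlotNo, []).append((p, QCXInd))
    files = {j: [] for j in qslots}              # stand-ins for the external files
    for x in data:                               # second read: keep quantile slots
        j = min(NMid, math.floor((x - XLlim) / width) + 1)
        if j in files:
            files[j].append(x)
    result = {}
    for j, qs in qslots.items():
        vals = sorted(files[j])                  # sort only the small files
        for p, QCXInd in qs:
            result[p] = vals[QCXInd - NAX[j - 1] - 1]   # the QRemVals-th value
    return result

data = [i / 1000 for i in range(1000)]           # 1'000 values in [0, 0.999]
print(es_quantiles(data, 0.0, 1.0, 10, [0.25, 0.5]))  # exact order statistics
```

Note that the result is exact regardless of how values land on slot boundaries, since both reads use the same monotone mapping.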

 

Hybrid-computation: In the Hybrid-method the idea is to read all data only once. We use the parameters computed for the ES-method, with the addition of the HYBRID-specific parameter AbsErr.
      We compute NMid according to the ES-method and the corresponding values for NLow and NHigh. The local mapping of the Xi-values in the Low and High regions uses the ordinary ES formulae, where the last slot in Low and the first slot in High use the ordinary ES slotwidth (XUlim-XLlim)/NMid (called EsMw in the program).

      We calculate HybNMid, the number of slots needed in the Mid region, by rounding the expression HybNMid = (XUlim - XLlim)/(2*AbsErr) upwards. The total number of slots used in the Hybrid-method is NLow + HybNMid + NHigh. The arrays NX and NAX each have this length.


       We read all the data and use the mapping formulae to populate the slot array NX, covering the Low, Mid and High regions. In the Mid region we use the slotwidth (XUlim - XLlim)/HybNMid.
       All data values in the Low and the High regions are saved in two external files, containing approximately 2% plus 2% (pLow + 1 - pHigh) of NXTot.

      We populate the NAX array with the accumulated sum of counts from NX. We identify all the 15 quantile slots by comparing the NAX-values to the Quantile X-index values as in the ordinary ES-method. With our standard choice of p-values we have 7 quantiles in the Mid region and 4 each in the Low and High regions.


      Mid-region: We use the “mid-point slot rule” to compute approximate values of the quantiles to a user pre-defined accuracy, AbsErr. The data in the Mid- region are not read a second time and they are not sorted.

      In the basic example the total width of the Mid region is (5.84-0.62) = 5.22. The number of slots used in the Mid region does not depend on NXTot, only on the required accuracy AbsErr. In Fig. 7 we have used HybNMid=52’200 since default AbsErr=0.00005.

      A large value of HybNMid is needed for a high accuracy of the quantiles in the Mid region. If we want an accuracy say 100 times better than the default, then HybNMid needs to be 5.22 million and far more memory is needed. Perhaps the ES-method is better to use. See Fig 7.
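The mid-point slot rule can be sketched as follows (a Python illustration with our own names, not the paper's SAS code). The returned slot mid-point differs from the exact quantile by at most half a slotwidth, i.e. at most AbsErr:

```python
import math

def hybrid_mid_quantile(data, XLlim, XUlim, AbsErr, p):
    HybNMid = math.ceil((XUlim - XLlim) / (2 * AbsErr))   # slots in Mid
    width = (XUlim - XLlim) / HybNMid                     # width <= 2*AbsErr
    NX = [0] * (HybNMid + 1)
    for x in data:                                        # one pass, no sorting
        j = min(HybNMid, max(1, math.floor((x - XLlim) / width) + 1))
        NX[j] += 1
    QCXInd, s = round(p * len(data)), 0
    for j in range(1, HybNMid + 1):                       # find the quantile slot
        s += NX[j]
        if s >= QCXInd:
            return XLlim + (j - 0.5) * width              # slot mid-point

data = [i / 1000 for i in range(1000)]
# Median, approximated to within AbsErr= 0.01 of the exact value 0.499:
print(hybrid_mid_quantile(data, 0.0, 1.0, 0.01, 0.5))
```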


Low- and High-regions: To find the quantiles in the Low and High regions we read the Low and High files and compute the slotindex. We save all data corresponding to the Low and High quantile slots in very small external quantile data files. We sort each of them and compute the exact value of each quantile in the Low and High regions. These computations are faster than in the ES-method, since in the second read we only read the Low and High files, not the entire data file. This makes the Hybrid-method much faster than the ES-method, but the quantiles in Mid are not exact and more memory is used.


The HYBRID-example with 50'000 values in an infinite region:  XLlim= 0.62,  XUlim= 5.84.  We use AbsErr= 0.005 and NMid= 50. (This gives HybNMid= 523 and NSlot= 538.) The values marked * are approximate; they differ from the exact ES values. The first 5 slots (Low) and the last 10 (High) are the same as in ES.

 PVal100    QCXInd    QSlotNo    QNX       Xi
 0.010           5        2        28     -0.265
 0.100          50        3       246      0.054
 25         12'500      111       178      1.673 *
 50         25'000      180       155      2.362
 75         37'500      267       104      3.231 *
 99.900     49'950      536        86      8.953
 99.990     49'995      537        30     11.138
 99.999     50'000      537        30     12.838

 

Performance of the ES- and Hybrid-methods: The exact details of the performance of these methods depend on the programming environment. Many modern computer languages, which support fast handling of internal arrays and storage of data, can be used. We have implemented these methods using the programming language and the data storage facilities in the SAS System, but only used the facilities that are general for this type of computer language. The methods work correctly and are very stable when 200 <= NXTot <= 2’000 M, and NMid >= 50.

      Our experience is that these methods work very well, for a wide range of values for NMid and other parameters, like LaSlot and MaxNMid. The process is stable and produces the correct results, for all cases that we have tested. “Bad choices” may require some more cpu-time.


Standard ways of computing the quantile value: In the SAS System the parameter QNTLDEF (also called PCTLDEF) can take the values 1 to 5. The default is 5, which means “use the average of two neighbouring Xi-values”. Our choice of computing quantiles corresponds to values 1 to 4. When NXTot is not small, the values 1 to 4 all give the same result. See Ref. (3), (4) and (5).

 

Outliers: In the ES- and Hybrid-methods we also compute all user wanted “Outliers”. We use the information in the array NAX to identify the lowest and highest slots, that contain the outliers. In the second read we save the data values in these slots, in one file for Low outliers and in another file for High outliers. The Low file is sorted in increasing order and the High file in decreasing order. We keep OutLieN (default 20) data values for each.
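The Low-outlier step can be sketched as follows (a Python illustration with our own names; the paper keeps the values in external files). Pick enough of the lowest slots to hold at least OutLieN values, collect those values in the second read, sort, and keep the OutLieN smallest:

```python
def low_outliers(data, NX, slot_of, OutLieN=20):
    # Find the cutoff slot: the lowest slots whose counts sum to >= OutLieN.
    need, cutoff = 0, 0
    for j in range(1, len(NX)):
        need += NX[j]
        cutoff = j
        if need >= OutLieN:
            break
    # Second read: keep only values from the cutoff slots, then sort.
    picked = sorted(x for x in data if slot_of(x) <= cutoff)
    return picked[:OutLieN]          # the OutLieN lowest data values

# Toy check: 100 values, 10 per slot; we expect the 20 lowest values.
slot_of = lambda x: x // 10 + 1
NX = [0] + [10] * 10
print(low_outliers(list(range(100)), NX, slot_of, 20))
```

The High outliers are handled the same way from the top slots, with the file sorted in decreasing order.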

 

Post processing:  The values in the arrays NX and XLeft can be used together with all the quantile data files for further analysis.


Implementation and comparison: A comparison between the OS-, ES-, Hybrid- and other methods regarding robustness, time, and accuracy would be very interesting. We only have access to the classical method of sorting of the entire file of data and can only make that comparison. Two SAS procedures, Univariate and Means, use the OS-method. We have used system sort and, in the Data-step, computed all the quantiles. See Ref. (8), (9), (10), (11), (12).


Overview of the different methods: Compiled versions would be faster.
     OS:              Compiled optimized code for sorting. Plus, a fast non-compiled Data-step.

     ES:              Not compiled. Several Data-steps that take cpu-time, 15 small sorts. The first and
                        second read of all the data values, take most of the total cpu-time. To sort the 15
                        external files only takes a little cpu-time.
     Slot:            Not compiled. No sorting.
     Hybrid:        Not compiled. One pass through all data + write, read and analyse small files for Low
                        and High. Only 8 very small sorts.


Overview cpu-time:  “slot” = compute slotnumber. Update Array(slotnumber).
      OS =           a*NXTot + b

      ES =           c*NXTot (read/slot) + d*NXTot (read/slot. Test 15 quantiles) +
                            e*NSlot (find 15 quantiles) +  f(write/sort. 15 quantile files)
      Hybrid =    c*NXTot (read/slot) + g*0.04*NXTot (write/read/slot. Test 2*4 quantiles) +
                           e*HybNSlot (find 15 quantiles) + h(write/sort. 2*4 quantile files)
Dependence on NXTot:    OS: ~ a         ES: ~(c+d)         Hybrid: ~(c+g*0.04)

User requirements – some points to consider when choosing method: 
   * Accuracy: low / medium / high / Exact.  Same accuracy everywhere / Higher in certain parts.
   * Region: infinite / half infinite / finite.

   * Quantiles wanted: Median / 5% quantiles / 15 quantiles / Extreme only.
   * Computer: Normal Home PC / Shared Server / Powerful computer.
   * Outliers: User specifies requirements.

 

Computer and computing environment: We have used the SAS System running on a PC with several cores, using Microsoft Windows 11 Home, where “Word is a bit slow”, with 16 Gbyte of primary fast memory. We use three arrays NX (necessary) and NAX, XLeft (both optional) of NSlot elements each. E.g. 60'000 elements use 3*8*60'000 bytes, about 1.4 Mbyte of fast memory.
     When running the programs several times, the cpu-times vary ± 10% or more, for all programs: Creation of the test data, the OS-, ES-, Slot- and Hybrid-methods. (Internet and WiFi are closed. No email. The SAS System is closed after each run. Temporary files are cleaned away.)
      The computed quantiles and all other derived quantities, like the amount of data values to sort, are the same. All the methods work entirely correct, in all the test runs.

 

      Cpu-time and memory are important when discussing methods. Many other aspects are also important. Our emphasis, when writing the computer program, has been on writing a very clear code, which is essentially independent of the programming language. No “SAS only” features.
     We have tried to write a very simple, robust, and well-documented (” slightly lengthy”) code. The method used should always be “robust”, i.e. always work and always give correct results – even with “rather bad parameters”. It is also important that the user can understand the code and modify the method according to his/her needs.
     The computer language we have used is the Data-step language in the SAS System. This is a table handling package, where the Data-step is a loop over all data values. We use it in a more traditional “Fortran like” way. We read the data values, use internal arrays for counters, and compute the wanted quantities. We use additional files for the quantile data files.

 

Comments:
* Simple SAS macro variables are used for replacing text, like a text editor, and for simple
     conditional execution of parts of the code. We use SYMEXIST (“macro_var”) in the Data-step.
     With “read” and “save” we mean read from and save to SAS data files. See Ref. (8).
* If we had used more advanced SAS macro features, the program would have been much
     shorter. But the cpu-time would have been the same, and the time an ordinary user needs
     to fully understand the code would have been much longer.
* We use temporary arrays to store the slot counter information. They are like Dimension in
     Fortran. The maximum size is 130 million array elements on our PC, Ref. (8). Depending on
     computer and computer memory, the elements with high index are more expensive to use.
* The areas under the histograms in Fig. 2 to Fig. 5 are NXTot*8/NMid. In Fig. 5 and 6 we have
      scaled the pdf-curve to make the sum of the “histogram heights” above and below the curve
      the same. In Fig. 5 we have used a curve to connect the points, since a histogram cannot show
      graphically the ups and downs of the values. We have also made the red quantile slots much
      broader, to make them visible. We use the SAS/GRAPH specific %SGRectangle macro to draw
      the 63 rectangles in the histogram in Fig. 6.
* In the SAS System it is, at present, faster to use one external file for each quantile, than to use
       one array for each quantile.

 

Conclusions: The ES-method is a reliable and fast method for computation of any desired set of quantiles for large amounts of data. The slot array makes it possible to apply other computational methods in the post processing. The advantages of the ES-method compared to other methods:
       + Exact quantiles and any wanted outliers are computed in a robust way.
       + Fast compared to the OS-method.
       + Only a little internal memory and external disc space is used.
       + Can be coded in many modern computer languages.
       + Simple and straightforward coding. Can easily be modified in several ways.

 

     The ES-method (“Exact Slot”) presented here, can be combined in several ways, e.g. with the Slot-method (Ref. 6) depending on the needs of the user.
      The Hybrid-method, described above, is a combination of the “good aspects of the ES-method” and “the speed of the Slot-method”. It computes, in a fast way, all outliers and all wanted quantiles to a user specified accuracy or exactly. See “Hybrid-method” and “Hybrid-computation”.
     We hope that these methods will work as a “toolbox”, to help the user to compute all the wanted quantiles and outliers and find ways to investigate other user defined quantities.


The Author: Anders Sköllermo, Ph.D. in “Applied Numerical Analysis” at Uppsala University, Sweden in 1976. Actuary, IT and SAS Specialist. email: anders.skollermo@one.se


Acknowledgement:
* With all my heart I want to thank my wife Ph.D. Gunilla Sköllermo for all her support in this
      work and for many valuable discussions.

* Ph.D. Jim Goodnight at SAS Institute and SAS Education who allowed me to use
      “Windows SAS on PC” and SASODA i.e. “SAS on Demand for Academics”.

* All SAS users who have answered many SAS questions on www.sascommunity.com.
     Really great with all these contacts. Many thanks!


Appendix - Possible extensions:

* Compute “Sum” quantiles together with ordinary “count” quantiles, at no extra cost. E.g. in the
       basic example find the 25% sum quantile: The income where the accumulated income is 25%
       of the total income. This corresponds to say 30 or 35% of the persons. See also Ref. (6) and
       Fig. 5A, 5B in “Calculate the Total Income (Tot Inc) distribution”.
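The idea can be sketched as follows (a Python illustration with our own names): in the same counting pass, also accumulate the sum of the values per slot; the p sum quantile slot is then the first slot where the accumulated sum reaches p of the total:

```python
def sum_quantile_slot(data, slot_of, NSlot, p):
    SX = [0.0] * (NSlot + 1)         # per-slot sum of values (not counts)
    total = 0.0
    for x in data:                   # the same single pass as the count slots
        SX[slot_of(x)] += x
        total += x
    acc = 0.0
    for j in range(1, NSlot + 1):    # first slot where the sum reaches p*total
        acc += SX[j]
        if acc >= p * total:
            return j

# Toy incomes 1..100 in 10 slots: the 25% sum quantile lies well above the
# 25% count quantile, since the high slots carry most of the total income.
incomes = list(range(1, 101))
print(sum_quantile_slot(incomes, lambda x: (x - 1) // 10 + 1, 10, 0.25))
```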


* Compute quantiles on data with several variables: E.g. with Income, Gender, and Age: For a
       certain income we use 3 Gender levels and 5 Age levels. In total NSlot*3*5 slots.
       See also Ref. (6) and “More variables - Further analysis with more variables from the EDS”.


* Speed-If: Use the quantile slot boundaries in the if-statements in the second read, instead of
       the slotindex. Perhaps faster, but more complex programming.


* Only Low and High regions: Use the 1/x mapping only, and no Mid region.


* Lower Accuracy: At present all data and all computations use 8 bytes. To use 4 bytes in data and
       computations would be faster. Perhaps the lower accuracy of 6-7 digits would be sufficient.

References:

(1) Gumbel distribution – https://en.wikipedia.org/wiki/Gumbel_distribution
(2) Quantile - https://en.wikipedia.org/wiki/Quantile

(3) Rick Wicklin, “Sample quantiles: A comparison of 9 definitions” - https://blogs.sas.com/
(4) Rob J Hyndman and Yanan Fan, “Sample quantiles in statistical packages”; errata and
         “Sample quantiles 20 years later” - https://robjhyndman.com/publications/quantiles/
(5) A. Sköllermo, “A Fast and Accurate One-pass Method for Computing All Quantiles of Very
             Large Data Sets” - www.researchgate.net
(6) A. Sköllermo, “Fast and Accurate Calculation of Descriptive Statistics of Very Large Sets of
             Data” - www.researchgate.net
(7) Floor and ceiling functions - https://en.wikipedia.org/wiki/Floor_and_ceiling_functions

(8) The SAS System - www.sas.com
(9) C. Masson, “A fast and fully-mergeable quantile sketch with relative-error guarantees”.
(10) G. Cormode, “Small Summaries for Big Data”, Cambridge University Press 2020.
(11) T. Dunning, O. Ertl, “Computing Extremely Accurate Quantiles Using t-Digests”.
(12) Z. Chen et al., “A Survey of Approximate Quantile Computation on Large-scale Data”.

 

SAS-programs appended to this paper:

 Name   Short description   Description
 P1     OSES                Generate a file with test data, to be used in P2 and P3.
 P2     OS                  Compute all quantiles, using classical sorting of the data file.
 P3     ES_HYBRID           Compute all quantiles using the ES- or HYBRID-method.


The program has been tested using:
 200 <= Total number of data values (NXTot) <= 2’000 million    and
Number of slots in Mid (NMid) >= 50.

We have added documentation in all programs.
Just “Download and run”.

 

Comments

Hi @AndersS, thank you. I like your practical approach. I myself am also interested in extreme quantiles. Have you thought about modifying or extending your approach in this direction?

Best

Markus

Hi Markus!    YES I have been considering extreme quantiles, but only slightly.
BUT,  I have NO knowledge about extreme quantiles.
I think that it is important to have a GOOD knowledge about the user problem.

In some ways they are easy, in other aspects they are difficult. Please describe the situation!

 

Please send me papers or reference to where I can download the papers.

I will read them, in due time. I have a cold  right now.

 

Likewise, I think that the SLOT-method can be used with parallel computers and/or streaming data.

BUT I do not have any knowledge of that.
/Br Anders      my email:   anders  dot skollermo   at one  dot   se

Hi Anders,

I work in Risk Management, where for example 99.9% quantiles of loss distributions (VaR = Value at Risk) are of interest. But the changes of stock prices, per hour or per second, are also of interest.

Best

Markus

Hi!  You do not specify any details, so the only thing I can do is to give a fast "Coffee Break Answer":

     Suppose you have 10 giga data values. Estimate the 99.9% quantile VERY approximately, using a small sample of values. Then read the values, note the number of values, and throw away every value that is smaller than the estimated 99.9% quantile value. (That estimate was perhaps "rather incorrect" - say it was really the 99% quantile.)

 

       Anyway, you end up with 1% of 10 G, which is 10**8 values, or 100 M values. Since you know the total number of values (exactly 10 G), you also know that the smallest value in the 100 M data set corresponds to 99.0% in the overall quantile calculation. So use the SLOT-method to calculate all the quantiles that you want.

      If the range of data values is VERY big, you can perhaps ask a computer specialist to write a small Assembler program: look at a data value and, if the exponent (base 8, or base 16 on IBM computers) is too small, delete it after counting it.

         This method is a little more than one-pass. The Assembler routine is good for speeding it up.


Do you follow me? Any questions?
This is an easy problem, since you ONLY want the extreme quantiles.

The harder problem is to calculate extreme quantiles AND normal quantiles, together.

/Br Anders

 

Hi MarkusWeick!   I studied Risk Theory under Anders Martin-Löf at Stockholm University, many years ago. I added some parts to the course. We have published the Lecture Notes together. Only 50 pages, but the text is rather deep. See Arxiv.org  /Br Anders

