vincent2
Calcite | Level 5

Hi, 

I am computing the median of a variable with a large number of observations (5,000,000 obs). I run PROC MEANS with qmethod=P2 to calculate the median, but the result is not a fixed value: it varies each time I run PROC MEANS.

Can anyone help me with this issue?

Thanks 

Vincent 

ballardw
Super User


Please show the code changes that change the value of the median.

Did you do anything that would change the order of the data between runs? That can affect the results.

Or did you change the QMARKERS value, or omit the option?

Example of 3 different medians from the same data using qmethod=P2.

data example;
   do i= 1 to 1000;
      output;
   end;
run;
proc means data=example qmethod=p2 median;
   title 'Without Qmarkers';
   var i;
run; title;
proc means data=example qmarkers=101 qmethod=p2 median;
   title 'Qmarkers=101';
   var i;
run;title;

proc sort data=example;
   by descending i;
run;
proc means data=example qmarkers=101 qmethod=p2 median;
   title 'Sorted in descending order Qmarkers=101';
   var i;
run;title;
vincent2
Calcite | Level 5

Hi ballardw,

 

Thanks a lot.

 

I tried two data sets, one with 10,000,000 obs and one with 1,000,000 obs, using the same PROC MEANS settings. For the data set example (10,000,000 obs), the median varies each time I run PROC MEANS with qmethod=P2. For the smaller data set example2 (1,000,000 obs), the median is the same on every run. The code is shown below. Please take a look. Thank you.

 

 

data example;
   do i= 1 to 10000000;
      output;
   end;
run;

proc means data=example qmethod=p2 median;
   title 'Without Qmarkers';
   var i;
run; title;

 

 

data example2;
   do i= 1 to 1000000;
      output;
   end;
run;

proc means data=example2 qmethod=p2 median;
   title 'Without Qmarkers';
   var i;
run; title;

akulkarni
Fluorite | Level 6

Hi Vincent!

 

Since you have a large data set, SAS may be taking a different and smaller sample of it each time it calculates the median (this depends on the version of SAS you have installed). Have you tried the QMARKERS= option yet? The number of markers controls the size of the fixed memory space; for the median, try setting QMARKERS=7. Please let me know if that works for you.

 

Good luck!

Anurag 

vincent2
Calcite | Level 5

Hi, akulkarni

 

Thanks for your reply.

 

I tried setting QMARKERS=7, but the median still varies each time I run PROC MEANS on my data set, which has a single variable with more than 10,000,000 values. The variation does not happen when I subset the data to fewer than 1,000,000 obs. My PROC MEANS step is shown below. Please take a look.

 

proc means data = mydata n median qmethod = P2;
   var score;
run;

 

Thanks

Vincent

ballardw
Super User

From the documentation:

Tip: Increase the number of markers above the default settings to improve the accuracy of the estimate; reduce the number of markers to conserve memory and computing time.

 

Since your code doesn't show QMARKERS=, it is using the default of 7. Try a larger (perhaps much larger) odd integer value. But P2 is a one-pass approximation rather than an exact calculation, so it never guarantees identical results, and the larger the data set, the less likely the results are to be the same.
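
A minimal sketch of the larger-QMARKERS suggestion, assuming the data set mydata and the variable score from the earlier post. The value 201 is an arbitrary odd integer picked for illustration, not a value recommended by the documentation, and there is no guarantee it removes the run-to-run variation.

```sas
/* Sketch only: mydata and score come from the earlier post.        */
/* QMARKERS=201 is an arbitrary odd value chosen for illustration.  */
proc means data=mydata n median qmethod=p2 qmarkers=201;
   var score;
run;
```

More markers use more memory and time, so it may be worth trying a few values and checking whether the estimate settles down.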

 

Why are you using qmethod=p2 in the first place? Do you not have enough memory?

Unless your data set is WAY bigger than 10,000,000 records or your computer is very slow, I would not worry about qmethod. On my computer, 10,000,000 records with qmethod=p2 and qmarkers=7 run in about 2 seconds, qmarkers=77 takes 15.2 seconds, and no qmethod (default OS) takes 13.38 seconds.

If I sort the data set first, the sort takes 5.47 seconds and the median calculation takes 7.0 seconds.

 

If you don't want the median to change (which I don't blame you for), you'll have to use the default method and accept the slightly longer run times.
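
One way to check this, sketched under the assumption that the data set is mydata and the variable is score as in the earlier post: run the default method twice, store each median, and compare. The output data set names run1 and run2 and the variable name median_score are hypothetical.

```sas
/* Sketch only: mydata and score come from the earlier post.      */
/* QMETHOD=OS is the default; it is written out here for clarity. */
proc means data=mydata noprint qmethod=os;
   var score;
   output out=run1 median=median_score;
run;

proc means data=mydata noprint qmethod=os;
   var score;
   output out=run2 median=median_score;
run;

/* Reports any difference between the two stored medians. */
proc compare base=run1 compare=run2;
run;
```

The same pattern with qmethod=p2 in both steps should show whether the P2 estimate differs between runs on your system.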

 

vincent2
Calcite | Level 5

Hi 

 

 
