chris2377
Quartz | Level 8

Hi,

 

I want to draw a random sample of 10,000 observations, with replacement, from a dataset of 20,000 observations. I tried five different approaches in SAS (code below) and compared the medians: three of them give one median value and two give another. Why is there a difference? Shouldn't the medians be the same, given the relatively large sample?

 

/* Create dataset to sample from*/
data ds;
   do i= 1 to 20000;
      x = rand("T", 1);
      output;
   end;
   drop i;
run;

/* 1. Surveyselect + reps*/
proc surveyselect data = ds ranuni sampsize = 1000
     reps = 10
     seed = 12345 method = urs
     out = sample1 (keep = Replicate x:) noprint outhits;
run;

proc means data = sample1 noprint;
   var x;
   output out = med_1 (drop = _TYPE_ _FREQ_) median= median_svy1;
run;

/* 2. Surveyselect without reps*/
proc surveyselect data = ds ranuni sampsize = 10000
     seed = 12345 method = urs out = sample2 (keep = Replicate x:)
     noprint outhits;
run;

proc means data = sample2 noprint;
   var x;
   output out = med_2 (drop = _TYPE_ _FREQ_) median= median_svy2;
run;

/* 3. IML*/
proc iml;
   use ds;
   read all into ds_mat;
   call randseed(12345);
   s = sample(ds_mat, 10000)`;
   create sample3 var {x};
   append from s;
   close sample3 ds;

proc means data = sample3 noprint;
   var x;
   output out = med_3 (drop = _TYPE_ _FREQ_) median= median_iml;
run;

/* 4. Data step*/
/* a. https://blogs.sas.com/content/iml/2014/01/29/sample-with-replacement-in-sas.html*/
sasfile ds load;
data sample4(drop=i);
   call streaminit(12345);
   do i = 1 to 10000;
      p = ceil(NObs * rand("Uniform"));
      set ds nobs=NObs point=p;
      output;
   end;
   STOP;
run;
sasfile ds close;

proc means data = sample4 noprint;
   var x;
   output out = med_4 (drop = _TYPE_ _FREQ_) median= median_ds1;
run;

/* b. https://online.stat.psu.edu/stat482/book/export/html/660*/
data sample5;
   choose=int(ranuni(12345)*n)+1;
   set ds point=choose nobs=n;
   i+1;
   if i > 10000 then stop;
run;

proc means data = sample5 noprint;
   var x;
   output out = med_5 (drop = _TYPE_ _FREQ_) median= median_ds2;
run;

data medians;
   merge med_:;
run;

proc print data = medians;
run;

I also tried the same thing in R (using the sample function) and in Python (using random.choices), and got yet other values for the median:
R: 0.009
Python: -0.02848

So, I'm totally confused. Which result is correct? The background for my question is that I'm trying to replicate in SAS some analysis done in R that includes sampling and I get completely different results, so I want to understand the differences.

1 ACCEPTED SOLUTION

Rick_SAS
SAS Super FREQ

To get the same answers, you must use the same random sample. In the PROC SURVEYSELECT calls, you should delete the RANUNI option. The RANUNI option uses an old 1970s-style random number generator (RNG) instead of the modern RNG that is used by the RAND function. Similarly, you should replace the RANUNI call in the second DATA step (the one from psu.edu) with a call to RAND. If you do that, you will get the same random sample and, consequently, the same median, in each case:

 

/* Create dataset to sample from*/
data ds;
	do i= 1 to 20000;
		x = rand("T", 1);
		output;
	end;
	drop i;
run;

/* 1. Surveyselect + reps*/
proc surveyselect data = ds  sampsize = 1000 
     reps = 10 
     seed = 12345 method = urs 
     out = sample1 (keep = Replicate x:) noprint outhits;
run;

proc means data = sample1 median;
	var x;
	output out = med_1 (drop = _TYPE_ _FREQ_) median= median_svy1;
run;

/* 2. Surveyselect without reps*/
proc surveyselect data = ds  sampsize = 10000 
     seed = 12345 method = urs out = sample2 (keep = Replicate x:) 
     noprint outhits;
run;

proc means data = sample2 median;
	var x;
	output out = med_2 (drop = _TYPE_ _FREQ_) median= median_svy2;
run;


/* 3. IML*/
proc iml;
	use ds;
  	read all into ds_mat;
	call randseed(12345);
	s = sample(ds_mat, 10000)`; 
	create sample3 var {x};
	append from s;
	close sample3 ds;
quit;
	
proc means data = sample3 median;
	var x;
	output out = med_3 (drop = _TYPE_ _FREQ_) median= median_iml;
run;



/* 4. Data step*/
/* a. https://blogs.sas.com/content/iml/2014/01/29/sample-with-replacement-in-sas.html*/
sasfile ds load;
data sample4(drop=i);
call streaminit(12345);
do i = 1 to 10000;         
   p = ceil(NObs * rand("Uniform"));
   set ds nobs=NObs point=p;
   output;
end;
STOP;
run;
sasfile ds close;

proc means data = sample4 median;
	var x;
	output out = med_4 (drop = _TYPE_ _FREQ_) median= median_ds1;
run;


/* b. https://online.stat.psu.edu/stat482/book/export/html/660*/
data sample5;
   call streaminit(12345);
	choose=int(rand("Uniform")*n)+1;
	set ds point=choose nobs=n;
	i+1;
	if i > 10000 then stop;
run;

proc means data = sample5 median;
	var x;
	output out = med_5 (drop = _TYPE_ _FREQ_) median= median_ds2;
run;

data medians;
	merge med_:;
run;

proc print data = medians;
run;
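
The point about RANUNI versus RAND can also be seen in isolation. The following is a minimal sketch, assuming the same seed value (12345) as in the code above: the legacy RANUNI function and the RAND function are different generators, so the same seed gives different streams and therefore different samples. The names rng_compare, u_old, and u_new are made up for this illustration.

```sas
/* Compare the first few uniform draws from the legacy and modern generators */
data rng_compare;
   call streaminit(12345);        /* seeds the RAND function only */
   do i = 1 to 5;
      u_old = ranuni(12345);      /* legacy generator; the seed takes effect on the first call */
      u_new = rand("Uniform");    /* generator used by the RAND function */
      output;
   end;
run;

proc print data = rng_compare noobs;
run;
```

The two columns should not match. Any sampling code built on one generator will therefore select different observations than code built on the other.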

> Which result is correct?

All results are equally correct. You are asking for the median of a random sample. If you change the random sample, you will get a different median.
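
One way to see this is to draw many samples and look at how much the median moves from sample to sample. The sketch below rebuilds the ds dataset from the original post and reuses the SURVEYSELECT settings from above. The seed (2024), the number of replicates (100), and the names boot, boot_med, and med are arbitrary choices for illustration.

```sas
/* Population to sample from, as in the original post */
data ds;
   do i = 1 to 20000;
      x = rand("T", 1);
      output;
   end;
   drop i;
run;

/* 100 independent samples of size 10000, each drawn with replacement */
proc surveyselect data = ds method = urs sampsize = 10000 reps = 100
     seed = 2024 out = boot (keep = Replicate x) outhits noprint;
run;

/* Median of each replicate */
proc means data = boot noprint;
   by Replicate;
   var x;
   output out = boot_med (drop = _TYPE_ _FREQ_) median = med;
run;

/* Spread of the 100 medians */
proc means data = boot_med n mean std min max;
   var med;
run;
```

The standard deviation in the last table estimates how much the median of one sample can be expected to differ from the median of another.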

 

> I'm trying to replicate in SAS some analysis done in R that includes sampling and I get completely different results

I don't know what you mean by "completely different results."  You will get different results when you use different software because the RNGs are different even if you use the same seed. Also, as KSharp mentions, there will be minor differences due to different default definitions for the quantiles. But you shouldn't be getting "completely different" results. The results should be similar, and should all be within a few standard errors of 0, which is the median for the population.
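
As a rough check on "a few standard errors": the large-sample standard error of a sample median is 1/(2 f(m) sqrt(n)), where f(m) is the population density at the median. For the t distribution with 1 degree of freedom (the Cauchy distribution), f(0) = 1/pi, so the standard error is about pi/(2 sqrt(n)). This is a textbook approximation, not something stated in the thread. It ignores the extra variability that comes from resampling a finite dataset, so it likely understates the true spread. The sketch below applies it to the R and Python medians reported in the question; the variable names are made up.

```sas
data _null_;
   n    = 10000;                              /* sample size from the question */
   se   = constant("pi") / (2 * sqrt(n));     /* asymptotic SE of the median for t(1) */
   r_z  = 0.009    / se;                      /* R median, in units of SE */
   py_z = -0.02848 / se;                      /* Python median, in units of SE */
   put se= r_z= py_z=;                        /* values are written to the log */
run;
```

This gives a standard error of about 0.0157, which puts the R value roughly 0.6 and the Python value roughly 1.8 standard errors from 0. Both are unremarkable.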


4 REPLIES
Ksharp
Super User
I think there are different definitions (or algorithms) for the median. As with quantiles in general, there are 9 different methods for calculating it.
@Rick_SAS wrote a blog post about this, and I think Rick could give you more details.

https://blogs.sas.com/content/iml/2017/05/24/definitions-sample-quantiles.html

https://blogs.sas.com/content/iml/2021/07/26/compare-quantiles-sas-r-python.html
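
As a small illustration of the point about definitions, PROC MEANS accepts a QNTLDEF= option that selects among five quantile definitions (the default is 5). The sketch below uses a made-up five-value dataset, named tiny here, and prints the first quartile and the median under two of those definitions.

```sas
/* Hypothetical toy data, small enough to check by hand */
data tiny;
   input x @@;
   datalines;
1 2 4 8 16
;
run;

/* Default definition: empirical distribution function with averaging */
proc means data = tiny q1 median qntldef = 5;
   var x;
run;

/* Alternative definition: weighted average at x(np) */
proc means data = tiny q1 median qntldef = 1;
   var x;
run;
```

The two runs should report different values on this dataset. With samples of 10,000 observations, the gap between definitions is expected to be far smaller than the sample-to-sample variation.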



PaigeMiller
Diamond | Level 26

Rule #1 for trying to figure out what is happening ... LOOK AT the data with your own eyes. If you simply look at SAMPLE1 and SAMPLE3, you will see there are differences between the values of X that are generated.

--
Paige Miller
