Hi,
I want to select a random sample of 10 thousand obs. with replacement from a given dataset (with 20 thousand obs). First, I try different approaches in SAS and compare the medians. I try 5 different approaches (code below), 3 of them give one median value, 2 give another median value. Why is there a difference? Shouldn't the medians be the same given the relatively large sample?
/* Create dataset to sample from*/
data ds;
do i= 1 to 20000;
x = rand("T", 1);
output;
end;
drop i;
run;
/* 1. Surveyselect + reps*/
proc surveyselect data = ds ranuni sampsize = 1000 reps = 10 seed = 12345 method = urs out = sample1 (keep = Replicate x:) noprint outhits;
run;
proc means data = sample1 noprint;
var x;
output out = med_1 (drop = _TYPE_ _FREQ_) median= median_svy1;
run;
/* 2. Surveyselect without reps*/
proc surveyselect data = ds ranuni sampsize = 10000 seed = 12345 method = urs out = sample2 (keep = Replicate x:) noprint outhits;
run;
proc means data = sample2 noprint;
var x;
output out = med_2 (drop = _TYPE_ _FREQ_) median= median_svy2;
run;
/* 3. IML*/
proc iml;
use ds;
read all into ds_mat;
call randseed(12345);
s = sample(ds_mat, 10000)`;
create sample3 var {x};
append from s;
close sample3 ds;
proc means data = sample3 noprint;
var x;
output out = med_3 (drop = _TYPE_ _FREQ_) median= median_iml;
run;
/* 4. Data step*/
/* a. https://blogs.sas.com/content/iml/2014/01/29/sample-with-replacement-in-sas.html*/
sasfile ds load;
data sample4(drop=i);
call streaminit(12345);
do i = 1 to 10000;
p = ceil(NObs * rand("Uniform"));
set ds nobs=NObs point=p;
output;
end;
STOP;
run;
sasfile ds close;
proc means data = sample4 noprint;
var x;
output out = med_4 (drop = _TYPE_ _FREQ_) median= median_ds1;
run;
/* b. https://online.stat.psu.edu/stat482/book/export/html/660*/
data sample5;
choose=int(ranuni(12345)*n)+1;
set ds point=choose nobs=n;
i+1;
if i > 10000 then stop;
run;
proc means data = sample5 noprint;
var x;
output out = med_5 (drop = _TYPE_ _FREQ_) median= median_ds2;
run;
data medians;
merge med_:;
run;
proc print data = medians;
run;
I've also tried to do the same in R (using sample function) and Python (using random.choices) and I got even different results for median:
R: 0.009
Python: -0.02848
So, I'm totally confused. Which result is correct? The background for my question is that I'm trying to replicate in SAS some analysis done in R that includes sampling and I get completely different results, so I want to understand the differences.
To get the same answers, you must use the same random sample. In the PROC SURVEYSELECT calls, you should delete the RANUNI option. The RANUNI option uses an old 1970s-style random number generator (RNG) instead of the modern RNG that is used by the RAND function. Similarly, you should replace the RANUNI call in the second DATA step (the one from psu.edu) with a call to RAND. If you do that, you will get the same random sample and, consequently, the same median, in each case:
/* Create dataset to sample from*/
data ds;
do i= 1 to 20000;
x = rand("T", 1);
output;
end;
drop i;
run;
/* 1. Surveyselect + reps*/
proc surveyselect data = ds sampsize = 1000
reps = 10
seed = 12345 method = urs
out = sample1 (keep = Replicate x:) noprint outhits;
run;
proc means data = sample1 median;
var x;
output out = med_1 (drop = _TYPE_ _FREQ_) median= median_svy1;
run;
/* 2. Surveyselect without reps*/
proc surveyselect data = ds sampsize = 10000
seed = 12345 method = urs out = sample2 (keep = Replicate x:)
noprint outhits;
run;
proc means data = sample2 median;
var x;
output out = med_2 (drop = _TYPE_ _FREQ_) median= median_svy2;
run;
proc iml;
use ds;
read all into ds_mat;
call randseed(12345);
s = sample(ds_mat, 10000)`;
create sample3 var {x};
append from s;
close sample3 ds;
quit;
proc means data = sample3 median;
var x;
output out = med_3 (drop = _TYPE_ _FREQ_) median= median_iml;
run;
/* a. https://blogs.sas.com/content/iml/2014/01/29/sample-with-replacement-in-sas.html*/
sasfile ds load;
data sample4(drop=i);
call streaminit(12345);
do i = 1 to 10000;
p = ceil(NObs * rand("Uniform"));
set ds nobs=NObs point=p;
output;
end;
STOP;
run;
sasfile ds close;
proc means data = sample4 median;
var x;
output out = med_4 (drop = _TYPE_ _FREQ_) median= median_ds1;
run;
/* b. https://online.stat.psu.edu/stat482/book/export/html/660*/
data sample5;
call streaminit(12345);
choose=int(rand("Uniform")*n)+1;
set ds point=choose nobs=n;
i+1;
if i > 10000 then stop;
run;
proc means data = sample5 median;
var x;
output out = med_5 (drop = _TYPE_ _FREQ_) median= median_ds2;
run;
data medians;
merge med_:;
run;
proc print data = medians;
run;
> Which result is correct?
All results are equally correct. You are asking for the median of a random sample. If you change the random sample, you will get a different median.
> I'm trying to replicate in SAS some analysis done in R that includes sampling and I get completely different results
I don't know what you mean by "completely different results." You will get different results when you use different software because the RNGs are different even if you use the same seed. Also, as KSharp mentions, there will be minor differences due to different default definitions for the quantiles. But you shouldn't be getting "completely different" results. The results should be similar, and should all be within a few standard errors of 0, which is the median for the population.
Rule #1 for trying to figure out what is happening ... LOOK AT the data with your own eyes. If you simply look at SAMPLE1 and SAMPLE3, you will see there are differences between the values of X that are generated.
To get the same answers, you must use the same random sample. In the PROC SURVEYSELECT calls, you should delete the RANUNI option. The RANUNI option uses an old 1970s-style random number generator (RNG) instead of the modern RNG that is used by the RAND function. Similarly, you should replace the RANUNI call in the second DATA step (the one from psu.edu) with a call to RAND. If you do that, you will get the same random sample and, consequently, the same median, in each case:
/* Create dataset to sample from*/
data ds;
do i= 1 to 20000;
x = rand("T", 1);
output;
end;
drop i;
run;
/* 1. Surveyselect + reps*/
proc surveyselect data = ds sampsize = 1000
reps = 10
seed = 12345 method = urs
out = sample1 (keep = Replicate x:) noprint outhits;
run;
proc means data = sample1 median;
var x;
output out = med_1 (drop = _TYPE_ _FREQ_) median= median_svy1;
run;
/* 2. Surveyselect without reps*/
proc surveyselect data = ds sampsize = 10000
seed = 12345 method = urs out = sample2 (keep = Replicate x:)
noprint outhits;
run;
proc means data = sample2 median;
var x;
output out = med_2 (drop = _TYPE_ _FREQ_) median= median_svy2;
run;
proc iml;
use ds;
read all into ds_mat;
call randseed(12345);
s = sample(ds_mat, 10000)`;
create sample3 var {x};
append from s;
close sample3 ds;
quit;
proc means data = sample3 median;
var x;
output out = med_3 (drop = _TYPE_ _FREQ_) median= median_iml;
run;
/* a. https://blogs.sas.com/content/iml/2014/01/29/sample-with-replacement-in-sas.html*/
sasfile ds load;
data sample4(drop=i);
call streaminit(12345);
do i = 1 to 10000;
p = ceil(NObs * rand("Uniform"));
set ds nobs=NObs point=p;
output;
end;
STOP;
run;
sasfile ds close;
proc means data = sample4 median;
var x;
output out = med_4 (drop = _TYPE_ _FREQ_) median= median_ds1;
run;
/* b. https://online.stat.psu.edu/stat482/book/export/html/660*/
data sample5;
call streaminit(12345);
choose=int(rand("Uniform")*n)+1;
set ds point=choose nobs=n;
i+1;
if i > 10000 then stop;
run;
proc means data = sample5 median;
var x;
output out = med_5 (drop = _TYPE_ _FREQ_) median= median_ds2;
run;
data medians;
merge med_:;
run;
proc print data = medians;
run;
> Which result is correct?
All results are equally correct. You are asking for the median of a random sample. If you change the random sample, you will get a different median.
> I'm trying to replicate in SAS some analysis done in R that includes sampling and I get completely different results
I don't know what you mean by "completely different results." You will get different results when you use different software because the RNGs are different even if you use the same seed. Also, as KSharp mentions, there will be minor differences due to different default definitions for the quantiles. But you shouldn't be getting "completely different" results. The results should be similar, and should all be within a few standard errors of 0, which is the median for the population.
Thanks @Rick_SAS @PaigeMiller @Ksharp for suggestions!
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.