Turn on suggestions

Auto-suggest helps you quickly narrow down your search results by suggesting possible matches as you type.

Showing results for

- Home
- /
- Programming
- /
- Programming
- /
- Sampling with replacement - different methods, different results?

Options

- RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Mute
- Printer Friendly Page

☑ This topic is **solved**.
Need further help from the community? Please
sign in and ask a **new** question.

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Posted 04-03-2023 07:46 AM
(265 views)

Hi,

I want to select a random sample of 10 thousand obs. with replacement from a given dataset (with 20 thousand obs). First, I try different approaches in SAS and compare the medians. I try 5 different approaches (code below), 3 of them give one median value, 2 give another median value. Why is there a difference? Shouldn't the medians be the same given the relatively large sample?

`/* Create dataset to sample from*/`

data ds;
do i= 1 to 20000;
x = rand("T", 1);
output;
end;
drop i;
run;
/* 1. Surveyselect + reps*/
proc surveyselect data = ds ranuni sampsize = 1000 reps = 10 seed = 12345 method = urs out = sample1 (keep = Replicate x:) noprint outhits;
run;
proc means data = sample1 noprint;
var x;
output out = med_1 (drop = _TYPE_ _FREQ_) median= median_svy1;
run;
/* 2. Surveyselect without reps*/
proc surveyselect data = ds ranuni sampsize = 10000 seed = 12345 method = urs out = sample2 (keep = Replicate x:) noprint outhits;
run;
proc means data = sample2 noprint;
var x;
output out = med_2 (drop = _TYPE_ _FREQ_) median= median_svy2;
run;
/* 3. IML*/
proc iml;
use ds;
read all into ds_mat;
call randseed(12345);
s = sample(ds_mat, 10000)`;
create sample3 var {x};
append from s;
close sample3 ds;
proc means data = sample3 noprint;
var x;
output out = med_3 (drop = _TYPE_ _FREQ_) median= median_iml;
run;
/* 4. Data step*/
/* a. https://blogs.sas.com/content/iml/2014/01/29/sample-with-replacement-in-sas.html*/
sasfile ds load;
data sample4(drop=i);
call streaminit(12345);
do i = 1 to 10000;
p = ceil(NObs * rand("Uniform"));
set ds nobs=NObs point=p;
output;
end;
STOP;
run;
sasfile ds close;
proc means data = sample4 noprint;
var x;
output out = med_4 (drop = _TYPE_ _FREQ_) median= median_ds1;
run;
/* b. https://online.stat.psu.edu/stat482/book/export/html/660*/
data sample5;
choose=int(ranuni(12345)*n)+1;
set ds point=choose nobs=n;
i+1;
if i > 10000 then stop;
run;
proc means data = sample5 noprint;
var x;
output out = med_5 (drop = _TYPE_ _FREQ_) median= median_ds2;
run;
data medians;
merge med_:;
run;
proc print data = medians;
run;

I've also tried to do the same in R (using sample function) and Python (using random.choices) and I got even different results for median:

R: 0.009

Python: -0.02848

So, I'm totally confused. Which result is correct? The background for my question is that I'm trying to replicate in SAS some analysis done in R that includes sampling and I get completely different results, so I want to understand the differences.

1 ACCEPTED SOLUTION

Accepted Solutions

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

To get the same answers, you must use the same random sample. In the PROC SURVEYSELECT calls, you should delete the RANUNI option. The RANUNI option uses an old 1970s-style random number generator (RNG) instead of the modern RNG that is used by the RAND function. Similarly, you should replace the RANUNI call in the second DATA step (the one from psu.edu) with a call to RAND. If you do that, you will get the same random sample and, consequently, the same median, in each case:

```
/* Create dataset to sample from*/
data ds;
do i= 1 to 20000;
x = rand("T", 1);
output;
end;
drop i;
run;
/* 1. Surveyselect + reps*/
proc surveyselect data = ds sampsize = 1000
reps = 10
seed = 12345 method = urs
out = sample1 (keep = Replicate x:) noprint outhits;
run;
proc means data = sample1 median;
var x;
output out = med_1 (drop = _TYPE_ _FREQ_) median= median_svy1;
run;
/* 2. Surveyselect without reps*/
proc surveyselect data = ds sampsize = 10000
seed = 12345 method = urs out = sample2 (keep = Replicate x:)
noprint outhits;
run;
proc means data = sample2 median;
var x;
output out = med_2 (drop = _TYPE_ _FREQ_) median= median_svy2;
run;
proc iml;
use ds;
read all into ds_mat;
call randseed(12345);
s = sample(ds_mat, 10000)`;
create sample3 var {x};
append from s;
close sample3 ds;
quit;
proc means data = sample3 median;
var x;
output out = med_3 (drop = _TYPE_ _FREQ_) median= median_iml;
run;
/* a. https://blogs.sas.com/content/iml/2014/01/29/sample-with-replacement-in-sas.html*/
sasfile ds load;
data sample4(drop=i);
call streaminit(12345);
do i = 1 to 10000;
p = ceil(NObs * rand("Uniform"));
set ds nobs=NObs point=p;
output;
end;
STOP;
run;
sasfile ds close;
proc means data = sample4 median;
var x;
output out = med_4 (drop = _TYPE_ _FREQ_) median= median_ds1;
run;
/* b. https://online.stat.psu.edu/stat482/book/export/html/660*/
data sample5;
call streaminit(12345);
choose=int(rand("Uniform")*n)+1;
set ds point=choose nobs=n;
i+1;
if i > 10000 then stop;
run;
proc means data = sample5 median;
var x;
output out = med_5 (drop = _TYPE_ _FREQ_) median= median_ds2;
run;
data medians;
merge med_:;
run;
proc print data = medians;
run;
```

> *Which result is correct? *

All results are equally correct. You are asking for the median of a random sample. If you change the random sample, you will get a different median.

> *I'm trying to replicate in SAS some analysis done in R that includes sampling and I get completely different results*

I don't know what you mean by "completely different results." You will get different results when you use different software because the RNGs are different even if you use the same seed. Also, as KSharp mentions, there will be minor differences due to different default definitions for the quantiles. But you shouldn't be getting "completely different" results. The results should be similar, and should all be within a few standard errors of 0, which is the median for the population.

4 REPLIES 4

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

I think there are different definitions(or algorithm ) for median . Like quantiles there are 9 different method to calculate it.

@Rick_SAS wrote a blog about it before. and I think Rick could give you more details .

https://blogs.sas.com/content/iml/2017/05/24/definitions-sample-quantiles.html

https://blogs.sas.com/content/iml/2021/07/26/compare-quantiles-sas-r-python.html

@Rick_SAS wrote a blog about it before. and I think Rick could give you more details .

https://blogs.sas.com/content/iml/2017/05/24/definitions-sample-quantiles.html

https://blogs.sas.com/content/iml/2021/07/26/compare-quantiles-sas-r-python.html

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Rule #1 for trying to figure out what is happening ... **LOOK AT** the data with your own eyes. If you simply look at SAMPLE1 and SAMPLE3, you will see there are differences between the values of X that are generated.

--

Paige Miller

Paige Miller

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

To get the same answers, you must use the same random sample. In the PROC SURVEYSELECT calls, you should delete the RANUNI option. The RANUNI option uses an old 1970s-style random number generator (RNG) instead of the modern RNG that is used by the RAND function. Similarly, you should replace the RANUNI call in the second DATA step (the one from psu.edu) with a call to RAND. If you do that, you will get the same random sample and, consequently, the same median, in each case:

```
/* Create dataset to sample from*/
data ds;
do i= 1 to 20000;
x = rand("T", 1);
output;
end;
drop i;
run;
/* 1. Surveyselect + reps*/
proc surveyselect data = ds sampsize = 1000
reps = 10
seed = 12345 method = urs
out = sample1 (keep = Replicate x:) noprint outhits;
run;
proc means data = sample1 median;
var x;
output out = med_1 (drop = _TYPE_ _FREQ_) median= median_svy1;
run;
/* 2. Surveyselect without reps*/
proc surveyselect data = ds sampsize = 10000
seed = 12345 method = urs out = sample2 (keep = Replicate x:)
noprint outhits;
run;
proc means data = sample2 median;
var x;
output out = med_2 (drop = _TYPE_ _FREQ_) median= median_svy2;
run;
proc iml;
use ds;
read all into ds_mat;
call randseed(12345);
s = sample(ds_mat, 10000)`;
create sample3 var {x};
append from s;
close sample3 ds;
quit;
proc means data = sample3 median;
var x;
output out = med_3 (drop = _TYPE_ _FREQ_) median= median_iml;
run;
/* a. https://blogs.sas.com/content/iml/2014/01/29/sample-with-replacement-in-sas.html*/
sasfile ds load;
data sample4(drop=i);
call streaminit(12345);
do i = 1 to 10000;
p = ceil(NObs * rand("Uniform"));
set ds nobs=NObs point=p;
output;
end;
STOP;
run;
sasfile ds close;
proc means data = sample4 median;
var x;
output out = med_4 (drop = _TYPE_ _FREQ_) median= median_ds1;
run;
/* b. https://online.stat.psu.edu/stat482/book/export/html/660*/
data sample5;
call streaminit(12345);
choose=int(rand("Uniform")*n)+1;
set ds point=choose nobs=n;
i+1;
if i > 10000 then stop;
run;
proc means data = sample5 median;
var x;
output out = med_5 (drop = _TYPE_ _FREQ_) median= median_ds2;
run;
data medians;
merge med_:;
run;
proc print data = medians;
run;
```

> *Which result is correct? *

All results are equally correct. You are asking for the median of a random sample. If you change the random sample, you will get a different median.

> *I'm trying to replicate in SAS some analysis done in R that includes sampling and I get completely different results*

I don't know what you mean by "completely different results." You will get different results when you use different software because the RNGs are different even if you use the same seed. Also, as KSharp mentions, there will be minor differences due to different default definitions for the quantiles. But you shouldn't be getting "completely different" results. The results should be similar, and should all be within a few standard errors of 0, which is the median for the population.

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Thanks @Rick_SAS @PaigeMiller @Ksharp for suggestions!

Registration is open! SAS is returning to Vegas for an AI and analytics experience like no other! Whether you're an executive, manager, end user or SAS partner, SAS Innovate is designed for everyone on your team. Register for just $495 by 12/31/2023.

**If you are interested in speaking, there is still time to submit a session idea. More details are posted on the website. **

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.