BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
chris2377
Quartz | Level 8

Hi,

 

I want to select a random sample of 10 thousand obs. with replacement from a given dataset (with 20 thousand obs). First, I try different approaches in SAS and compare the medians. I try 5 different approaches (code below), 3 of them give one median value, 2 give another median value. Why is there a difference? Shouldn't the medians be the same given the relatively large sample?

 

/* Create dataset to sample from*/
data ds; do i= 1 to 20000; x = rand("T", 1); output; end; drop i; run; /* 1. Surveyselect + reps*/ proc surveyselect data = ds ranuni sampsize = 1000 reps = 10 seed = 12345 method = urs out = sample1 (keep = Replicate x:) noprint outhits; run; proc means data = sample1 noprint; var x; output out = med_1 (drop = _TYPE_ _FREQ_) median= median_svy1; run; /* 2. Surveyselect without reps*/ proc surveyselect data = ds ranuni sampsize = 10000 seed = 12345 method = urs out = sample2 (keep = Replicate x:) noprint outhits; run; proc means data = sample2 noprint; var x; output out = med_2 (drop = _TYPE_ _FREQ_) median= median_svy2; run; /* 3. IML*/ proc iml; use ds; read all into ds_mat; call randseed(12345); s = sample(ds_mat, 10000)`; create sample3 var {x}; append from s; close sample3 ds; proc means data = sample3 noprint; var x; output out = med_3 (drop = _TYPE_ _FREQ_) median= median_iml; run; /* 4. Data step*/ /* a. https://blogs.sas.com/content/iml/2014/01/29/sample-with-replacement-in-sas.html*/ sasfile ds load; data sample4(drop=i); call streaminit(12345); do i = 1 to 10000; p = ceil(NObs * rand("Uniform")); set ds nobs=NObs point=p; output; end; STOP; run; sasfile ds close; proc means data = sample4 noprint; var x; output out = med_4 (drop = _TYPE_ _FREQ_) median= median_ds1; run; /* b. https://online.stat.psu.edu/stat482/book/export/html/660*/ data sample5; choose=int(ranuni(12345)*n)+1; set ds point=choose nobs=n; i+1; if i > 10000 then stop; run; proc means data = sample5 noprint; var x; output out = med_5 (drop = _TYPE_ _FREQ_) median= median_ds2; run; data medians; merge med_:; run; proc print data = medians; run;

I've also tried to do the same in R (using sample function) and Python (using random.choices) and I got even different results for median:
R: 0.009
Python: -0.02848

So, I'm totally confused. Which result is correct? The background for my question is that I'm trying to replicate in SAS some analysis done in R that includes sampling and I get completely different results, so I want to understand the differences.

1 ACCEPTED SOLUTION

Accepted Solutions
Rick_SAS
SAS Super FREQ

To get the same answers, you must use the same random sample. In the PROC SURVEYSELECT calls, you should delete the RANUNI option. The RANUNI option uses an old 1970s-style random number generator (RNG) instead of the modern RNG that is used by the RAND function. Similarly, you should replace the RANUNI call in the second DATA step (the one from psu.edu) with a call to RAND. If you do that, you will get the same random sample and, consequently, the same median, in each case:

 

/* Create dataset to sample from*/
data ds;
	do i= 1 to 20000;
		x = rand("T", 1);
		output;
	end;
	drop i;
run;

/* 1. Surveyselect + reps*/
proc surveyselect data = ds  sampsize = 1000 
     reps = 10 
     seed = 12345 method = urs 
     out = sample1 (keep = Replicate x:) noprint outhits;
run;

proc means data = sample1 median;
	var x;
	output out = med_1 (drop = _TYPE_ _FREQ_) median= median_svy1;
run;

/* 2. Surveyselect without reps*/
proc surveyselect data = ds  sampsize = 10000 
     seed = 12345 method = urs out = sample2 (keep = Replicate x:) 
     noprint outhits;
run;

proc means data = sample2 median;
	var x;
	output out = med_2 (drop = _TYPE_ _FREQ_) median= median_svy2;
run;


proc iml;
	use ds;
  	read all into ds_mat;
	call randseed(12345);
	s = sample(ds_mat, 10000)`; 
	create sample3 var {x};
	append from s;
	close sample3 ds;
quit;
	
proc means data = sample3 median;
	var x;
	output out = med_3 (drop = _TYPE_ _FREQ_) median= median_iml;
run;



/* a. https://blogs.sas.com/content/iml/2014/01/29/sample-with-replacement-in-sas.html*/
sasfile ds load;
data sample4(drop=i);
call streaminit(12345);
do i = 1 to 10000;         
   p = ceil(NObs * rand("Uniform"));
   set ds nobs=NObs point=p;
   output;
end;
STOP;
run;
sasfile ds close;

proc means data = sample4 median;
	var x;
	output out = med_4 (drop = _TYPE_ _FREQ_) median= median_ds1;
run;


/* b. https://online.stat.psu.edu/stat482/book/export/html/660*/
data sample5;
   call streaminit(12345);
	choose=int(rand("Uniform")*n)+1;
	set ds point=choose nobs=n;
	i+1;
	if i > 10000 then stop;
run;

proc means data = sample5 median;
	var x;
	output out = med_5 (drop = _TYPE_ _FREQ_) median= median_ds2;
run;

data medians;
	merge med_:;
run;

proc print data = medians;
run;

Which result is correct?

All results are equally correct. You are asking for the median of a random sample. If you change the random sample, you will get a different median.

 

> I'm trying to replicate in SAS some analysis done in R that includes sampling and I get completely different results

I don't know what you mean by "completely different results."  You will get different results when you use different software because the RNGs are different even if you use the same seed. Also, as KSharp mentions, there will be minor differences due to different default definitions for the quantiles. But you shouldn't be getting "completely different" results. The results should be similar, and should all be within a few standard errors of 0, which is the median for the population.

View solution in original post

4 REPLIES 4
Ksharp
Super User
I think there are different definitions(or algorithm ) for median . Like quantiles there are 9 different method to calculate it.
@Rick_SAS wrote a blog about it before. and I think Rick could give you more details .

https://blogs.sas.com/content/iml/2017/05/24/definitions-sample-quantiles.html

https://blogs.sas.com/content/iml/2021/07/26/compare-quantiles-sas-r-python.html



PaigeMiller
Diamond | Level 26

Rule #1 for trying to figure out what is happening ... LOOK AT the data with your own eyes. If you simply look at SAMPLE1 and SAMPLE3, you will see there are differences between the values of X that are generated.

--
Paige Miller
Rick_SAS
SAS Super FREQ

To get the same answers, you must use the same random sample. In the PROC SURVEYSELECT calls, you should delete the RANUNI option. The RANUNI option uses an old 1970s-style random number generator (RNG) instead of the modern RNG that is used by the RAND function. Similarly, you should replace the RANUNI call in the second DATA step (the one from psu.edu) with a call to RAND. If you do that, you will get the same random sample and, consequently, the same median, in each case:

 

/* Create dataset to sample from*/
data ds;
	do i= 1 to 20000;
		x = rand("T", 1);
		output;
	end;
	drop i;
run;

/* 1. Surveyselect + reps*/
proc surveyselect data = ds  sampsize = 1000 
     reps = 10 
     seed = 12345 method = urs 
     out = sample1 (keep = Replicate x:) noprint outhits;
run;

proc means data = sample1 median;
	var x;
	output out = med_1 (drop = _TYPE_ _FREQ_) median= median_svy1;
run;

/* 2. Surveyselect without reps*/
proc surveyselect data = ds  sampsize = 10000 
     seed = 12345 method = urs out = sample2 (keep = Replicate x:) 
     noprint outhits;
run;

proc means data = sample2 median;
	var x;
	output out = med_2 (drop = _TYPE_ _FREQ_) median= median_svy2;
run;


proc iml;
	use ds;
  	read all into ds_mat;
	call randseed(12345);
	s = sample(ds_mat, 10000)`; 
	create sample3 var {x};
	append from s;
	close sample3 ds;
quit;
	
proc means data = sample3 median;
	var x;
	output out = med_3 (drop = _TYPE_ _FREQ_) median= median_iml;
run;



/* a. https://blogs.sas.com/content/iml/2014/01/29/sample-with-replacement-in-sas.html*/
sasfile ds load;
data sample4(drop=i);
call streaminit(12345);
do i = 1 to 10000;         
   p = ceil(NObs * rand("Uniform"));
   set ds nobs=NObs point=p;
   output;
end;
STOP;
run;
sasfile ds close;

proc means data = sample4 median;
	var x;
	output out = med_4 (drop = _TYPE_ _FREQ_) median= median_ds1;
run;


/* b. https://online.stat.psu.edu/stat482/book/export/html/660*/
data sample5;
   call streaminit(12345);
	choose=int(rand("Uniform")*n)+1;
	set ds point=choose nobs=n;
	i+1;
	if i > 10000 then stop;
run;

proc means data = sample5 median;
	var x;
	output out = med_5 (drop = _TYPE_ _FREQ_) median= median_ds2;
run;

data medians;
	merge med_:;
run;

proc print data = medians;
run;

Which result is correct?

All results are equally correct. You are asking for the median of a random sample. If you change the random sample, you will get a different median.

 

> I'm trying to replicate in SAS some analysis done in R that includes sampling and I get completely different results

I don't know what you mean by "completely different results."  You will get different results when you use different software because the RNGs are different even if you use the same seed. Also, as KSharp mentions, there will be minor differences due to different default definitions for the quantiles. But you shouldn't be getting "completely different" results. The results should be similar, and should all be within a few standard errors of 0, which is the median for the population.

SAS INNOVATE 2024

Innovate_SAS_Blue.png

Registration is open! SAS is returning to Vegas for an AI and analytics experience like no other! Whether you're an executive, manager, end user or SAS partner, SAS Innovate is designed for everyone on your team. Register for just $495 by 12/31/2023.

If you are interested in speaking, there is still time to submit a session idea. More details are posted on the website. 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Get the $99 certification deal.jpg

 

 

Back in the Classroom!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 4 replies
  • 266 views
  • 3 likes
  • 4 in conversation