Solved: Re: How to sample from a posterior distribution

supp · Posted 06-26-2022 08:55 AM

I am working through the book "Statistical Rethinking" by Richard McElreath. The following is a toy example from the book that uses grid approximation to get a posterior distribution then samples from the posterior. My question is what is an equivalent way to obtain the same posterior and similar sample in SAS?

Since I am more familiar with SAS I am doing this for understanding and fun.

The set up is we toss a globe and catch it 9 times. The tip of our right index finger will land on either land (L) or water (W). We observe L= 3 and W= 6. We are trying to determine the proportion of the globe that is water using a bayesian approach.

The R code:

# create a grid of possible parameter values (proportion of globe that is W)
p_grid <- seq(from= 0, to= 1, length.out= 1000)
p_grid[1000]

# Set prior
prob_p <- rep(1, l)

# Liklihood function for grid of possible parameter values
prob_data <- dbinom(6, size = 9, p_grid)

# combine prior with liklihood to get posterior
posterior <- prob_data * prob_p
sum(posterior)

#Standard posterior
posterior <- posterior/sum(posterior)

# Sample from posterior
samples <- sample(p_grid, prob = posterior, size = 1e4, replace = TRUE)

#plot samples
plot(samples)
plot(p_grid, posterior)

Here is what the resulting posterior distribution looks like from the R script.

Here is my attempt to recreate the process in SAS. The tricky part seems to be is how to sample (or run simulations) based on the posterior. I used PROC SURVEYSELECT.

/**
Bayes practice
**/

data _test1;
	p_grid= 0;											/** Set initial parameter value **/
	
	do i= 1 to 1000;
		
/*		p_grid = rand('uniform') ;	*/					/** Alternative approach is to just randomly generate a bunch of parameter values **/
		prob_p = 1 ; 									/** Set a prior **/
		prob_data = pdf('binomial', 6, p_grid, 9); 		/** Liklihood function, liklihood of data given parameter value **/
		posterior = prob_data * prob_p;
		output;
		p_grid + .001;									/** update p_grid **/
	end;

drop i;
run;


proc sql;
	select sum(posterior) into: sum_posterior from _test1;         /* Sum posterior to standardize */

	create table _test2 as
		select *, posterior/(&sum_posterior.) as posterior_s from _test1 order by p_grid;
quit;

# create macro to sample from dataset using standard posterior as the size. 
%macro sample(i);
	proc surveyselect data= _test2 method= pps_wr out= _sample1 n= 1 outhits noprint;
		size posterior_s;
	run;

	proc append base= all_simulations data= _sample1; run;
%mend sample;

proc delete data= all_simulations; run;

data _null_;

	%let sim_size = 10000;
	do i = 1 to &sim_size;
		call execute("%sample("||i||");");
	end;
run; 

proc sql;	
	create table _sim_sum1 as
		select p_grid, sum(numberhits) as freq, sum(numberhits) / &sim_size. as proportion from all_simulations group by p_grid;

	select sum(proportion) as sum_proportion from _sim_sum1;
quit;


proc sgplot data= _sim_sum1;
	series x= p_grid y= proportion;
run;

Here is the resulting distribution from the plot: (It doesn't look right)

I acknowledge I may be way off base with this approach. I am trying to understand this material and would appreciate someone showing me a proper way to do this in SAS.

Ksharp · Posted 06-28-2022 08:11 AM

Here are some sas base code if you don't have iml module.

%let size=1000;
data have;
call streaminit(123);
 do p_grid=0 to 1 by 1/(&Size.-1);
   prob_data=pdf('binomial', 6, p_grid,9);
   output;
 end;
run;
proc sql;
create table have2 as
 select *,prob_data/sum(prob_data) as posterior
  from have;
quit;
proc sgplot data=have2;
series x=p_grid y=posterior;
run;


proc surveyselect data=have2 out=want seed=123 sampsize=10000 method=pps_wr outhits;
size posterior;
run;
proc sgplot data=want;
histogram p_grid;
run;

View solution in original post

Ksharp · Posted 06-26-2022 11:30 PM

In SAS, if you want to do some data simulation by bayesian method, you could try PROC MCMC .

Or try to use SAS/IML , @Rick_SAS could help you something .

supp · Posted 06-27-2022 09:21 AM

I looked into PROC MCMC. I can't figure out what the MODEL statement would be. Maybe this is too simple of an example?

Ksharp · Posted 06-27-2022 06:07 AM

You want this ?

%let size=1000;
proc iml;
call randseed(123);
p_grid=randfun(&size.,'uniform');
prob_data=pdf('binomial', 6,p_grid,9);

prob_p=j(&size.,1);
posterior= prob_data # prob_p;
posterior=posterior/sum(posterior);

samples = sample(p_grid, 1e4,'replace',posterior) ;

create want var{ p_grid posterior };
append;
close;
quit;

proc sort data=want; by p_grid;run;
proc sgplot data=want;
 series x=p_grid y=posterior ;
run;

Rick_SAS · Posted 06-27-2022 06:48 AM

To mimic the R script, use KSharp's idea but use

p_grid=T( do(0,1, 1/(&Size.-1)) );

With this change, you can skip the PROC SORT step. I also inserted a histogram for the sample:

%let size=1000;
proc iml;
call randseed(123);
p_grid=T( do(0,1, 1/(&Size.-1)) ); 
prob_data=pdf('binomial', 6, p_grid,9);

prob_p=j(&size.,1);
posterior= prob_data # prob_p;
posterior=posterior/sum(posterior);
call series (p_grid, posterior);

samples = sample(p_grid, 1e4,'replace',posterior) ;
call histogram(samples);

supp · Posted 06-27-2022 07:21 AM

Thanks for the reply. When I try running PROC IML I get a message suggesting my shop does not have a license for it. It seems PROC IML lets us approach the problem very similar to R using vectors.

Rick_SAS · Posted 06-27-2022 08:03 AM

Yes. The IML language is similar to R in many ways. Both are high-level languages that support matrix-vector computations, a large library of computational routines, and enable you to extend the language by defining your own functions. As this example shows, you can often translate an R program to PROC IML and vice versa.

The IML language is also syntactically similar to MATLAB.

supp · Posted 06-27-2022 08:38 AM

I spent some time summarizing the results from the SAS method. I now think the approach I used and results are valid. I still think there is a more elegant way to achieve the same results.

Here are some summaries I made from sampling the posterior.


/** Add a cumulative proportion column to find confidence intervals **/
data _sample3;
	set _sample2;
	
		retain cumulative_prop;
		if _n_ = 1 then cumulative_prop = proportion;
		else cumulative_prop = cumulative_prop + proportion;
run;



proc sgplot data= _sample3;
	series x= p_grid y= proportion;
run;


proc sql;
/** Add up posterior probability where p < 0.5 */
	select sum(posterior_s) as sum_standard_post from _sample3 where p_grid < 0.5;
	select sum(proportion) as sum_proportion from _sample3 where p_grid < 0.5;

/** What proportion of water is compatible with greater than 0.5 and less than .75 probability? **/
	select sum(proportion) as sum_proportion from _sample3 where 0.5 < p_grid < 0.75;

/** What proportion of water is compatible with greater less than 80% probability? **/
	select max(p_grid) as p_grid from _sample3 where cumulative_prop < .8;

/** What proportion of water is compatible with middle 80% probability? **/
	select min(p_grid) as lower_parameter, max(p_grid) as upper_parameter from _sample3 where 0.1 < cumulative_prop < .9;

/** Point estimate of highest probabilty  **/
	select p_grid as point_estimate, proportion, cumulative_prop from _sample3 having max(proportion) = proportion;

quit;

/**
Proportion of water less than .5
	Posterior --> 0.166 probability
	Sample --> 0.178 probability

Proportion of water greater than .5 and less than .75
	sample --> .596 probability

Lower 80% posterior probablity when proprotion of water is .76

Middle 80% posterior probablity when proprotion of water is between .445 and .812

point estimate for most likely proportion of water --> .625 (about .4% probability)
**/

Ksharp · Posted 06-28-2022 08:11 AM

Here are some sas base code if you don't have iml module.

%let size=1000;
data have;
call streaminit(123);
 do p_grid=0 to 1 by 1/(&Size.-1);
   prob_data=pdf('binomial', 6, p_grid,9);
   output;
 end;
run;
proc sql;
create table have2 as
 select *,prob_data/sum(prob_data) as posterior
  from have;
quit;
proc sgplot data=have2;
series x=p_grid y=posterior;
run;


proc surveyselect data=have2 out=want seed=123 sampsize=10000 method=pps_wr outhits;
size posterior;
run;
proc sgplot data=want;
histogram p_grid;
run;

supp · Posted 06-29-2022 08:04 AM

Thanks @Ksharp, the code you provided gets a posterior distribution without using PROC IML. It is also a little cleaner than my code in the original post.