BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
daufoi
Fluorite | Level 6

I have this code to run a monte carlo for contingency tables for a large data set.

 

proc freq data=remove_CF;
	table Rows*Cols / norow nocol nopct;
	exact pchi / MC N=1000;     /* set sample size, default is 10k */
run;

It evaluates the chi-square statistic on each table for 1000 random tables that have the same row and column sum as the observed table. But I do not want them to have the same row and column sum as the observed table. I want to make 1000 random tables with 1000 random samples from the original data set. Is there a parameter that I can add to this code that would allow me to do that?

1 ACCEPTED SOLUTION

Accepted Solutions
Rick_SAS
SAS Super FREQ

OK. You didn't reference the blog post, so I didn't know if you had seen it.

 

The subsamples are not going to be good for much. For example, if you have rare events in the big data, you might have subsamples for which the rare events never appear.  Nevertheless, here's how you get the samples that you want: use the previous program that I posted for the Bootstrap, but use N=number instead of SAMPRATE=1 on the PROC SURVEYSELECT statement.

 

/* The basic bootstrap in SAS. 
   https://blogs.sas.com/content/iml/2016/08/10/bootstrap-confidence-interval-sas.html
*/
data Have(keep=x y);  /* data for crosstab */
   set Sashelp.Cars(rename=(Origin=x Type=y));
   if y^='Hybrid';
run;
 
proc freq data=Have;
tables x*y / chisq norow nocol nopercent;
run;

%let NumSamples = 1000;       /* number of resamples */
%let NSubset = 100;           /* size of each resample */

/* 2. Generate many bootstrap samples */
proc surveyselect data=Have NOPRINT seed=1
     out=BootSamp(rename=(Replicate=SampleID))
     method=urs              /* resample with replacement */
     N=&NSubset              /* each bootstrap sample has N observations */
     OUTHITS
     reps=&NumSamples;       /* generate NumSamples bootstrap resamples */
run;

/* 3. Compute the statistic for each bootstrap sample */
ods select none;
proc freq data=BootSamp;
   by SampleID;
   table x*y / chisq out=FreqOut;
   ods output ChiSq=ChiSq(where=(Statistic='Chi-Square'));
run;
ods select all;

/* distribution of chi-square statistics */
proc means data=ChiSq P5 Mean P95;
   var Value;
   label Value='Chi-Square';
run;

View solution in original post

4 REPLIES 4
Rick_SAS
SAS Super FREQ

For a discussion of exact tests and simulating contingency tables with fixed marginal frequencies, see https://blogs.sas.com/content/iml/2015/10/28/simulation-exact-tables.html 

 

I am unclear how you want to generate random samples. Let's say you have N observations in your big data set. Do you want each of the 1000 random samples to also have N observations (a bootstrap sample), or do you want a smaller number of observations in each random sample? if so, what size is the sample?

 

And what statistic are you trying to get for these samples? If you use bootstrap samples, you can get the sampling distribution of the chi-square statistic (and a confidence interval). Is that what you want?

daufoi
Fluorite | Level 6

@Rick_SASthanks for the response. I got the code from that blog post, but it says "generates 10,000 random tables that have the same row and column sum as the observed table." But I want 10,000 random tables with 10,000 random samples each (or whatever number I select). Entering the crosstabs in the proc iml code is not feasible considering the data set.

 

The data set has 900k observations looking for associations between binary variables. There are 35 total variable combinations that I am looking at, using chi sq test. As expected, running chi sq on all 900k observations will yield significance due to sample size. I am planning to subsample using monte carlo, looking at confidence intervals to determine the number and size of the random tables, and Cramer's V to look at effect size.

Rick_SAS
SAS Super FREQ

OK. You didn't reference the blog post, so I didn't know if you had seen it.

 

The subsamples are not going to be good for much. For example, if you have rare events in the big data, you might have subsamples for which the rare events never appear.  Nevertheless, here's how you get the samples that you want: use the previous program that I posted for the Bootstrap, but use N=number instead of SAMPRATE=1 on the PROC SURVEYSELECT statement.

 

/* The basic bootstrap in SAS. 
   https://blogs.sas.com/content/iml/2016/08/10/bootstrap-confidence-interval-sas.html
*/
data Have(keep=x y);  /* data for crosstab */
   set Sashelp.Cars(rename=(Origin=x Type=y));
   if y^='Hybrid';
run;
 
proc freq data=Have;
tables x*y / chisq norow nocol nopercent;
run;

%let NumSamples = 1000;       /* number of resamples */
%let NSubset = 100;           /* size of each resample */

/* 2. Generate many bootstrap samples */
proc surveyselect data=Have NOPRINT seed=1
     out=BootSamp(rename=(Replicate=SampleID))
     method=urs              /* resample with replacement */
     N=&NSubset              /* each bootstrap sample has N observations */
     OUTHITS
     reps=&NumSamples;       /* generate NumSamples bootstrap resamples */
run;

/* 3. Compute the statistic for each bootstrap sample */
ods select none;
proc freq data=BootSamp;
   by SampleID;
   table x*y / chisq out=FreqOut;
   ods output ChiSq=ChiSq(where=(Statistic='Chi-Square'));
run;
ods select all;

/* distribution of chi-square statistics */
proc means data=ChiSq P5 Mean P95;
   var Value;
   label Value='Chi-Square';
run;
Rick_SAS
SAS Super FREQ

If you want the basic bootstrap, you study the following program:

 

/* The basic bootstrap in SAS. 
   https://blogs.sas.com/content/iml/2016/08/10/bootstrap-confidence-interval-sas.html
*/
data Have(keep=x y);  /* data for crosstab */
   set Sashelp.Cars(rename=(Origin=x Type=y));
   if y^='Hybrid';
run;
 
proc freq data=Have;
tables x*y / chisq norow nocol nopercent;
run;

%let NumSamples = 1000;       /* number of bootstrap resamples */
/* 2. Generate many bootstrap samples */
proc surveyselect data=Have NOPRINT seed=1
     out=BootSamp(rename=(Replicate=SampleID))
     method=urs              /* resample with replacement */
     samprate=1              /* each bootstrap sample has N observations */
     /* OUTHITS                 option to suppress the frequency var */
     reps=&NumSamples;       /* generate NumSamples bootstrap resamples */
run;

/* 3. Compute the statistic for each bootstrap sample */
ods select none;
proc freq data=BootSamp;
   by SampleID;
   weight NumberHits;
   table x*y / chisq out=FreqOut;
   ods output ChiSq=ChiSq(where=(Statistic='Chi-Square'));
run;
ods select all;

/* distribution of chi-square statistics */
proc means data=ChiSq P5 Mean P95;
   var Value;
   label Value='Chi-Square';
run;

title "Bootstrap Distribution of Chi-Square Statistic";
%let Est = 30.3003;
proc sgplot data=ChiSq;
   label Value='Chi-Square';
   histogram value;
   /* Optional: draw reference line at observed value and draw 90% CI */
   refline &Est / axis=x lineattrs=(color=red) 
                  name="Est" legendlabel="Observed Statistic = &Est";
   refline 23.2081132 56.0521988  / axis=x lineattrs=(color=blue) 
                  name="CI" legendlabel="90% CI";
   keylegend "Est" "CI";
run;

 

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 4 replies
  • 526 views
  • 0 likes
  • 2 in conversation