My question relates to simulating data in the data step, not IML, but I did not see a more likely community (redirect me if I missed it). I'm using SAS 9.4 TS1M3. I have two SAS programs that I think should generate the same stream of observations for a simple one-way ANOVA with 3 treatments. One program was written a few years ago and seems to verify results from Scheffe's Design text Table 10.4.2 on inflation of alpha with unequal variances. I was trying to use an updated simulation program based upon Rick Wicklin's Simulation book as it seems more elegant, however, when I run that version I get somewhat different results that do not match Scheffe as well. I am including 2 samples worth of data from both runs using the same generator (RAND with Normal) and the same starting seed ... 30 observations total. What has me completely puzzled is why the data for the first treatment in each sample *IS* identical (see underlined obs), but the 2nd and 3rd treatments differ ... and it is not simply a sequence difference as the numbers are completely different (although curously it seems that the signs of the generated values are the same across the two programs for the last 2 treatments. What obvious issue am I missing? Below are the first two samples from each program followed by the programs. Original code (RAND – starting seed = 12765) Obs sample iter trt yield 1 1 1 A 0.14838 2 1 2 A -1.02313 3 1 3 A 0.25564 4 1 1 B 1.21598 5 1 2 B -0.59991 6 1 3 B -0.89688 7 1 4 B -0.22455 8 1 5 B 1.01282 9 1 6 B 0.66041 10 1 7 B 1.34232 11 1 8 B 2.37081 12 1 9 B -0.28913 13 1 1 C -0.37018 14 1 2 C -0.06950 15 1 3 C 2.18236 16 2 1 A 0.10620 17 2 2 A -2.09921 18 2 3 A 0.15333 19 2 1 B 0.82591 20 2 2 B 0.04954 21 2 3 B 0.07163 22 2 4 B -0.08633 23 2 5 B 0.67917 24 2 6 B 1.33329 25 2 7 B -0.57104 26 2 8 B 0.06797 27 2 9 B -1.10328 28 2 1 C -2.56702 29 2 2 C -1.20457 30 2 3 C 2.17559 After Wicklin using arrays (RAND – starting seed = 12765) Grand Obs Mean Sample i Treatment j Y 1 0 1 1 1 1 0.14838 2 0 1 1 1 2 -1.02313 3 0 1 1 1 3 0.25564 4 0 1 2 2 1 1.71965 5 0 1 2 2 2 -0.84840 6 0 1 2 2 3 -1.26838 7 0 1 2 2 4 -0.31757 8 0 1 2 2 5 1.43234 9 0 1 2 2 6 0.93396 10 0 1 2 2 7 1.89833 11 0 1 2 2 8 3.35283 12 0 1 2 2 9 -0.40890 13 0 1 3 3 1 -0.64117 14 0 1 3 3 2 -0.12038 15 0 1 3 3 3 3.77996 16 0 2 1 1 1 0.10620 17 0 2 1 1 2 -2.09921 18 0 2 1 1 3 0.15333 19 0 2 2 2 1 1.16801 20 0 2 2 2 2 0.07007 21 0 2 2 2 3 0.10130 22 0 2 2 2 4 -0.12209 23 0 2 2 2 5 0.96049 24 0 2 2 2 6 1.88556 25 0 2 2 2 7 -0.80757 26 0 2 2 2 8 0.09613 27 0 2 2 2 9 -1.56027 28 0 2 3 3 1 -4.44621 29 0 2 3 3 2 -2.08639 30 0 2 3 3 3 3.76822 My original code is below ... i left the GLM etc. but that can be ignored - I'm just trying to understand why the data are different. options linesize=80 pagesize=60 formchar="|----|+|---+=|-/\<>*";
title1 'Illustration of Statistical Power in CRD ANOVA -- Uses RAND';
***********************************************************************
*** The generate macro accepts 7 parameters as follows: ***
*** samples = # of samples to generate ***
*** n1,n2,n3 = sample size for each treatment ***
*** mu = population mean ***
*** sigma1, sigma2, sigma3 = population variances ***
*** seed0 = starting seed for simulation ***
*** The RAND function generates a random value from the specified ***
*** distribution (here normal with mean mu and sd sigma). ***
***********************************************************************;
%macro generate(samples, n1, n2, n3, mu, sigma1, sigma2, sigma3, seed0);
data mc (keep=sample iter trt yield);
call streaminit(&seed0); *** Initialize with desired seed. ***;
sigma1=sqrt(&sigma1);
sigma2=sqrt(&sigma2);
sigma3=sqrt(&sigma3);
do sample=1 to &samples;
do iter=1 to &n1;
trt='A'; yield=rand('Normal',(&mu),sigma1); output;
end;
do iter=1 to &n2;
trt='B'; yield=rand('Normal',(&mu),sigma2); output;
end;
do iter=1 to &n3;
trt='C'; yield=rand('Normal',(&mu),sigma3); output;
end;
end;
run;
proc print data=mc;
run;
proc sort;
by sample trt;
run;
*proc univariate plot normal;
* var yield;
* title2 'Check Distribution of Simulated Data';
* run;
proc glm noprint outstat=goal;
by sample;
class trt;
model yield = trt;
means trt / tukey hovtest;
title2 "Completely Randomized Design (CRD) for Simulated Yields";
run;
data goal;
set goal;
if prob<=.05 then flag=1; *** Identify rejections with alpha=.05 ;
else flag=0;
run;
proc print;
title2 'Verify Results';
run;
proc freq;
where _type_='SS3'; *** Count rejections using Type III SS results.;
tables flag;
run;
%mend generate;
%*generate(10000, 5, 5, 5, 0, 1, 2, 3, 0);
%generate(5, 3, 9, 3, 0, 1, 2, 3, 12765);
%*generate(10000, 7, 5, 3, 0, 1, 2, 3, 0);
%*generate(10000, 5, 5, 5, 0, 1, 1, 3, 0);
%*generate(10000, 7, 5, 3, 0, 1, 1, 3, 0); The updated version is below modified from code in Wicklin's Simulating Data using SAS. options linesize=80 pagesize=60 formchar="|----|+|---+=|-/\<>*";
title1 'Illustration of Type I Error/Power in CRD ANOVA';
*************************************************************
Settings from Scheffe table 10-4-2
n1 n2 n3 mn s1 s2 s3
5, 5, 5, 0, 1, 2, 3
3, 9, 3, 0, 1, 2, 3
7, 5, 3, 0, 1, 2, 3
5, 5, 5, 0, 1, 1, 3
7, 5, 3, 0, 1, 1, 3
*************************************************************;
%let Samples=5; *** Specify # of Samples ***;
%let GM=0; *** Specify Grand Mean. ***;
%let NRepV=3 9 3; *** Specify Sample Size Vector. ***;
%let ESV=0 0 0; *** Specify Effect Size Vector. ***;
%let SDV=1 2 3; *** Specify SD Vector. ***;
data raw;
call streaminit(12765);
GrandMean = &GM; *** Specify Grand Mean from Model. ***;
array Size{3} _temporary_ (&NRepV); *** Sample sizes ***;
array Effect{3} _temporary_ (&ESV); *** Effects ***;
array Std{3} _temporary_ (&SDV); *** SDs ***;
do Sample=1 to &Samples;
do i = 1 to dim(Effect); *** Number of treatment levels ***;
Treatment = i;
do j = 1 to Size{i}; *** Number of reps per trt level.***;
Y = GrandMean + Effect{i} + rand("Normal", 0, Std{i});
output;
end;
end;
end;
run;
proc print;
title2 'verify Data';
run;
ods graphics off; ods exclude all; ods noresults;
proc glm data=raw outstat=F_Pooled(where=(_type_='SS3'));
by sample;
class Treatment;
model Y = Treatment;
title2 "CRD with Ns:&NRepV ESs:&ESV SDs:&SDV";
run;
ods graphics on; ods exclude none; ods results;
proc print;
title2 'Verify Results';
run;
proc format;
value sig 0 -.05 = 'Reject'
.05<-1 = 'DNR';
run;
proc freq data=F_Pooled;
tables Prob;
format Prob sig.;
title2 'Rejection Rate for Actual F Tests';
title3 "CRD with Ns:&NRepV ESs:&ESV SDs:&SDV";
run;
data goal;
set F_Pooled;
if prob<=.05 then flag=1; *** Identify rejections with alpha=.05 ;
else flag=0;
run;
proc freq;
where _type_='SS3'; *** Count rejections using Type III SS results.;
tables flag;
run;
I apologize if I'm missing something obvious ... I'm just not seeing it. The results of 10K simulated samples differ as well ... the old code comes pretty close to Scheffe's table but the newer code seems to reject at about a 1-3% higher rate ... If I could understand why the data diiffer, I assume it will be more clear why the results differ. All suggestions are appreciated. Thanks. Curt
... View more