My question relates to simulating data in the data step, not IML, but I did not see a more likely community (redirect me if I missed it).
I'm using SAS 9.4 TS1M3.
I have two SAS programs that I think should generate the same stream of observations for a simple one-way ANOVA with 3 treatments. One program was written a few years ago and seems to verify results from Scheffe's Design text Table 10.4.2 on inflation of alpha with unequal variances. I was trying to use an updated simulation program based upon Rick Wicklin's Simulation book as it seems more elegant, however, when I run that version I get somewhat different results that do not match Scheffe as well. I am including 2 samples worth of data from both runs using the same generator (RAND with Normal) and the same starting seed ... 30 observations total. What has me completely puzzled is why the data for the first treatment in each sample *IS* identical (see underlined obs), but the 2nd and 3rd treatments differ ... and it is not simply a sequence difference as the numbers are completely different (although curously it seems that the signs of the generated values are the same across the two programs for the last 2 treatments.
What obvious issue am I missing?
Below are the first two samples from each program followed by the programs.
Original code (RAND – starting seed = 12765)
Obs sample iter trt yield
1 1 1 A 0.14838
2 1 2 A -1.02313
3 1 3 A 0.25564
4 1 1 B 1.21598
5 1 2 B -0.59991
6 1 3 B -0.89688
7 1 4 B -0.22455
8 1 5 B 1.01282
9 1 6 B 0.66041
10 1 7 B 1.34232
11 1 8 B 2.37081
12 1 9 B -0.28913
13 1 1 C -0.37018
14 1 2 C -0.06950
15 1 3 C 2.18236
16 2 1 A 0.10620
17 2 2 A -2.09921
18 2 3 A 0.15333
19 2 1 B 0.82591
20 2 2 B 0.04954
21 2 3 B 0.07163
22 2 4 B -0.08633
23 2 5 B 0.67917
24 2 6 B 1.33329
25 2 7 B -0.57104
26 2 8 B 0.06797
27 2 9 B -1.10328
28 2 1 C -2.56702
29 2 2 C -1.20457
30 2 3 C 2.17559
After Wicklin using arrays (RAND – starting seed = 12765)
Grand
Obs Mean Sample i Treatment j Y
1 0 1 1 1 1 0.14838
2 0 1 1 1 2 -1.02313
3 0 1 1 1 3 0.25564
4 0 1 2 2 1 1.71965
5 0 1 2 2 2 -0.84840
6 0 1 2 2 3 -1.26838
7 0 1 2 2 4 -0.31757
8 0 1 2 2 5 1.43234
9 0 1 2 2 6 0.93396
10 0 1 2 2 7 1.89833
11 0 1 2 2 8 3.35283
12 0 1 2 2 9 -0.40890
13 0 1 3 3 1 -0.64117
14 0 1 3 3 2 -0.12038
15 0 1 3 3 3 3.77996
16 0 2 1 1 1 0.10620
17 0 2 1 1 2 -2.09921
18 0 2 1 1 3 0.15333
19 0 2 2 2 1 1.16801
20 0 2 2 2 2 0.07007
21 0 2 2 2 3 0.10130
22 0 2 2 2 4 -0.12209
23 0 2 2 2 5 0.96049
24 0 2 2 2 6 1.88556
25 0 2 2 2 7 -0.80757
26 0 2 2 2 8 0.09613
27 0 2 2 2 9 -1.56027
28 0 2 3 3 1 -4.44621
29 0 2 3 3 2 -2.08639
30 0 2 3 3 3 3.76822
My original code is below ... i left the GLM etc. but that can be ignored - I'm just trying to understand why the data are different.
options linesize=80 pagesize=60 formchar="|----|+|---+=|-/\<>*";
title1 'Illustration of Statistical Power in CRD ANOVA -- Uses RAND';
***********************************************************************
*** The generate macro accepts 7 parameters as follows: ***
*** samples = # of samples to generate ***
*** n1,n2,n3 = sample size for each treatment ***
*** mu = population mean ***
*** sigma1, sigma2, sigma3 = population variances ***
*** seed0 = starting seed for simulation ***
*** The RAND function generates a random value from the specified ***
*** distribution (here normal with mean mu and sd sigma). ***
***********************************************************************;
%macro generate(samples, n1, n2, n3, mu, sigma1, sigma2, sigma3, seed0);
data mc (keep=sample iter trt yield);
call streaminit(&seed0); *** Initialize with desired seed. ***;
sigma1=sqrt(&sigma1);
sigma2=sqrt(&sigma2);
sigma3=sqrt(&sigma3);
do sample=1 to &samples;
do iter=1 to &n1;
trt='A'; yield=rand('Normal',(&mu),sigma1); output;
end;
do iter=1 to &n2;
trt='B'; yield=rand('Normal',(&mu),sigma2); output;
end;
do iter=1 to &n3;
trt='C'; yield=rand('Normal',(&mu),sigma3); output;
end;
end;
run;
proc print data=mc;
run;
proc sort;
by sample trt;
run;
*proc univariate plot normal;
* var yield;
* title2 'Check Distribution of Simulated Data';
* run;
proc glm noprint outstat=goal;
by sample;
class trt;
model yield = trt;
means trt / tukey hovtest;
title2 "Completely Randomized Design (CRD) for Simulated Yields";
run;
data goal;
set goal;
if prob<=.05 then flag=1; *** Identify rejections with alpha=.05 ;
else flag=0;
run;
proc print;
title2 'Verify Results';
run;
proc freq;
where _type_='SS3'; *** Count rejections using Type III SS results.;
tables flag;
run;
%mend generate;
%*generate(10000, 5, 5, 5, 0, 1, 2, 3, 0);
%generate(5, 3, 9, 3, 0, 1, 2, 3, 12765);
%*generate(10000, 7, 5, 3, 0, 1, 2, 3, 0);
%*generate(10000, 5, 5, 5, 0, 1, 1, 3, 0);
%*generate(10000, 7, 5, 3, 0, 1, 1, 3, 0);
The updated version is below modified from code in Wicklin's Simulating Data using SAS.
options linesize=80 pagesize=60 formchar="|----|+|---+=|-/\<>*";
title1 'Illustration of Type I Error/Power in CRD ANOVA';
*************************************************************
Settings from Scheffe table 10-4-2
n1 n2 n3 mn s1 s2 s3
5, 5, 5, 0, 1, 2, 3
3, 9, 3, 0, 1, 2, 3
7, 5, 3, 0, 1, 2, 3
5, 5, 5, 0, 1, 1, 3
7, 5, 3, 0, 1, 1, 3
*************************************************************;
%let Samples=5; *** Specify # of Samples ***;
%let GM=0; *** Specify Grand Mean. ***;
%let NRepV=3 9 3; *** Specify Sample Size Vector. ***;
%let ESV=0 0 0; *** Specify Effect Size Vector. ***;
%let SDV=1 2 3; *** Specify SD Vector. ***;
data raw;
call streaminit(12765);
GrandMean = &GM; *** Specify Grand Mean from Model. ***;
array Size{3} _temporary_ (&NRepV); *** Sample sizes ***;
array Effect{3} _temporary_ (&ESV); *** Effects ***;
array Std{3} _temporary_ (&SDV); *** SDs ***;
do Sample=1 to &Samples;
do i = 1 to dim(Effect); *** Number of treatment levels ***;
Treatment = i;
do j = 1 to Size{i}; *** Number of reps per trt level.***;
Y = GrandMean + Effect{i} + rand("Normal", 0, Std{i});
output;
end;
end;
end;
run;
proc print;
title2 'verify Data';
run;
ods graphics off; ods exclude all; ods noresults;
proc glm data=raw outstat=F_Pooled(where=(_type_='SS3'));
by sample;
class Treatment;
model Y = Treatment;
title2 "CRD with Ns:&NRepV ESs:&ESV SDs:&SDV";
run;
ods graphics on; ods exclude none; ods results;
proc print;
title2 'Verify Results';
run;
proc format;
value sig 0 -.05 = 'Reject'
.05<-1 = 'DNR';
run;
proc freq data=F_Pooled;
tables Prob;
format Prob sig.;
title2 'Rejection Rate for Actual F Tests';
title3 "CRD with Ns:&NRepV ESs:&ESV SDs:&SDV";
run;
data goal;
set F_Pooled;
if prob<=.05 then flag=1; *** Identify rejections with alpha=.05 ;
else flag=0;
run;
proc freq;
where _type_='SS3'; *** Count rejections using Type III SS results.;
tables flag;
run;
I apologize if I'm missing something obvious ... I'm just not seeing it. The results of 10K simulated samples differ as well ... the old code comes pretty close to Scheffe's table but the newer code seems to reject at about a 1-3% higher rate ... If I could understand why the data diiffer, I assume it will be more clear why the results differ.
All suggestions are appreciated. Thanks. Curt
In one case you are specifying sigma, in the other std. You need
%generate(5, 3, 9, 3, 0, 1, 4, 9, 12765);
to get the same sequence.
Your assignment (in the macro) is using the timeclock as a seed, i.e. mu=0
Art, CEO, AnalystFinder.com
That is true that the original code used the clock before as the starting seed, but for the uncommented call to the macro I have the fixed seed chosen (I probably should have deleted the extra macro calls - sorry). I believe this is why the two streams are the same at the start as they do both use the same seed. What I don't understand is why they do not remain the same for the entire set of treatments in each sample. - Curt
I haven't closely review both sets of code, but in an uncommeted macro call you are assigning 0 as the seed for mu
%generate(5, 3, 9, 3, 0, 1, 2, 3, 12765);
If you are consistently using the same seed using Rick's method, shouldn't you be using:
%generate(5, 3, 9, 3, 12765, 1, 2, 3, 12765);
Art, CEO, AnalystFinder.com
Beyond that, I do believe there was an update to the random number generators in SAS between 9.2 and 9.3.
Not sure if this may impact you, but you may want to take a look into that.
Also, I would probably ask SAS Tech Support, if you haven't already.
In one case you are specifying sigma, in the other std. You need
%generate(5, 3, 9, 3, 0, 1, 4, 9, 12765);
to get the same sequence.
Thank you for identifying the problem ... I suspected it was something innocuous that I simply wasn't picking up on but I didn't have anyone else here to go over the code for me and after a couple hours decided to post. Usually we talk about specifying the variance when we discuss the normal distribution in class but RAND uses the standard deviation - I knew that was going to catch me some day. Should have caught it when I noticed the first treatment with SD=1 (or Variance=1) yielded the same results but the other treatments did not.
Thank you again, PG. -- Curt
Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9.
Lock in the best rate now before the price increases on April 1.