Solved: Why are these simulated data sets different?

Chiron77omus · Posted 03-29-2017 05:46 PM

My question relates to simulating data in the data step, not IML, but I did not see a more likely community (redirect me if I missed it).

I'm using SAS 9.4 TS1M3.

I have two SAS programs that I think should generate the same stream of observations for a simple one-way ANOVA with 3 treatments. One program was written a few years ago and seems to verify results from Scheffe's Design text Table 10.4.2 on inflation of alpha with unequal variances. I was trying to use an updated simulation program based upon Rick Wicklin's Simulation book as it seems more elegant, however, when I run that version I get somewhat different results that do not match Scheffe as well. I am including 2 samples worth of data from both runs using the same generator (RAND with Normal) and the same starting seed ... 30 observations total. What has me completely puzzled is why the data for the first treatment in each sample *IS* identical (see underlined obs), but the 2nd and 3rd treatments differ ... and it is not simply a sequence difference as the numbers are completely different (although curously it seems that the signs of the generated values are the same across the two programs for the last 2 treatments.

What obvious issue am I missing?

Below are the first two samples from each program followed by the programs.

Original code (RAND – starting seed = 12765)

Obs sample iter trt yield

1 1 1 A 0.14838

2 1 2 A -1.02313

3 1 3 A 0.25564

4 1 1 B 1.21598

5 1 2 B -0.59991

6 1 3 B -0.89688

7 1 4 B -0.22455

8 1 5 B 1.01282

9 1 6 B 0.66041

10 1 7 B 1.34232

11 1 8 B 2.37081

12 1 9 B -0.28913

13 1 1 C -0.37018

14 1 2 C -0.06950

15 1 3 C 2.18236

16 2 1 A 0.10620

17 2 2 A -2.09921

18 2 3 A 0.15333

19 2 1 B 0.82591

20 2 2 B 0.04954

21 2 3 B 0.07163

22 2 4 B -0.08633

23 2 5 B 0.67917

24 2 6 B 1.33329

25 2 7 B -0.57104

26 2 8 B 0.06797

27 2 9 B -1.10328

28 2 1 C -2.56702

29 2 2 C -1.20457

30 2 3 C 2.17559

After Wicklin using arrays (RAND – starting seed = 12765)

Grand

Obs Mean Sample i Treatment j Y

1 0 1 1 1 1 0.14838

2 0 1 1 1 2 -1.02313

3 0 1 1 1 3 0.25564

4 0 1 2 2 1 1.71965

5 0 1 2 2 2 -0.84840

6 0 1 2 2 3 -1.26838

7 0 1 2 2 4 -0.31757

8 0 1 2 2 5 1.43234

9 0 1 2 2 6 0.93396

10 0 1 2 2 7 1.89833

11 0 1 2 2 8 3.35283

12 0 1 2 2 9 -0.40890

13 0 1 3 3 1 -0.64117

14 0 1 3 3 2 -0.12038

15 0 1 3 3 3 3.77996

16 0 2 1 1 1 0.10620

17 0 2 1 1 2 -2.09921

18 0 2 1 1 3 0.15333

19 0 2 2 2 1 1.16801

20 0 2 2 2 2 0.07007

21 0 2 2 2 3 0.10130

22 0 2 2 2 4 -0.12209

23 0 2 2 2 5 0.96049

24 0 2 2 2 6 1.88556

25 0 2 2 2 7 -0.80757

26 0 2 2 2 8 0.09613

27 0 2 2 2 9 -1.56027

28 0 2 3 3 1 -4.44621

29 0 2 3 3 2 -2.08639

30 0 2 3 3 3 3.76822

My original code is below ... i left the GLM etc. but that can be ignored - I'm just trying to understand why the data are different.

options linesize=80 pagesize=60 formchar="|----|+|---+=|-/\<>*";

title1 'Illustration of Statistical Power in CRD ANOVA -- Uses RAND';

***********************************************************************
*** The generate macro accepts 7 parameters as follows:             ***
***    samples = # of samples to generate                           ***
***    n1,n2,n3   = sample size for each treatment                  ***
***    mu  = population mean                                        ***
***    sigma1, sigma2, sigma3 = population variances                ***
***    seed0 = starting seed for simulation                         ***
*** The RAND function generates a random value from the specified   ***
*** distribution (here normal with mean mu and sd sigma).           ***
***********************************************************************;

%macro generate(samples, n1, n2, n3, mu, sigma1, sigma2, sigma3, seed0);
  
  data mc (keep=sample iter trt yield);

    call streaminit(&seed0);  *** Initialize with desired seed. ***;
    sigma1=sqrt(&sigma1);
    sigma2=sqrt(&sigma2);
    sigma3=sqrt(&sigma3);
    
    do sample=1 to &samples;
      do iter=1 to &n1;
        trt='A'; yield=rand('Normal',(&mu),sigma1); output;
      end;
      do iter=1 to &n2;
        trt='B'; yield=rand('Normal',(&mu),sigma2); output;
      end;
      do iter=1 to &n3;
        trt='C'; yield=rand('Normal',(&mu),sigma3); output;
      end;
    end;
  run;
  
proc print data=mc;
  run;
  
proc sort;
  by sample trt;
  run;

*proc univariate plot normal;
*  var yield;
*  title2 'Check Distribution of Simulated Data';
*  run;

proc glm noprint outstat=goal;
  by sample;
  class trt;
  model yield = trt;
  means trt / tukey hovtest;
  title2 "Completely Randomized Design (CRD) for Simulated Yields";
  run;

data goal;
  set goal;
  if prob<=.05 then flag=1;  *** Identify rejections with alpha=.05 ;
     else flag=0;
  run;
proc print;
  title2 'Verify Results';
  run;
  
proc freq;
  where _type_='SS3';   *** Count rejections using Type III SS results.;
  tables flag;
  run;

%mend generate;

%*generate(10000, 5, 5, 5, 0, 1, 2, 3, 0);
%generate(5, 3, 9, 3, 0, 1, 2, 3, 12765);
%*generate(10000, 7, 5, 3, 0, 1, 2, 3, 0);

%*generate(10000, 5, 5, 5, 0, 1, 1, 3, 0);
%*generate(10000, 7, 5, 3, 0, 1, 1, 3, 0);

The updated version is below modified from code in Wicklin's Simulating Data using SAS.

options linesize=80 pagesize=60 formchar="|----|+|---+=|-/\<>*";

title1 'Illustration of Type I Error/Power in CRD ANOVA';

*************************************************************
Settings from Scheffe table 10-4-2 
n1 n2 n3 mn s1 s2 s3
 5, 5, 5, 0, 1, 2, 3
 3, 9, 3, 0, 1, 2, 3
 7, 5, 3, 0, 1, 2, 3

 5, 5, 5, 0, 1, 1, 3
 7, 5, 3, 0, 1, 1, 3
*************************************************************;

%let Samples=5;              *** Specify # of Samples ***;
%let GM=0;                       *** Specify Grand Mean. ***;
%let NRepV=3 9 3;   *** Specify Sample Size Vector. ***;
%let ESV=0 0 0;     *** Specify Effect Size Vector. ***;
%let SDV=1 2 3;              *** Specify SD Vector. ***;

data raw;
  call streaminit(12765);
  GrandMean = &GM;     *** Specify Grand Mean from Model. ***;
  
  array Size{3}   _temporary_ (&NRepV);  *** Sample sizes ***;
  array Effect{3} _temporary_ (&ESV);         *** Effects ***;
  array Std{3}    _temporary_ (&SDV);         *** SDs     ***;
  
  do Sample=1 to &Samples;
   do i = 1 to dim(Effect); *** Number of treatment levels ***;
     Treatment = i;
     do j = 1 to Size{i}; *** Number of reps per trt level.***;
        Y = GrandMean + Effect{i} + rand("Normal", 0, Std{i});
        output;
     end;
   end;
  end;
  run;
proc print;
  title2 'verify Data';
  run;
  
ods graphics off;  ods exclude all;  ods noresults;
proc glm data=raw outstat=F_Pooled(where=(_type_='SS3'));
  by sample;
  class Treatment;
  model Y = Treatment;
  title2 "CRD with Ns:&NRepV    ESs:&ESV    SDs:&SDV";
  run;
ods graphics on;  ods exclude none;  ods results;
proc print;
  title2 'Verify Results';
  run;

proc format;
  value sig 0 -.05 = 'Reject' 
            .05<-1 = 'DNR';
  run;

proc freq data=F_Pooled;
  tables Prob;
  format Prob sig.;
  title2 'Rejection Rate for Actual F Tests';
  title3 "CRD with Ns:&NRepV    ESs:&ESV    SDs:&SDV";
 run;
  
data goal;
  set F_Pooled;
  if prob<=.05 then flag=1;  *** Identify rejections with alpha=.05 ;
     else flag=0;
  run;

proc freq;
  where _type_='SS3';   *** Count rejections using Type III SS results.;
  tables flag;
  run;

I apologize if I'm missing something obvious ... I'm just not seeing it. The results of 10K simulated samples differ as well ... the old code comes pretty close to Scheffe's table but the newer code seems to reject at about a 1-3% higher rate ... If I could understand why the data diiffer, I assume it will be more clear why the results differ.

All suggestions are appreciated. Thanks. Curt

PGStats · Posted 03-29-2017 06:30 PM

In one case you are specifying sigma, in the other std. You need

%generate(5, 3, 9, 3, 0, 1, 4, 9, 12765);

to get the same sequence.

PG

View solution in original post

art297 · Posted 03-29-2017 05:59 PM

Your assignment (in the macro) is using the timeclock as a seed, i.e. mu=0

Art, CEO, AnalystFinder.com

Chiron77omus · Posted 03-29-2017 06:24 PM

That is true that the original code used the clock before as the starting seed, but for the uncommented call to the macro I have the fixed seed chosen (I probably should have deleted the extra macro calls - sorry). I believe this is why the two streams are the same at the start as they do both use the same seed. What I don't understand is why they do not remain the same for the entire set of treatments in each sample. - Curt

art297 · Posted 03-29-2017 06:30 PM

I haven't closely review both sets of code, but in an uncommeted macro call you are assigning 0 as the seed for mu

%generate(5, 3, 9, 3, 0, 1, 2, 3, 12765);

If you are consistently using the same seed using Rick's method, shouldn't you be using:

%generate(5, 3, 9, 3, 12765, 1, 2, 3, 12765);

Art, CEO, AnalystFinder.com

Reeza · Posted 03-29-2017 06:28 PM

Beyond that, I do believe there was an update to the random number generators in SAS between 9.2 and 9.3.

Not sure if this may impact you, but you may want to take a look into that.

Also, I would probably ask SAS Tech Support, if you haven't already.

PGStats · Posted 03-29-2017 06:30 PM

In one case you are specifying sigma, in the other std. You need

%generate(5, 3, 9, 3, 0, 1, 4, 9, 12765);

to get the same sequence.

PG

Chiron77omus · Posted 03-30-2017 02:54 AM

Thank you for identifying the problem ... I suspected it was something innocuous that I simply wasn't picking up on but I didn't have anyone else here to go over the code for me and after a couple hours decided to post. Usually we talk about specifying the variance when we discuss the normal distribution in class but RAND uses the standard deviation - I knew that was going to catch me some day. Should have caught it when I noticed the first treatment with SD=1 (or Variance=1) yielded the same results but the other treatments did not.

Thank you again, PG. -- Curt

Why are these simulated data sets different?

Re: Why are these simulated data sets different?

Re: Why are these simulated data sets different?

Re: Why are these simulated data sets different?

Re: Why are these simulated data sets different?

Re: Why are these simulated data sets different?

Re: Why are these simulated data sets different?

Re: Why are these simulated data sets different?

Registration is open