Statistical programming, matrix languages, and more

Why are these simulated data sets different?

Accepted Solution Solved
Reply
Occasional Contributor
Posts: 5
Accepted Solution

Why are these simulated data sets different?

My question relates to simulating data in the data step, not IML, but I did not see a more likely community (redirect me if I missed it).

 

I'm using SAS 9.4 TS1M3.

 

I have two SAS programs that I think should generate the same stream of observations for a simple one-way ANOVA with 3 treatments.  One program was written a few years ago and seems to verify results from Scheffe's Design text Table 10.4.2 on inflation of alpha with unequal variances.  I was trying to use an updated simulation program based upon Rick Wicklin's Simulation book as it seems more elegant, however, when I run that version I get somewhat different results that do not match Scheffe as well.  I am including 2 samples worth of data from both runs using the same generator (RAND with Normal) and the same starting seed ... 30 observations total.  What has me completely puzzled is why the data for the first treatment in each sample *IS* identical (see underlined obs), but the 2nd and 3rd treatments differ ... and it is not simply a sequence difference as the numbers are completely different (although curously it seems that the signs of the generated values are the same across the two programs for the last 2 treatments.  

 

What obvious issue am I missing?

 

Below are the first two samples from each program followed by the programs.

 

Original code (RAND – starting seed = 12765)

                    Obs    sample    iter    trt      yield

 

                      1       1        1      A      0.14838

                      2       1        2      A     -1.02313

                      3       1        3      A      0.25564

                      4       1        1      B      1.21598

                      5       1        2      B     -0.59991

                      6       1        3      B     -0.89688

                      7       1        4      B     -0.22455

                      8       1        5      B      1.01282

                      9       1        6      B      0.66041

                     10       1        7      B      1.34232

                     11       1        8      B      2.37081

                     12       1        9      B     -0.28913

                     13       1        1      C     -0.37018

                     14       1        2      C     -0.06950

                     15       1        3      C      2.18236

                     16       2        1      A      0.10620

                     17       2        2      A     -2.09921

                     18       2        3      A      0.15333

                     19       2        1      B      0.82591

                     20       2        2      B      0.04954

                     21       2        3      B      0.07163

                     22       2        4      B     -0.08633

                     23       2        5      B      0.67917

                     24       2        6      B      1.33329

                     25       2        7      B     -0.57104

                     26       2        8      B      0.06797

                     27       2        9      B     -1.10328

                     28       2        1      C     -2.56702

                     29       2        2      C     -1.20457

                     30       2        3      C      2.17559

 

 

After Wicklin using arrays  (RAND – starting seed = 12765)

                 Grand

           Obs     Mean    Sample    i    Treatment    j        Y

             1      0         1      1        1        1     0.14838

             2      0         1      1        1        2    -1.02313

             3      0         1      1        1        3     0.25564

             4      0         1      2        2        1     1.71965

             5      0         1      2        2        2    -0.84840

             6      0         1      2        2        3    -1.26838

             7      0         1      2        2        4    -0.31757

             8      0         1      2        2        5     1.43234

             9      0         1      2        2        6     0.93396

            10      0         1      2        2        7     1.89833

            11      0         1      2        2        8     3.35283

            12      0         1      2        2        9    -0.40890

            13      0         1      3        3        1    -0.64117

            14      0         1      3        3        2    -0.12038

            15      0         1      3        3        3     3.77996

            16      0         2      1        1        1     0.10620

            17      0         2      1        1        2    -2.09921

            18      0         2      1        1        3     0.15333

            19      0         2      2        2        1     1.16801

            20      0         2      2        2        2     0.07007

            21      0         2      2        2        3     0.10130

            22      0         2      2        2        4    -0.12209

            23      0         2      2        2        5     0.96049

            24      0         2      2        2        6     1.88556

            25      0         2      2        2        7    -0.80757

            26      0         2      2        2        8     0.09613

            27      0         2      2        2        9    -1.56027

            28      0         2      3        3        1    -4.44621

            29      0         2      3        3        2    -2.08639

            30      0         2      3        3        3     3.76822

My original code is below ... i left the GLM etc. but that can be ignored - I'm just trying to understand why the data are different.

options linesize=80 pagesize=60 formchar="|----|+|---+=|-/\<>*";

title1 'Illustration of Statistical Power in CRD ANOVA -- Uses RAND';

***********************************************************************
*** The generate macro accepts 7 parameters as follows:             ***
***    samples = # of samples to generate                           ***
***    n1,n2,n3   = sample size for each treatment                  ***
***    mu  = population mean                                        ***
***    sigma1, sigma2, sigma3 = population variances                ***
***    seed0 = starting seed for simulation                         ***
*** The RAND function generates a random value from the specified   ***
*** distribution (here normal with mean mu and sd sigma).           ***
***********************************************************************;

%macro generate(samples, n1, n2, n3, mu, sigma1, sigma2, sigma3, seed0);
  
  data mc (keep=sample iter trt yield);

    call streaminit(&seed0);  *** Initialize with desired seed. ***;
    sigma1=sqrt(&sigma1);
    sigma2=sqrt(&sigma2);
    sigma3=sqrt(&sigma3);
    
    do sample=1 to &samples;
      do iter=1 to &n1;
        trt='A'; yield=rand('Normal',(&mu),sigma1); output;
      end;
      do iter=1 to &n2;
        trt='B'; yield=rand('Normal',(&mu),sigma2); output;
      end;
      do iter=1 to &n3;
        trt='C'; yield=rand('Normal',(&mu),sigma3); output;
      end;
    end;
  run;
  
proc print data=mc;
  run;
  
proc sort;
  by sample trt;
  run;

*proc univariate plot normal;
*  var yield;
*  title2 'Check Distribution of Simulated Data';
*  run;

proc glm noprint outstat=goal;
  by sample;
  class trt;
  model yield = trt;
  means trt / tukey hovtest;
  title2 "Completely Randomized Design (CRD) for Simulated Yields";
  run;

data goal;
  set goal;
  if prob<=.05 then flag=1;  *** Identify rejections with alpha=.05 ;
     else flag=0;
  run;
proc print;
  title2 'Verify Results';
  run;
  
proc freq;
  where _type_='SS3';   *** Count rejections using Type III SS results.;
  tables flag;
  run;

%mend generate;

%*generate(10000, 5, 5, 5, 0, 1, 2, 3, 0);
%generate(5, 3, 9, 3, 0, 1, 2, 3, 12765);
%*generate(10000, 7, 5, 3, 0, 1, 2, 3, 0);

%*generate(10000, 5, 5, 5, 0, 1, 1, 3, 0);
%*generate(10000, 7, 5, 3, 0, 1, 1, 3, 0);

The updated version is below modified from code in Wicklin's Simulating Data using SAS.

options linesize=80 pagesize=60 formchar="|----|+|---+=|-/\<>*";

title1 'Illustration of Type I Error/Power in CRD ANOVA';

*************************************************************
Settings from Scheffe table 10-4-2 
n1 n2 n3 mn s1 s2 s3
 5, 5, 5, 0, 1, 2, 3
 3, 9, 3, 0, 1, 2, 3
 7, 5, 3, 0, 1, 2, 3

 5, 5, 5, 0, 1, 1, 3
 7, 5, 3, 0, 1, 1, 3
*************************************************************;

%let Samples=5;              *** Specify # of Samples ***;
%let GM=0;                       *** Specify Grand Mean. ***;
%let NRepV=3 9 3;   *** Specify Sample Size Vector. ***;
%let ESV=0 0 0;     *** Specify Effect Size Vector. ***;
%let SDV=1 2 3;              *** Specify SD Vector. ***;

data raw;
  call streaminit(12765);
  GrandMean = &GM;     *** Specify Grand Mean from Model. ***;
  
  array Size{3}   _temporary_ (&NRepV);  *** Sample sizes ***;
  array Effect{3} _temporary_ (&ESV);         *** Effects ***;
  array Std{3}    _temporary_ (&SDV);         *** SDs     ***;
  
  do Sample=1 to &Samples;
   do i = 1 to dim(Effect); *** Number of treatment levels ***;
     Treatment = i;
     do j = 1 to Size{i}; *** Number of reps per trt level.***;
        Y = GrandMean + Effect{i} + rand("Normal", 0, Std{i});
        output;
     end;
   end;
  end;
  run;
proc print;
  title2 'verify Data';
  run;
  
ods graphics off;  ods exclude all;  ods noresults;
proc glm data=raw outstat=F_Pooled(where=(_type_='SS3'));
  by sample;
  class Treatment;
  model Y = Treatment;
  title2 "CRD with Ns:&NRepV    ESs:&ESV    SDs:&SDV";
  run;
ods graphics on;  ods exclude none;  ods results;
proc print;
  title2 'Verify Results';
  run;

proc format;
  value sig 0 -.05 = 'Reject' 
            .05<-1 = 'DNR';
  run;

proc freq data=F_Pooled;
  tables Prob;
  format Prob sig.;
  title2 'Rejection Rate for Actual F Tests';
  title3 "CRD with Ns:&NRepV    ESs:&ESV    SDs:&SDV";
 run;
  
data goal;
  set F_Pooled;
  if prob<=.05 then flag=1;  *** Identify rejections with alpha=.05 ;
     else flag=0;
  run;

proc freq;
  where _type_='SS3';   *** Count rejections using Type III SS results.;
  tables flag;
  run;

I apologize if I'm missing something obvious ... I'm just not seeing it.  The results of 10K simulated samples differ as well ... the old code comes pretty close to Scheffe's table but the newer code seems to reject at about a 1-3% higher rate ... If I could understand why the data diiffer, I assume it will be more clear why the results differ.

 

All suggestions are appreciated.  Thanks.  Curt

Attachment
Attachment

Accepted Solutions
Solution
‎03-30-2017 02:48 AM
Respected Advisor
Posts: 4,606

Re: Why are these simulated data sets different?

In one case you are specifying sigma, in the other std. You need

 

%generate(5, 3, 9, 3, 0, 1, 4, 9, 12765);

 

to get the same sequence.

PG

View solution in original post


All Replies
Esteemed Advisor
Posts: 7,285

Re: Why are these simulated data sets different?

Your assignment (in the macro) is using the timeclock as a seed, i.e. mu=0

 

Art, CEO, AnalystFinder.com

 

Occasional Contributor
Posts: 5

Re: Why are these simulated data sets different?

That is true that the original code used the clock before as the starting seed, but for the uncommented call to the macro I have the fixed seed chosen (I probably should have deleted the extra macro calls - sorry).  I believe this is why the two streams are the same at the start as they do both use the same seed.  What I don't understand is why they do not remain the same for the entire set of treatments in each sample. - Curt 

Esteemed Advisor
Posts: 7,285

Re: Why are these simulated data sets different?

I haven't closely review both sets of code, but in an uncommeted macro call you are assigning 0 as the seed for mu

 

%generate(5, 3, 9, 3, 0, 1, 2, 3, 12765);

If you are consistently using the same seed using Rick's method, shouldn't you be using:

 

%generate(5, 3, 9, 3, 12765, 1, 2, 3, 12765);

Art, CEO, AnalystFinder.com

 

Grand Advisor
Posts: 17,313

Re: Why are these simulated data sets different?

Beyond that, I do believe there was an update to the random number generators in SAS between 9.2 and 9.3. 

 

Not sure if this may impact you, but you may want to take a look into that.

 

Also, I would probably ask SAS Tech Support, if you haven't already. 

Solution
‎03-30-2017 02:48 AM
Respected Advisor
Posts: 4,606

Re: Why are these simulated data sets different?

In one case you are specifying sigma, in the other std. You need

 

%generate(5, 3, 9, 3, 0, 1, 4, 9, 12765);

 

to get the same sequence.

PG
Occasional Contributor
Posts: 5

Re: Why are these simulated data sets different?

Thank you for identifying the problem ... I suspected it was something innocuous that I simply wasn't picking up on but I didn't have anyone else here to go over the code for me and after a couple hours decided to post.  Usually we talk about specifying the variance when we discuss the normal distribution in class but RAND uses the standard deviation - I knew that was going to catch me some day.  Should have caught it when I noticed the first treatment with SD=1 (or Variance=1) yielded the same results but the other treatments did not.

 

Thank you again, PG. -- Curt

☑ This topic is SOLVED.

Need further help from the community? Please ask a new question.

Discussion stats
  • 6 replies
  • 271 views
  • 4 likes
  • 4 in conversation