Solved: Simulate bivariate normal data for multiple sample sizes for two groups

SWEETSAS · Posted 05-12-2020 07:28 AM

Hello,

I am trying to simulate bivariate normal data for two groups with different covariance matrices. I need to generate this data for several sample sizes (e.g., 20, 40, etc.) with 2000 replicates each. I want the ratio of the sample sizes in group 1 to group 2 to be 1:2.

Below is the code I have for one group. For some reason, the CREATE statement creates a separate MVN data set on each pass through the loop, and the last data set overwrites the earlier one.

Any help will be appreciated.

proc iml;
Numsamples=10;
/*specify population mean and covariance:grp1*/
mean1={6.0 6.0};
Cov1={0.5280563 0.502445,
      0.502445 0.5280563
};

/*specify population mean and covariance:grp2*/
mean2={6.2499 5.7399};
Cov2={0.6280563 0.200978,
      0.200978 0.401956
};
call randseed(132);
do N=5 to 10 by 5;
X=RandNormal(N*Numsamples,Mean2,Cov2);
ID=colvec(repeat(T(1:Numsamples),1,N));
Z=ID||X;

create MVN from Z[c={"ID" "y0" "y1" }];
append from Z;
*end;
close MVN;
end;

quit;

Rick_SAS · Posted 05-12-2020 08:42 AM

It looks like you've already read the article "How to generate multiple samples from the multivariate normal distribution in SAS."

For simulation studies, it can be convenient to write each sample to a data set from within a SAS/IML loop.

So put the CREATE statement before the loop and the CLOSE statement after the loop. Inside the loop, you create both samples and then use the APPEND statement.

It's not clear to me how you want a 1:2 ratio when the sample size is not divisible by 3. For example, when N = 20, do you want 6 and 14 as the sample sizes, or do you want 20 and 40? In the following program, I've used the second option.

I suspect you will also need a second ID variable to identify which observations come from the first distribution and which from the second. I called that the GROUP variable:

proc iml;
Numsamples=10;
/*specify population mean and covariance:grp1*/
mean1={6.0 6.0};
Cov1={0.5280563  0.502445,
      0.502445   0.5280563
  };
/*specify population mean and covariance:grp2*/
mean2={6.2499 5.7399};
Cov2={0.6280563  0.200978,
      0.200978   0.401956
  };
call randseed(132);

Z = {. . . .};    /* tell IML Z is numeric */
create MVN from Z[c={"Group" "ID" "y0" "y1" }];

do N=5 to 10 by 5;
   N1 = N;      /* or N1 = floor(N/3); ? */
   X=RandNormal(N1*Numsamples,Mean1,Cov1);   /* group 1 uses the first distribution */
   ID=colvec(repeat(T(1:Numsamples),1,N1));
   Group = j(nrow(ID), 1, 1);  
   Z=Group||ID||X;
   append from Z;

   N2 = 2*N;    /* or N2 = N - N1; ? */
   X=RandNormal(N2*Numsamples,Mean2,Cov2);
   ID=colvec(repeat(T(1:Numsamples),1,N2));
   Group = j(nrow(ID), 1, 2);  
   Z=Group||ID||X;
   append from Z;
end;

close MVN;
quit;
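
If N is instead meant to be the total sample size, split roughly 1:2 between the groups (the first option mentioned above), the group sizes could be computed along the lines of this minimal sketch. It only prints the sizes. The totals 20 and 40 come from the question, and rounding down with FLOOR is an assumption about how to treat totals that are not divisible by 3.

```sas
proc iml;
do N = 20 to 40 by 20;
   N1 = floor(N/3);   /* group 1 size, rounded down (assumed rounding rule) */
   N2 = N - N1;       /* group 2 gets the remainder                         */
   print N N1 N2;     /* N=20: N1=6, N2=14;  N=40: N1=13, N2=27             */
end;
quit;
```

With this choice the ratio is exactly 1:2 only when N is a multiple of 3.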

PaigeMiller · Posted 05-12-2020 07:37 AM

Yes, each time it creates a data set named MVN (the exact same name each time, so it overwrites the previous version of MVN). That's how the program is written.

This article from @Rick_SAS explains how you can overcome this:

https://blogs.sas.com/content/iml/2015/02/09/array-of-matrices.html

--
Paige Miller

IanWakeling · Posted 05-12-2020 08:41 AM

As Paige says, you are overwriting the same data set in the loop. An alternative would be to build one data set with successive appends as follows:

Z = {. . . .};    /* tell IML Z is numeric with four columns */
create MVN from Z[c={"N" "ID" "y0" "y1"}];

do N=5 to 10 by 5;
 X=RandNormal(N*Numsamples,Mean2,Cov2);
 ID=colvec(repeat(T(1:Numsamples),1,N));
 Z = j(nrow(X),1,N)||ID||X;
 append from Z;
end;
  
close MVN;

I have added the loop variable N to the output data set which you can use in WHERE or BY statements in other SAS PROCs.
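
As a rough illustration of the BY-statement usage, the sketch below computes the correlation between y0 and y1 for every combination of sample size and replicate. It assumes the MVN data set has the columns N, ID, y0 and y1 created above. The output data set name CorrOut is a made-up name for this example.

```sas
/* BY-group processing expects the data to be sorted by the BY variables */
proc sort data=MVN;
   by N ID;
run;

/* one correlation matrix per (N, ID) combination, written to CorrOut */
proc corr data=MVN noprint outp=CorrOut;
   by N ID;
   var y0 y1;
run;
```

A WHERE statement such as `where N=5;` could be added to either step to restrict the analysis to a single sample size.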

SWEETSAS · Posted 05-12-2020 09:10 AM

Great! Many thanks!

SWEETSAS · Posted 05-12-2020 09:08 AM

Excellent!! Many thanks! Yes, I did read the referenced resource.

Is it possible to add a column for the sample size (N) to allow analysis BY N and ID?

Rick_SAS · Posted 05-12-2020 09:58 AM

Yes, of course. Just add an additional column to Z. You'll want to modify the Z= assignments and the CREATE statement.
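
For example, the two-group program might be modified as in the sketch below. It assumes that group 1 is drawn from the first distribution (mean1, Cov1) and group 2 from the second, and that the N column should record the loop value N, so group 1 has N observations per replicate and group 2 has 2*N.

```sas
proc iml;
NumSamples = 10;
/* population mean and covariance: group 1 */
mean1 = {6.0 6.0};
Cov1  = {0.5280563 0.502445,
         0.502445  0.5280563};
/* population mean and covariance: group 2 */
mean2 = {6.2499 5.7399};
Cov2  = {0.6280563 0.200978,
         0.200978  0.401956};
call randseed(132);

Z = {. . . . .};    /* five numeric columns now, not four */
create MVN from Z[c={"N" "Group" "ID" "y0" "y1"}];

do N = 5 to 10 by 5;
   N1 = N;          /* group 1: N observations per replicate */
   X  = RandNormal(N1*NumSamples, mean1, Cov1);
   ID = colvec(repeat(T(1:NumSamples), 1, N1));
   Z  = j(nrow(ID),1,N) || j(nrow(ID),1,1) || ID || X;
   append from Z;

   N2 = 2*N;        /* group 2: 2*N observations per replicate */
   X  = RandNormal(N2*NumSamples, mean2, Cov2);
   ID = colvec(repeat(T(1:NumSamples), 1, N2));
   Z  = j(nrow(ID),1,N) || j(nrow(ID),1,2) || ID || X;
   append from Z;
end;

close MVN;
quit;
```

Because the rows for the two groups are appended alternately, sort the data by N and ID (and Group, if needed) before using those variables in a BY statement.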
