Hello,
I am trying to simulate bivariate normal data for two groups with different covariance matrices. I need to generate these data for different sample sizes (e.g., 20, 40, etc.) and 2000 replicates. I want the ratio of sample sizes in group 1 to group 2 to be 1:2.
Below is the code I have for one group. For some reason, the CREATE statement is creating two separate MVN data sets, and the last data set overwrites the first one.
Any help will be appreciated.
It looks like you've already read the article "How to generate multiple samples from the multivariate normal distribution in SAS."
For simulation studies, it can be convenient to write each sample to a data set from within a SAS/IML loop.
So put the CREATE statement before the loop and the CLOSE statement after the loop. Inside the loop, you create both samples and then use the APPEND statement.
It's not clear to me how you want a 1:2 ratio when the sample size is not divisible by 3. For example, when N = 20, do you want 6 and 14 as the sample sizes, or do you want 20 and 40? In the following program, I've used the second option.
I suspect you will also need a second ID variable to identify which observations come from the first distribution and which from the second. I called that the GROUP variable:
proc iml;
NumSamples = 10;
/* specify population mean and covariance: grp1 */
mean1 = {6.0 6.0};
Cov1 = {0.5280563 0.502445,
        0.502445  0.5280563};
/* specify population mean and covariance: grp2 */
mean2 = {6.2499 5.7399};
Cov2 = {0.6280563 0.200978,
        0.200978  0.401956};
call randseed(132);
Z = {. . . .};                 /* tell IML Z is numeric */
create MVN from Z[c={"Group" "ID" "y0" "y1"}];
do N = 5 to 10 by 5;
   /* group 1: draw from the first distribution */
   N1 = N;                     /* or N1 = floor(N/3); ? */
   X = RandNormal(N1*NumSamples, mean1, Cov1);
   ID = colvec(repeat(T(1:NumSamples), 1, N1));
   Group = j(nrow(ID), 1, 1);
   Z = Group || ID || X;
   append from Z;
   /* group 2: draw from the second distribution */
   N2 = 2*N;                   /* or N2 = N - N1; ? */
   X = RandNormal(N2*NumSamples, mean2, Cov2);
   ID = colvec(repeat(T(1:NumSamples), 1, N2));
   Group = j(nrow(ID), 1, 2);
   Z = Group || ID || X;
   append from Z;
end;
close MVN;
quit;
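If you prefer the first option (N is the total sample size, split roughly 1:2), the commented-out alternatives in the program can be sketched as follows. This is only an illustration: the values in N are made up, and rounding group 1 down with FLOOR is an assumption about how you want to handle totals that are not divisible by 3.

```sas
proc iml;
N  = {20 40 50};     /* total sample sizes (illustrative values only) */
N1 = floor(N/3);     /* group 1 gets about one third, rounded down */
N2 = N - N1;         /* group 2 gets the remainder */
print N N1 N2;
quit;
```

For N = 20 this gives the 6 and 14 split mentioned above. The ratio is exactly 1:2 only when N is a multiple of 3.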
Yes, each time it creates a data set named MVN (the exact same name each time, so it overwrites the previous version of MVN). That's how the program is written.
This article from @Rick_SAS explains how you can overcome this
https://blogs.sas.com/content/iml/2015/02/09/array-of-matrices.html
As Paige says, you are overwriting the same data set in the loop. An alternative is to build one data set with successive appends, as follows:
create MVN var {"N" "ID" "y0" "y1" };
do N = 5 to 10 by 5;
   X = RandNormal(N*Numsamples, Mean2, Cov2);
   ID = colvec(repeat(T(1:Numsamples), 1, N));
   Z = j(nrow(X), 1, N) || ID || X;
   append from Z;
end;
close MVN;
I have added the loop variable N to the output data set which you can use in WHERE or BY statements in other SAS PROCs.
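As a rough sketch of what that BY-group processing might look like, the step below computes the mean and standard deviation of y0 and y1 for every simulated sample. The output data set name SampleStats and the statistic names (mean_y0, std_y0, and so on) are made up for this illustration. It assumes MVN was written by the loop above, so the rows are already ordered by N and then ID and no PROC SORT is needed.

```sas
/* one output row per (N, ID) combination, i.e., per simulated sample */
proc means data=MVN noprint;
   by N ID;
   var y0 y1;
   output out=SampleStats mean=mean_y0 mean_y1 std=std_y0 std_y1;
run;

/* look at the results for one sample size only */
proc print data=SampleStats;
   where N = 5;
run;
```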
Great! Many thanks!
Excellent!! Many thanks! Yes, I did read the referenced resource.
Is it possible to add a column for the sample size (N) to allow analysis by ID and N?
Yes, of course. Just add an additional column to Z. You'll want to modify the Z= assignments and the CREATE statement.
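Here is one way those changes might look. Treat it as a sketch: it assumes group 1 is drawn from mean1/Cov1 and group 2 from mean2/Cov2, keeps the "N and 2*N" sizes from the earlier program, and puts N in the first column (the column order is my choice, not a requirement).

```sas
proc iml;
NumSamples = 10;
mean1 = {6.0 6.0};
Cov1  = {0.5280563 0.502445,
         0.502445  0.5280563};
mean2 = {6.2499 5.7399};
Cov2  = {0.6280563 0.200978,
         0.200978  0.401956};
call randseed(132);

Z = {. . . . .};                 /* five numeric columns now */
create MVN from Z[c={"N" "Group" "ID" "y0" "y1"}];
do N = 5 to 10 by 5;
   N1 = N;                       /* group 1 sample size */
   N2 = 2*N;                     /* group 2 sample size (1:2 ratio) */

   X  = RandNormal(N1*NumSamples, mean1, Cov1);
   ID = colvec(repeat(T(1:NumSamples), 1, N1));
   Z  = j(nrow(ID),1,N) || j(nrow(ID),1,1) || ID || X;   /* N, Group=1 */
   append from Z;

   X  = RandNormal(N2*NumSamples, mean2, Cov2);
   ID = colvec(repeat(T(1:NumSamples), 1, N2));
   Z  = j(nrow(ID),1,N) || j(nrow(ID),1,2) || ID || X;   /* N, Group=2 */
   append from Z;
end;
close MVN;
quit;
```

The rows are written in N, Group, ID order, so a BY N ID analysis needs a PROC SORT by N and ID first (or use BY N Group ID directly).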