Hi, I have been trying to simulate some correlated data in SAS Proc IML but I am having some troubles. Any help is appreciated. I saw the post by Rick Wicklin (Rick_SAS). I tried to modify the code provided here: https://communities.sas.com/t5/Statistical-Procedures/Generate-correlated-random-variables-that-foll...
but I have only had little success. Any help would be greatly appreciated. Here's my code below. By I am having some error messages.
I am trying to simulate the variable Age (continuous) and Male (binary) using their means, standard deviation, and covariance as seen in below
SAS Output
0.0689125085 | 0.0015515212 |
0.0015515212 | 0.2501159183 |
SAS Output
530 | 11.23208 | 0.26251 | 5953 | 10.50000 | 12.00000 |
530 | 0.48113 | 0.50012 | 255.00000 | 0 | 1.00000 |
SAS Output
|
| ||||
|
|
Here's the code I was trying to write but there some errors
proc iml;
call randseed(1);
mu_age=11.23208;
sigma_age=0.26251;
p_male=0.48113;
N = 1000;
Mean = {11.23208, 0.48113};
Cov = {0.0689125085 0.0015515212,
0.0015515212 0.2501159183};
Z = RandNormal(N, Mean, Cov);
U = cdf("Normal", Z);
Age= quantile("normal", U[,1], mu_age, sigma_age);
Male= quantile("binomial", U[,2], p_male,1);
mean_Age=mean(Age);
std_Age=std(Age);
mean_Male=mean(Male);
covZ = cov(Z);
covZ1 =cov(Age, Male);
print mean_Age std_Age, mean_Male covZ covz1;
quit;
I am planning on getting the book Simulating data with by Rick Wicklin, but I am hoping someone could help me in the meantime
Thank you
In your example, the variables are uncorrelated (p=0.7861), so you can use independent normal and Bernoulli variables.
I assume, however, you want to see how to simulate the data if the target correlation is larger, such as 0.3.
See Chapter 9 in Simulating Data with SAS (2013), the section on copulas. Although in the chapter I show how to use copulas for continuous distributions, the idea can apply to continuous-discrete distributions. The tricky part (as I discuss in the book) is that if your TARGET correlation is 0.3, you need to generate normal data from an "intermediate" distribution that has a larger correlation, such as 0.37. Here's some code you can play around with:
proc iml;
call randseed(1);
mu_age=11.23208;
sigma_age=0.26251;
p_male=0.48113;
N = 1000;
/* let's try to simulate data that has correlation {1 0.3, 0.3 1} */
R = {1 0.37, /* the intermediate correlation matrix must be inflated if you use Pearson corr */
0.37 1};
Z = randnormal(N, {0 0}, R);
X1 = mu_age + sigma_age*Z[,1]; /* scale to N(mu, sigma) */
U2 = cdf("normal", Z[,2]); /* ~ U(0,1) */
B2 = quantile("Bernoulli", U2, 0.48113); /* ~ Bern(p=0.48) */
X = X1 || B2;
mean = mean(X);
corr = corr(X);
cov = cov(X);
print mean, corr, cov;
Thank you, Reeza. But I couldn't find how to simulate the variables from the quantile distribution
@Rick_SAS wrote a blog about simulating data from its quantiles.
Here is my code .Also check Rick's blog.
%let n=10000;
proc iml;
/* quantiles of total cholesterol from NHANES study
http://blogs.sas.com/content/iml/2014/06/18/distribution-from-quantiles.html
*/
Quantile = {0 , .01, .05, .10, .25, .50, .75, .90, .95, .99, 1};
Estimate = {80, 128, 151, 162, 184, 209, 236, 265, 284, 333, 727};
n=nrow(Quantile);
bin_value=j(n-1,1);
bin_width=j(n-1,1);
p=j(n-1,1);
do i=1 to n-1;
bin_value[i]=Estimate[i];
bin_width[i]=Estimate[i+1]-Estimate[i];
p[i]=Quantile[i+1]-Quantile[i];
end;
print Quantile Estimate bin_value bin_width p;
call randseed(123456789);
_bin_value=sample(bin_value,&n,'replace',p);
_bin_width=j(&n,1);
do j=1 to &n;
_bin_width[j]=bin_width[loc(bin_value=_bin_value[j])];
end;
eps=randfun(&n,'uniform');
value=_bin_value`+ceil(_bin_width#eps);
call histogram(value) label="Simulation" ;
call qntl(SimEst, value, Quantile);
print Quantile[F=percent6.] Estimate SimEst[F=5.1];
quit;
In your example, the variables are uncorrelated (p=0.7861), so you can use independent normal and Bernoulli variables.
I assume, however, you want to see how to simulate the data if the target correlation is larger, such as 0.3.
See Chapter 9 in Simulating Data with SAS (2013), the section on copulas. Although in the chapter I show how to use copulas for continuous distributions, the idea can apply to continuous-discrete distributions. The tricky part (as I discuss in the book) is that if your TARGET correlation is 0.3, you need to generate normal data from an "intermediate" distribution that has a larger correlation, such as 0.37. Here's some code you can play around with:
proc iml;
call randseed(1);
mu_age=11.23208;
sigma_age=0.26251;
p_male=0.48113;
N = 1000;
/* let's try to simulate data that has correlation {1 0.3, 0.3 1} */
R = {1 0.37, /* the intermediate correlation matrix must be inflated if you use Pearson corr */
0.37 1};
Z = randnormal(N, {0 0}, R);
X1 = mu_age + sigma_age*Z[,1]; /* scale to N(mu, sigma) */
U2 = cdf("normal", Z[,2]); /* ~ U(0,1) */
B2 = quantile("Bernoulli", U2, 0.48113); /* ~ Bern(p=0.48) */
X = X1 || B2;
mean = mean(X);
corr = corr(X);
cov = cov(X);
print mean, corr, cov;
Thank you so much, Rick. It worked like a charm.
Two follow-up questions:
1) Is there a particular reason why we should use the correlation matrix instead of the covariance matrix for these types of copula simulations especially since the argument for RANDNORMAL
(N, Mean, Cov ) asks for a covariance? Any clarification would be great
2) In the subsequent simulations after generating the randnormal, could we interchange the Z[,1] and Z[,2] or does SAS know that the first Z[,1] is for Age and Z[,2] for the other variable Male?
Thank you so much
1. No reason, other than my preference.
2. You can use the first column to simulate age, if you prefer.
Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9. Sign up by March 14 for just $795.
Learn how to run multiple linear regression models with and without interactions, presented by SAS user Alex Chaplin.
Find more tutorials on the SAS Users YouTube channel.