Hi, I have been trying to simulate some correlated data in SAS Proc IML but I am having some troubles. Any help is appreciated. I saw the post by Rick Wicklin (Rick_SAS). I tried to modify the code provided here: https://communities.sas.com/t5/Statistical-Procedures/Generate-correlated-random-variables-that-foll...
but I have only had little success. Any help would be greatly appreciated. Here's my code below. By I am having some error messages.
I am trying to simulate the variable Age (continuous) and Male (binary) using their means, standard deviation, and covariance as seen in below
SAS Output
0.0689125085 | 0.0015515212 |
0.0015515212 | 0.2501159183 |
SAS Output
530 | 11.23208 | 0.26251 | 5953 | 10.50000 | 12.00000 |
530 | 0.48113 | 0.50012 | 255.00000 | 0 | 1.00000 |
SAS Output
|
| ||||
|
|
Here's the code I was trying to write but there some errors
proc iml;
call randseed(1);
mu_age=11.23208;
sigma_age=0.26251;
p_male=0.48113;
N = 1000;
Mean = {11.23208, 0.48113};
Cov = {0.0689125085 0.0015515212,
0.0015515212 0.2501159183};
Z = RandNormal(N, Mean, Cov);
U = cdf("Normal", Z);
Age= quantile("normal", U[,1], mu_age, sigma_age);
Male= quantile("binomial", U[,2], p_male,1);
mean_Age=mean(Age);
std_Age=std(Age);
mean_Male=mean(Male);
covZ = cov(Z);
covZ1 =cov(Age, Male);
print mean_Age std_Age, mean_Male covZ covz1;
quit;
I am planning on getting the book Simulating data with by Rick Wicklin, but I am hoping someone could help me in the meantime
Thank you
In your example, the variables are uncorrelated (p=0.7861), so you can use independent normal and Bernoulli variables.
I assume, however, you want to see how to simulate the data if the target correlation is larger, such as 0.3.
See Chapter 9 in Simulating Data with SAS (2013), the section on copulas. Although in the chapter I show how to use copulas for continuous distributions, the idea can apply to continuous-discrete distributions. The tricky part (as I discuss in the book) is that if your TARGET correlation is 0.3, you need to generate normal data from an "intermediate" distribution that has a larger correlation, such as 0.37. Here's some code you can play around with:
proc iml;
call randseed(1);
mu_age=11.23208;
sigma_age=0.26251;
p_male=0.48113;
N = 1000;
/* let's try to simulate data that has correlation {1 0.3, 0.3 1} */
R = {1 0.37, /* the intermediate correlation matrix must be inflated if you use Pearson corr */
0.37 1};
Z = randnormal(N, {0 0}, R);
X1 = mu_age + sigma_age*Z[,1]; /* scale to N(mu, sigma) */
U2 = cdf("normal", Z[,2]); /* ~ U(0,1) */
B2 = quantile("Bernoulli", U2, 0.48113); /* ~ Bern(p=0.48) */
X = X1 || B2;
mean = mean(X);
corr = corr(X);
cov = cov(X);
print mean, corr, cov;
Thank you, Reeza. But I couldn't find how to simulate the variables from the quantile distribution
@Rick_SAS wrote a blog about simulating data from its quantiles.
Here is my code .Also check Rick's blog.
%let n=10000;
proc iml;
/* quantiles of total cholesterol from NHANES study
http://blogs.sas.com/content/iml/2014/06/18/distribution-from-quantiles.html
*/
Quantile = {0 , .01, .05, .10, .25, .50, .75, .90, .95, .99, 1};
Estimate = {80, 128, 151, 162, 184, 209, 236, 265, 284, 333, 727};
n=nrow(Quantile);
bin_value=j(n-1,1);
bin_width=j(n-1,1);
p=j(n-1,1);
do i=1 to n-1;
bin_value[i]=Estimate[i];
bin_width[i]=Estimate[i+1]-Estimate[i];
p[i]=Quantile[i+1]-Quantile[i];
end;
print Quantile Estimate bin_value bin_width p;
call randseed(123456789);
_bin_value=sample(bin_value,&n,'replace',p);
_bin_width=j(&n,1);
do j=1 to &n;
_bin_width[j]=bin_width[loc(bin_value=_bin_value[j])];
end;
eps=randfun(&n,'uniform');
value=_bin_value`+ceil(_bin_width#eps);
call histogram(value) label="Simulation" ;
call qntl(SimEst, value, Quantile);
print Quantile[F=percent6.] Estimate SimEst[F=5.1];
quit;
In your example, the variables are uncorrelated (p=0.7861), so you can use independent normal and Bernoulli variables.
I assume, however, you want to see how to simulate the data if the target correlation is larger, such as 0.3.
See Chapter 9 in Simulating Data with SAS (2013), the section on copulas. Although in the chapter I show how to use copulas for continuous distributions, the idea can apply to continuous-discrete distributions. The tricky part (as I discuss in the book) is that if your TARGET correlation is 0.3, you need to generate normal data from an "intermediate" distribution that has a larger correlation, such as 0.37. Here's some code you can play around with:
proc iml;
call randseed(1);
mu_age=11.23208;
sigma_age=0.26251;
p_male=0.48113;
N = 1000;
/* let's try to simulate data that has correlation {1 0.3, 0.3 1} */
R = {1 0.37, /* the intermediate correlation matrix must be inflated if you use Pearson corr */
0.37 1};
Z = randnormal(N, {0 0}, R);
X1 = mu_age + sigma_age*Z[,1]; /* scale to N(mu, sigma) */
U2 = cdf("normal", Z[,2]); /* ~ U(0,1) */
B2 = quantile("Bernoulli", U2, 0.48113); /* ~ Bern(p=0.48) */
X = X1 || B2;
mean = mean(X);
corr = corr(X);
cov = cov(X);
print mean, corr, cov;
Thank you so much, Rick. It worked like a charm.
Two follow-up questions:
1) Is there a particular reason why we should use the correlation matrix instead of the covariance matrix for these types of copula simulations especially since the argument for RANDNORMAL
(N, Mean, Cov ) asks for a covariance? Any clarification would be great
2) In the subsequent simulations after generating the randnormal, could we interchange the Z[,1] and Z[,2] or does SAS know that the first Z[,1] is for Age and Z[,2] for the other variable Male?
Thank you so much
1. No reason, other than my preference.
2. You can use the first column to simulate age, if you prefer.