BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
SimRock
Obsidian | Level 7

Hi, I have been trying to simulate some correlated data in SAS Proc IML but I am having some troubles. Any help is appreciated. I saw the post by Rick Wicklin (

 

Covariance Matrix, DF = 529  AGE MALEAGEMALE
0.06891250850.0015515212
0.00155152120.2501159183

Simple StatisticsVariable N Mean Std Dev Sum Minimum MaximumAGEMALE
53011.232080.26251595310.5000012.00000
5300.481130.50012255.0000001.00000

 

SAS Output

Pearson Correlation Coefficients, N = 530Prob > |r| under H0: Rho=0  AGE MALEAGEMALE
1.00000
 
0.01182
0.7861
0.01182
0.7861
1.00000

Here's the code I was trying to write but there some errors

 

proc iml;
	call randseed(1);
	mu_age=11.23208;
  	sigma_age=0.26251;
 	p_male=0.48113;
	N = 1000;
	Mean = {11.23208,	0.48113};
	Cov = 	{0.0689125085	0.0015515212, 
			0.0015515212	0.2501159183};


  Z = RandNormal(N, Mean, Cov);
  

  U = cdf("Normal", Z);  

  Age= quantile("normal", U[,1], mu_age, sigma_age);

  Male= quantile("binomial", U[,2], p_male,1);

  mean_Age=mean(Age);
  std_Age=std(Age);
  mean_Male=mean(Male);
  covZ = cov(Z);
  covZ1 =cov(Age, Male);
  
 print mean_Age std_Age, mean_Male covZ covz1;
quit;

I am planning on getting the book Simulating data with by Rick Wicklin, but I am hoping someone could help me in the meantime

Thank you

 

1 ACCEPTED SOLUTION

Accepted Solutions
Rick_SAS
SAS Super FREQ

In your example, the variables are uncorrelated (p=0.7861), so you can use independent normal and Bernoulli variables.

 

I assume, however, you want to see how to simulate the data if the target correlation is larger, such as 0.3.

See Chapter 9 in Simulating Data with SAS (2013), the section on copulas. Although in the chapter I show how to use copulas for continuous distributions, the idea can apply to continuous-discrete distributions. The tricky part (as I discuss in the book) is that if your TARGET correlation is 0.3, you need to generate normal data from an "intermediate" distribution that has a larger correlation, such as 0.37. Here's some code you can play around with:

 

proc iml;
	call randseed(1);
	mu_age=11.23208;
  	sigma_age=0.26251;
 	p_male=0.48113;
	N = 1000;

/* let's try to simulate data that has correlation {1 0.3, 0.3 1} */
R = {1    0.37,  /* the intermediate correlation matrix must be inflated if you use Pearson corr */
     0.37 1};
Z = randnormal(N, {0 0}, R);

X1 = mu_age + sigma_age*Z[,1];           /* scale to N(mu, sigma) */
U2 = cdf("normal", Z[,2]);               /* ~ U(0,1) */
B2 = quantile("Bernoulli", U2, 0.48113); /* ~ Bern(p=0.48) */
X = X1 || B2;

mean = mean(X);
corr = corr(X);
cov = cov(X);
print mean, corr, cov;

 

 

View solution in original post

8 REPLIES 8
SimRock
Obsidian | Level 7

Thank you, Reeza. But I couldn't find how to simulate the variables from the quantile distribution

Ksharp
Super User

@Rick_SAS wrote a blog about simulating data from its quantiles.

 

Here is my code .Also check Rick's blog.


%let n=10000;
proc iml;
/* quantiles of total cholesterol from NHANES study
http://blogs.sas.com/content/iml/2014/06/18/distribution-from-quantiles.html
 */
Quantile = {0 , .01, .05, .10, .25, .50, .75, .90, .95, .99, 1};
Estimate = {80, 128, 151, 162, 184, 209, 236, 265, 284, 333, 727};
n=nrow(Quantile);
bin_value=j(n-1,1);
bin_width=j(n-1,1);
p=j(n-1,1);
do i=1 to n-1;
 bin_value[i]=Estimate[i];
 bin_width[i]=Estimate[i+1]-Estimate[i];
 p[i]=Quantile[i+1]-Quantile[i];
end;
print Quantile Estimate bin_value bin_width p;
call randseed(123456789);
_bin_value=sample(bin_value,&n,'replace',p);
_bin_width=j(&n,1);
do j=1 to &n;
 _bin_width[j]=bin_width[loc(bin_value=_bin_value[j])];
end;
eps=randfun(&n,'uniform');
value=_bin_value`+ceil(_bin_width#eps);
call histogram(value) label="Simulation" ;
call qntl(SimEst, value, Quantile);
print Quantile[F=percent6.] Estimate SimEst[F=5.1];
quit;

 

Rick_SAS
SAS Super FREQ

In your example, the variables are uncorrelated (p=0.7861), so you can use independent normal and Bernoulli variables.

 

I assume, however, you want to see how to simulate the data if the target correlation is larger, such as 0.3.

See Chapter 9 in Simulating Data with SAS (2013), the section on copulas. Although in the chapter I show how to use copulas for continuous distributions, the idea can apply to continuous-discrete distributions. The tricky part (as I discuss in the book) is that if your TARGET correlation is 0.3, you need to generate normal data from an "intermediate" distribution that has a larger correlation, such as 0.37. Here's some code you can play around with:

 

proc iml;
	call randseed(1);
	mu_age=11.23208;
  	sigma_age=0.26251;
 	p_male=0.48113;
	N = 1000;

/* let's try to simulate data that has correlation {1 0.3, 0.3 1} */
R = {1    0.37,  /* the intermediate correlation matrix must be inflated if you use Pearson corr */
     0.37 1};
Z = randnormal(N, {0 0}, R);

X1 = mu_age + sigma_age*Z[,1];           /* scale to N(mu, sigma) */
U2 = cdf("normal", Z[,2]);               /* ~ U(0,1) */
B2 = quantile("Bernoulli", U2, 0.48113); /* ~ Bern(p=0.48) */
X = X1 || B2;

mean = mean(X);
corr = corr(X);
cov = cov(X);
print mean, corr, cov;

 

 

SimRock
Obsidian | Level 7

Thank you so much, Rick. It worked like a charm.

Two follow-up questions:

 

1) Is there a particular reason why we should use the correlation matrix instead of the covariance matrix for these types of copula simulations especially since the argument for RANDNORMAL (N, Mean, Cov )  asks for a covariance? Any clarification would be great

 

2) In the subsequent simulations after generating the randnormal, could we interchange the Z[,1] and Z[,2] or does SAS know that the first Z[,1] is for Age and Z[,2] for the other variable Male?

 

Thank you so much 

Rick_SAS
SAS Super FREQ

1. No reason, other than my preference. 

2. You can use the first column to simulate age, if you prefer.

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

Multiple Linear Regression in SAS

Learn how to run multiple linear regression models with and without interactions, presented by SAS user Alex Chaplin.

Find more tutorials on the SAS Users YouTube channel.

From The DO Loop
Want more? Visit our blog for more articles like these.
Discussion stats
  • 8 replies
  • 3089 views
  • 5 likes
  • 4 in conversation