BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
vstorme
Obsidian | Level 7

Dear sasusers,

 

For a genome-wide association application, I would like to simulate data with a given variance-covariance structure. I want to simulate a vector with n observations. The variance is given by:

var(Y) = K*sigma_p + I*sigma_e

 

K is a similarity matrix of dimension nxn and I is the unity matrix of dimension nxn. There is one fixed effect.

 

I don't know where to start. Can somebody give me some hints or links?

 

with kind regards,

Veronique

1 ACCEPTED SOLUTION

Accepted Solutions
Rick_SAS
SAS Super FREQ

When you are developing code, don't begin by using the full size problem (100 dimensions). Use a smaller example.

 

In your program:

1. You don't need NearestCorr (Higham's algorithm) because the AR1Corr function returns a positive definite kinship matrix.

2. I think your main problem is that you are assigning snp1 = snp2 = ... = snp5 = GT[,1].  If these are supposed to be covariates in the mixed model, they can't be equal or you have a singular design matrix. I think the ERROR is coming from this mistake. Also, shouldn't they be CLASS variables in PROC MIXED?

3. You are currently generating one random variate from MVN(0, varcov. I assume this should be 100 variates:
eps = RandNormal(100, zero, varCov).

 

View solution in original post

6 REPLIES 6
PaigeMiller
Diamond | Level 26

You don't say what distribution you want to simulate with. If you are talking about a multivariate normal distribution, you can generate random observations from a multivariate normal distribution with specified covariance matrix using PROC SIMNORMAL. https://blogs.sas.com/content/iml/2017/09/25/simulate-multivariate-normal-data-sas-simnormal.html

--
Paige Miller
Ksharp
Super User

@Rick_SAS  might give you a hand.

Rick wrote a blog about it before, although you need SAS/IML module .

Rick_SAS
SAS Super FREQ

See Section 8.3 of Simulating Data with SAS.

 

You might also want to look at Section 12.3, especially p 232-235, for applications to mixed models. There is an errata for p. 233. The math is correct but the CS and R subscripts are reversed. The term \sigma^2_R on the diagonal is the residual covariance. The off-diagonal terms, which should be labeled as \sigma^2_{CS},  is the common covariance.

vstorme
Obsidian | Level 7

Based on chapter 12.3.2 and implementing Higham's algorithm, I simulated the following:

 

/******************
* test simulation *
******************/

* step 1: create 5 random SNPS;
**********************************;
* in practice you retrieve 5 snps that are located nearby and you assemble them in a dataframe;
data Categories(keep=V1--V5);
  call streaminit(4321);
  array p[3] (0.4 0.2 0.4); /* probabilities */
  do i = 1 to 100;
  V1 = rand("Table", of p[*]); /* use OF operator */
	V2 = rand("Table", of p[*]); /* use OF operator */
  V3 = rand("Table", of p[*]); /* use OF operator */
  V4 = rand("Table", of p[*]); /* use OF operator */
  V5 = rand("Table", of p[*]); /* use OF operator */
  output;
  end;
run;
proc print data=Categories;
run;
data snps (keep= snp1--snp5);
  set Categories;
	snp1 = V1-1;
	snp2 = V2-1;
	snp3 = V3-1;
	snp4 = V4-1;
	snp5 = V5-1;
	keep snp1--snp
run;
proc print data=snps;
run;
proc freq data=snps;
tables snp1 snp2 snp3 snp4 snp5/ nocum nopercent;
run;

* step2: create a kinship matrix, I will use the AR1 structure to get a quick sort of similarity matrix;
*******************************************************************************************;
proc iml;
/* return p x p matrix whose (i,j)th element is rho^|i - j| */
  start AR1Corr(rho, n); 
     return rho##distance(T(1:n), T(1:n), "L1");
  finish;
  /* test on 100 x 100 matrix with rho = 0.8 */
  rho = 0.8;  
  n = 100;
  kinship = AR1Corr(rho, n);
  print kinship[format=Best7.];

	*export to dataset;
  create dataK from kinship;
  append from kinship;
  close dataK;
quit;

* simulate a trait;
*******************;
* in practice read the kinship from csv file, create matrix from dataset;
proc iml;
  * import kinship from data;
  use work.dataK;                       /* open the data set */
  read all var _ALL_ into K; 
  close work.dataK;
  print(K); 

	* implement Higham's algorithm for computing the nearest correlation matrix to a given symmetric matrix;
	* Ref: https://blogs.sas.com/content/iml/2012/11/28/computing-the-nearest-correlation-matrix.html ;

	/* Project symmetric X onto S={positive semidefinite matrices}.
   Replace any negative eigenvalues of X with zero */
   start ProjS(X);
     call eigen(D, Q, X); /* note X = Q*D*Q` */
     V = choose(D>0, D, 0);
     W = Q#sqrt(V`);      /* form Q*diag(V)*Q` */
     return( W*W` );      /* W*W` = Q*diag(V)*Q` */
   finish;
 
  /* project square X onto U={matrices with unit diagonal}.
   Return X with the diagonal elements replaced by ones. */
  start ProjU(X);
     n = nrow(X);
     Y = X;
     diagIdx = do(1, n*n, n+1);
     Y[diagIdx] = 1;      /* set diagonal elements to 1 */
     return ( Y );
  finish;
 
  /* Helper function: the infinity norm is the max abs value of the row sums */
  start MatInfNorm(A);
     return( max(abs(A[,+])) );
  finish;
 
  /* Given a symmetric correlation matrix, A, 
   project A onto the space of positive semidefinite matrices.
   The function uses the algorithm of Higham (2002) to return
   the matrix X that is closest to A in the Frobenius norm. */
  start NearestCorr(A);
     maxIter = 100; tol = 1e-8;  /* parameters...you can change these */
     iter = 1;      maxd   = 1;  /* initial values */ 
     Yold = A;  Xold = A;  dS = 0;
 
     do while( (iter <= maxIter) & (maxd > tol) );
       R = Yold - dS; /* dS is Dykstra's correction */
       X = ProjS(R);  /* project onto S={positive semidefinite} */
       dS = X - R;
       Y = ProjU(X);  /* project onto U={matrices with unit diagonal} */
 
       /* How much has X changed? (Eqn 4.1) */
       dx = MatInfNorm(X-Xold) / MatInfNorm(X);
       dy = MatInfNorm(Y-Yold) / MatInfNorm(Y);
       dxy = MatInfNorm(Y - X) / MatInfNorm(Y);
       maxd = max(dx,dy,dxy);
 
       iter = iter + 1;
       Xold = X; Yold = Y; /* update matrices */
     end;
     return( X ); /* X is positive semidefinite */
  finish;

	KK = NearestCorr(K);
	  /* for large matrices, might need to correct for numerical rounding errors */
  epsilon = 1e-10;
  KKK = ProjU( KK/(1+epsilon) );  /* divide off-diagonal elements by 1+epsilon */
  
	* import snps from data;
  use work.snps;                       /* open the data set */
  read all var _ALL_ into GT; 
  close work.snps;
  print(GT);
  
  snp1 = GT[,1]; 
	snp2 = GT[,1]; 
  snp3 = GT[,1]; 
  snp4 = GT[,1]; 
  snp5 = GT[,1]; 

	* set parms;
	n=100;                            /* 100 subjects  */
  sigma2_g = 0.1;                 /* Parameter 1: residual covariance */
  sigma2_e = 0.5;
  beta = {1, 0.5, 0.7, 3, 0.7, 0.5}; /* ratio data, value of 1 for intercept*/

	* varcov;
	G = sigma2_g*KKK ;
	R = sigma2_e*I(n);
	varcov = G + R ;

	* subjectIDs;
	treeID = colvec(1:n);
	*print(treeID);

	* design matrix;
	X = J(n,1,1)||GT ; /*col vector of 1 for intercept*/

	* simulate trait;
	call randseed(1234);
  create Mix var {"treeID" "Trait1" "snp1" "snp2" "snp3" "snp4" "snp5"};     
	          /* create dataset testsim  */

  * Note:  if you are trying to simulate random multivariate normal data, 
	you must use a positive definite matrix.;

  zero = j(1, n, 0);            /* the zero vector                  */
  eps = RandNormal(1, zero, varcov);   /* eps ~ MVN(0,R)                   */
  trait1 = X*beta + eps`;         /* fixed effects, correlated errors */

  append; 
  close;
quit;
	

* analysis;
**************;
proc sgplot data=Mix;
  title "distribution parameters";
  histogram trait1;
  density trait1;
  density trait1/type=kernel;
run;
* you clearly see a biodal distribution;

* analysis;
**********************************;
* the kinship data set must have in
 - column 1 : the treeIDs 
 - column 2 : parm (all 1's)
 - column 3 : values from 1 to n ;
data kin ;
  retain treeID parm row ;
  set datak;
	treeID = _N_ ;
	parm = 1 ;
	row = _N_ ;
run;
Proc mixed data=Mix METHOD=REML noprofile covtest itdetails ;
		class treeID  ;
		model trait1 = snp1 snp2 snp3 snp4 snp5 /solution  ddfm=KR chisq;
		random treeID / type=lin(1) ldata=work.kin ;
		parms/ lowerb= 0 0;
		ods output covParms=myParms;
run;

I do however not have the expected results. I can't see what I did wrong.

Despite the use of Higham's algorithm, I get the notes:

An infinite likelihood is assumed in iteration 1 because of a nonpositive definite
estimated V matrix for Subject 1.
NOTE: Convergence criteria met.
NOTE: Estimated G matrix is not positive definite.

Rick_SAS
SAS Super FREQ

When you are developing code, don't begin by using the full size problem (100 dimensions). Use a smaller example.

 

In your program:

1. You don't need NearestCorr (Higham's algorithm) because the AR1Corr function returns a positive definite kinship matrix.

2. I think your main problem is that you are assigning snp1 = snp2 = ... = snp5 = GT[,1].  If these are supposed to be covariates in the mixed model, they can't be equal or you have a singular design matrix. I think the ERROR is coming from this mistake. Also, shouldn't they be CLASS variables in PROC MIXED?

3. You are currently generating one random variate from MVN(0, varcov. I assume this should be 100 variates:
eps = RandNormal(100, zero, varCov).

 

vstorme
Obsidian | Level 7

Thanks a lot Rick, this was really helpful, I found my mistakes.

 

In your program:

1. You don't need NearestCorr (Higham's algorithm) because the AR1Corr function returns a positive definite kinship matrix.

     Here it is not needed, but I might need it when I am using the true similarity matrix

2. I think your main problem is that you are assigning snp1 = snp2 = ... = snp5 = GT[,1].  If these are supposed to be covariates in the mixed model, they can't be equal or you have a singular design matrix. I think the ERROR is coming from this mistake. Also, shouldn't they be CLASS variables in PROC MIXED?

I overlooked this, it should be snp2=GT[,2], snp3=GT[,3],...Now, I get the expected results Smiley Happy

I assume a dosage effect of the alleles, snp is either 0 (no allele effect, snp=1 effect of 1 allele, snp=2, effect of 2 alleles), hence they are not class variables.

3. You are currently generating one random variate from MVN(0, varcov. I assume this should be 100 variates:eps = RandNormal(100, zero, varCov).

I think eps = RandNormal(1, zero, R) is ok. Trait has dimension 100x1, X(100x6), beta(6x1), and thus eps(100x1). This is similar to the example 12.3.2.2. in your book on simulation.

SAS Innovate 2025: Call for Content

Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!

Submit your idea!

What is ANOVA?

ANOVA, or Analysis Of Variance, is used to compare the averages or means of two or more populations to better understand how they differ. Watch this tutorial for more.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 6 replies
  • 1694 views
  • 4 likes
  • 4 in conversation