topic Re: simulate with given variance-covariance in Statistical Procedures

simulate with given variance-covariance

vstorme — Mon, 27 May 2019 08:32:35 GMT

Dear sasusers,

For a genome-wide association application, I would like to simulate data with a given variance-covariance structure. I want to simulate a vector with n observations. The variance is given by:

var(Y) = K*sigma_p + I*sigma_e

K is a similarity matrix of dimension nxn and I is the unity matrix of dimension nxn. There is one fixed effect.

I don't know where to start. Can somebody give me some hints or links?

with kind regards,

Veronique

Re: simulate with given variance-covariance

PaigeMiller — Mon, 27 May 2019 10:55:18 GMT

You don't say what distribution you want to simulate with. If you are talking about a multivariate normal distribution, you can generate random observations from a multivariate normal distribution with specified covariance matrix using PROC SIMNORMAL. https://blogs.sas.com/content/iml/2017/09/25/simulate-multivariate-normal-data-sas-simnormal.html

Re: simulate with given variance-covariance

Ksharp — Mon, 27 May 2019 12:34:32 GMT

@Rick_SAS might give you a hand.

Rick wrote a blog about it before, although you need SAS/IML module .

Re: simulate with given variance-covariance

Rick_SAS — Tue, 28 May 2019 13:55:21 GMT

See Section 8.3 of Simulating Data with SAS.

You might also want to look at Section 12.3, especially p 232-235, for applications to mixed models. There is an errata for p. 233. The math is correct but the CS and R subscripts are reversed. The term \sigma^2_R on the diagonal is the residual covariance. The off-diagonal terms, which should be labeled as \sigma^2_{CS}, is the common covariance.

Re: simulate with given variance-covariance

vstorme — Mon, 03 Jun 2019 15:19:12 GMT

Based on chapter 12.3.2 and implementing Higham's algorithm, I simulated the following:

/******************
* test simulation *
******************/

* step 1: create 5 random SNPS;
**********************************;
* in practice you retrieve 5 snps that are located nearby and you assemble them in a dataframe;
data Categories(keep=V1--V5);
  call streaminit(4321);
  array p[3] (0.4 0.2 0.4); /* probabilities */
  do i = 1 to 100;
  V1 = rand("Table", of p[*]); /* use OF operator */
	V2 = rand("Table", of p[*]); /* use OF operator */
  V3 = rand("Table", of p[*]); /* use OF operator */
  V4 = rand("Table", of p[*]); /* use OF operator */
  V5 = rand("Table", of p[*]); /* use OF operator */
  output;
  end;
run;
proc print data=Categories;
run;
data snps (keep= snp1--snp5);
  set Categories;
	snp1 = V1-1;
	snp2 = V2-1;
	snp3 = V3-1;
	snp4 = V4-1;
	snp5 = V5-1;
	keep snp1--snp
run;
proc print data=snps;
run;
proc freq data=snps;
tables snp1 snp2 snp3 snp4 snp5/ nocum nopercent;
run;

* step2: create a kinship matrix, I will use the AR1 structure to get a quick sort of similarity matrix;
*******************************************************************************************;
proc iml;
/* return p x p matrix whose (i,j)th element is rho^|i - j| */
  start AR1Corr(rho, n); 
     return rho##distance(T(1:n), T(1:n), "L1");
  finish;
  /* test on 100 x 100 matrix with rho = 0.8 */
  rho = 0.8;  
  n = 100;
  kinship = AR1Corr(rho, n);
  print kinship[format=Best7.];

	*export to dataset;
  create dataK from kinship;
  append from kinship;
  close dataK;
quit;

* simulate a trait;
*******************;
* in practice read the kinship from csv file, create matrix from dataset;
proc iml;
  * import kinship from data;
  use work.dataK;                       /* open the data set */
  read all var _ALL_ into K; 
  close work.dataK;
  print(K); 

	* implement Higham's algorithm for computing the nearest correlation matrix to a given symmetric matrix;
	* Ref: https://blogs.sas.com/content/iml/2012/11/28/computing-the-nearest-correlation-matrix.html ;

	/* Project symmetric X onto S={positive semidefinite matrices}.
   Replace any negative eigenvalues of X with zero */
   start ProjS(X);
     call eigen(D, Q, X); /* note X = Q*D*Q` */
     V = choose(D>0, D, 0);
     W = Q#sqrt(V`);      /* form Q*diag(V)*Q` */
     return( W*W` );      /* W*W` = Q*diag(V)*Q` */
   finish;
 
  /* project square X onto U={matrices with unit diagonal}.
   Return X with the diagonal elements replaced by ones. */
  start ProjU(X);
     n = nrow(X);
     Y = X;
     diagIdx = do(1, n*n, n+1);
     Y[diagIdx] = 1;      /* set diagonal elements to 1 */
     return ( Y );
  finish;
 
  /* Helper function: the infinity norm is the max abs value of the row sums */
  start MatInfNorm(A);
     return( max(abs(A[,+])) );
  finish;
 
  /* Given a symmetric correlation matrix, A, 
   project A onto the space of positive semidefinite matrices.
   The function uses the algorithm of Higham (2002) to return
   the matrix X that is closest to A in the Frobenius norm. */
  start NearestCorr(A);
     maxIter = 100; tol = 1e-8;  /* parameters...you can change these */
     iter = 1;      maxd   = 1;  /* initial values */ 
     Yold = A;  Xold = A;  dS = 0;
 
     do while( (iter <= maxIter) & (maxd > tol) );
       R = Yold - dS; /* dS is Dykstra's correction */
       X = ProjS(R);  /* project onto S={positive semidefinite} */
       dS = X - R;
       Y = ProjU(X);  /* project onto U={matrices with unit diagonal} */
 
       /* How much has X changed? (Eqn 4.1) */
       dx = MatInfNorm(X-Xold) / MatInfNorm(X);
       dy = MatInfNorm(Y-Yold) / MatInfNorm(Y);
       dxy = MatInfNorm(Y - X) / MatInfNorm(Y);
       maxd = max(dx,dy,dxy);
 
       iter = iter + 1;
       Xold = X; Yold = Y; /* update matrices */
     end;
     return( X ); /* X is positive semidefinite */
  finish;

	KK = NearestCorr(K);
	  /* for large matrices, might need to correct for numerical rounding errors */
  epsilon = 1e-10;
  KKK = ProjU( KK/(1+epsilon) );  /* divide off-diagonal elements by 1+epsilon */
  
	* import snps from data;
  use work.snps;                       /* open the data set */
  read all var _ALL_ into GT; 
  close work.snps;
  print(GT);
  
  snp1 = GT[,1]; 
	snp2 = GT[,1]; 
  snp3 = GT[,1]; 
  snp4 = GT[,1]; 
  snp5 = GT[,1]; 

	* set parms;
	n=100;                            /* 100 subjects  */
  sigma2_g = 0.1;                 /* Parameter 1: residual covariance */
  sigma2_e = 0.5;
  beta = {1, 0.5, 0.7, 3, 0.7, 0.5}; /* ratio data, value of 1 for intercept*/

	* varcov;
	G = sigma2_g*KKK ;
	R = sigma2_e*I(n);
	varcov = G + R ;

	* subjectIDs;
	treeID = colvec(1:n);
	*print(treeID);

	* design matrix;
	X = J(n,1,1)||GT ; /*col vector of 1 for intercept*/

	* simulate trait;
	call randseed(1234);
  create Mix var {"treeID" "Trait1" "snp1" "snp2" "snp3" "snp4" "snp5"};     
	          /* create dataset testsim  */

  * Note:  if you are trying to simulate random multivariate normal data, 
	you must use a positive definite matrix.;

  zero = j(1, n, 0);            /* the zero vector                  */
  eps = RandNormal(1, zero, varcov);   /* eps ~ MVN(0,R)                   */
  trait1 = X*beta + eps`;         /* fixed effects, correlated errors */

  append; 
  close;
quit;
	

* analysis;
**************;
proc sgplot data=Mix;
  title "distribution parameters";
  histogram trait1;
  density trait1;
  density trait1/type=kernel;
run;
* you clearly see a biodal distribution;

* analysis;
**********************************;
* the kinship data set must have in
 - column 1 : the treeIDs 
 - column 2 : parm (all 1's)
 - column 3 : values from 1 to n ;
data kin ;
  retain treeID parm row ;
  set datak;
	treeID = _N_ ;
	parm = 1 ;
	row = _N_ ;
run;
Proc mixed data=Mix METHOD=REML noprofile covtest itdetails ;
		class treeID  ;
		model trait1 = snp1 snp2 snp3 snp4 snp5 /solution  ddfm=KR chisq;
		random treeID / type=lin(1) ldata=work.kin ;
		parms/ lowerb= 0 0;
		ods output covParms=myParms;
run;

I do however not have the expected results. I can't see what I did wrong.

Despite the use of Higham's algorithm, I get the notes:

An infinite likelihood is assumed in iteration 1 because of a nonpositive definite
estimated V matrix for Subject 1.
NOTE: Convergence criteria met.
NOTE: Estimated G matrix is not positive definite.

Re: simulate with given variance-covariance

Rick_SAS — Mon, 03 Jun 2019 15:51:25 GMT

When you are developing code, don't begin by using the full size problem (100 dimensions). Use a smaller example.

In your program:

1. You don't need NearestCorr (Higham's algorithm) because the AR1Corr function returns a positive definite kinship matrix.

2. I think your main problem is that you are assigning snp1 = snp2 = ... = snp5 = GT[,1]. If these are supposed to be covariates in the mixed model, they can't be equal or you have a singular design matrix. I think the ERROR is coming from this mistake. Also, shouldn't they be CLASS variables in PROC MIXED?

3. You are currently generating one random variate from MVN(0, varcov. I assume this should be 100 variates:
eps = RandNormal(100, zero, varCov).

Re: simulate with given variance-covariance

vstorme — Tue, 04 Jun 2019 08:20:27 GMT

Thanks a lot Rick, this was really helpful, I found my mistakes.

In your program:

1. You don't need NearestCorr (Higham's algorithm) because the AR1Corr function returns a positive definite kinship matrix.

Here it is not needed, but I might need it when I am using the true similarity matrix

I overlooked this, it should be snp2=GT[,2], snp3=GT[,3],...Now, I get the expected results

I assume a dosage effect of the alleles, snp is either 0 (no allele effect, snp=1 effect of 1 allele, snp=2, effect of 2 alleles), hence they are not class variables.

3. You are currently generating one random variate from MVN(0, varcov. I assume this should be 100 variates:eps = RandNormal(100, zero, varCov).

I think eps = RandNormal(1, zero, R) is ok. Trait has dimension 100x1, X(100x6), beta(6x1), and thus eps(100x1). This is similar to the example 12.3.2.2. in your book on simulation.