Turn on suggestions

Auto-suggest helps you quickly narrow down your search results by suggesting possible matches as you type.

Showing results for

- Home
- /
- Analytics
- /
- Stat Procs
- /
- simulate with given variance-covariance

Options

- RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Mute
- Printer Friendly Page

🔒 This topic is **solved** and **locked**.
Need further help from the community? Please
sign in and ask a **new** question.

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Posted 05-27-2019 04:32 AM
(1693 views)

Dear sasusers,

For a genome-wide association application, I would like to simulate data with a given variance-covariance structure. I want to simulate a vector with n observations. The variance is given by:

var(Y) = K*sigma_p + I*sigma_e

K is a similarity matrix of dimension nxn and I is the unity matrix of dimension nxn. There is one fixed effect.

I don't know where to start. Can somebody give me some hints or links?

with kind regards,

Veronique

1 ACCEPTED SOLUTION

Accepted Solutions

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

When you are developing code, don't begin by using the full size problem (100 dimensions). Use a smaller example.

In your program:

1. You don't need NearestCorr (Higham's algorithm) because the AR1Corr function returns a positive definite kinship matrix.

2. I think your main problem is that you are assigning snp1 = snp2 = ... = snp5 = GT[,1]. If these are supposed to be covariates in the mixed model, they can't be equal or you have a singular design matrix. I think the ERROR is coming from this mistake. Also, shouldn't they be CLASS variables in PROC MIXED?

3. You are currently generating one random variate from MVN(0, varcov. I assume this should be 100 variates:

eps = RandNormal(100, zero, varCov).

6 REPLIES 6

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

You don't say what distribution you want to simulate with. If you are talking about a multivariate normal distribution, you can generate random observations from a multivariate normal distribution with specified covariance matrix using PROC SIMNORMAL. https://blogs.sas.com/content/iml/2017/09/25/simulate-multivariate-normal-data-sas-simnormal.html

--

Paige Miller

Paige Miller

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

@Rick_SAS might give you a hand.

Rick wrote a blog about it before, although you need SAS/IML module .

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

See Section 8.3 of *Simulating Data with SAS*.

You might also want to look at Section 12.3, especially p 232-235, for applications to mixed models. There is an errata for p. 233. The math is correct but the CS and R subscripts are reversed. The term \sigma^2_R on the diagonal is the residual covariance. The off-diagonal terms, which should be labeled as \sigma^2_{CS}, is the common covariance.

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Based on chapter 12.3.2 and implementing Higham's algorithm, I simulated the following:

```
/******************
* test simulation *
******************/
* step 1: create 5 random SNPS;
**********************************;
* in practice you retrieve 5 snps that are located nearby and you assemble them in a dataframe;
data Categories(keep=V1--V5);
call streaminit(4321);
array p[3] (0.4 0.2 0.4); /* probabilities */
do i = 1 to 100;
V1 = rand("Table", of p[*]); /* use OF operator */
V2 = rand("Table", of p[*]); /* use OF operator */
V3 = rand("Table", of p[*]); /* use OF operator */
V4 = rand("Table", of p[*]); /* use OF operator */
V5 = rand("Table", of p[*]); /* use OF operator */
output;
end;
run;
proc print data=Categories;
run;
data snps (keep= snp1--snp5);
set Categories;
snp1 = V1-1;
snp2 = V2-1;
snp3 = V3-1;
snp4 = V4-1;
snp5 = V5-1;
keep snp1--snp
run;
proc print data=snps;
run;
proc freq data=snps;
tables snp1 snp2 snp3 snp4 snp5/ nocum nopercent;
run;
* step2: create a kinship matrix, I will use the AR1 structure to get a quick sort of similarity matrix;
*******************************************************************************************;
proc iml;
/* return p x p matrix whose (i,j)th element is rho^|i - j| */
start AR1Corr(rho, n);
return rho##distance(T(1:n), T(1:n), "L1");
finish;
/* test on 100 x 100 matrix with rho = 0.8 */
rho = 0.8;
n = 100;
kinship = AR1Corr(rho, n);
print kinship[format=Best7.];
*export to dataset;
create dataK from kinship;
append from kinship;
close dataK;
quit;
* simulate a trait;
*******************;
* in practice read the kinship from csv file, create matrix from dataset;
proc iml;
* import kinship from data;
use work.dataK; /* open the data set */
read all var _ALL_ into K;
close work.dataK;
print(K);
* implement Higham's algorithm for computing the nearest correlation matrix to a given symmetric matrix;
* Ref: https://blogs.sas.com/content/iml/2012/11/28/computing-the-nearest-correlation-matrix.html ;
/* Project symmetric X onto S={positive semidefinite matrices}.
Replace any negative eigenvalues of X with zero */
start ProjS(X);
call eigen(D, Q, X); /* note X = Q*D*Q` */
V = choose(D>0, D, 0);
W = Q#sqrt(V`); /* form Q*diag(V)*Q` */
return( W*W` ); /* W*W` = Q*diag(V)*Q` */
finish;
/* project square X onto U={matrices with unit diagonal}.
Return X with the diagonal elements replaced by ones. */
start ProjU(X);
n = nrow(X);
Y = X;
diagIdx = do(1, n*n, n+1);
Y[diagIdx] = 1; /* set diagonal elements to 1 */
return ( Y );
finish;
/* Helper function: the infinity norm is the max abs value of the row sums */
start MatInfNorm(A);
return( max(abs(A[,+])) );
finish;
/* Given a symmetric correlation matrix, A,
project A onto the space of positive semidefinite matrices.
The function uses the algorithm of Higham (2002) to return
the matrix X that is closest to A in the Frobenius norm. */
start NearestCorr(A);
maxIter = 100; tol = 1e-8; /* parameters...you can change these */
iter = 1; maxd = 1; /* initial values */
Yold = A; Xold = A; dS = 0;
do while( (iter <= maxIter) & (maxd > tol) );
R = Yold - dS; /* dS is Dykstra's correction */
X = ProjS(R); /* project onto S={positive semidefinite} */
dS = X - R;
Y = ProjU(X); /* project onto U={matrices with unit diagonal} */
/* How much has X changed? (Eqn 4.1) */
dx = MatInfNorm(X-Xold) / MatInfNorm(X);
dy = MatInfNorm(Y-Yold) / MatInfNorm(Y);
dxy = MatInfNorm(Y - X) / MatInfNorm(Y);
maxd = max(dx,dy,dxy);
iter = iter + 1;
Xold = X; Yold = Y; /* update matrices */
end;
return( X ); /* X is positive semidefinite */
finish;
KK = NearestCorr(K);
/* for large matrices, might need to correct for numerical rounding errors */
epsilon = 1e-10;
KKK = ProjU( KK/(1+epsilon) ); /* divide off-diagonal elements by 1+epsilon */
* import snps from data;
use work.snps; /* open the data set */
read all var _ALL_ into GT;
close work.snps;
print(GT);
snp1 = GT[,1];
snp2 = GT[,1];
snp3 = GT[,1];
snp4 = GT[,1];
snp5 = GT[,1];
* set parms;
n=100; /* 100 subjects */
sigma2_g = 0.1; /* Parameter 1: residual covariance */
sigma2_e = 0.5;
beta = {1, 0.5, 0.7, 3, 0.7, 0.5}; /* ratio data, value of 1 for intercept*/
* varcov;
G = sigma2_g*KKK ;
R = sigma2_e*I(n);
varcov = G + R ;
* subjectIDs;
treeID = colvec(1:n);
*print(treeID);
* design matrix;
X = J(n,1,1)||GT ; /*col vector of 1 for intercept*/
* simulate trait;
call randseed(1234);
create Mix var {"treeID" "Trait1" "snp1" "snp2" "snp3" "snp4" "snp5"};
/* create dataset testsim */
* Note: if you are trying to simulate random multivariate normal data,
you must use a positive definite matrix.;
zero = j(1, n, 0); /* the zero vector */
eps = RandNormal(1, zero, varcov); /* eps ~ MVN(0,R) */
trait1 = X*beta + eps`; /* fixed effects, correlated errors */
append;
close;
quit;
* analysis;
**************;
proc sgplot data=Mix;
title "distribution parameters";
histogram trait1;
density trait1;
density trait1/type=kernel;
run;
* you clearly see a biodal distribution;
* analysis;
**********************************;
* the kinship data set must have in
- column 1 : the treeIDs
- column 2 : parm (all 1's)
- column 3 : values from 1 to n ;
data kin ;
retain treeID parm row ;
set datak;
treeID = _N_ ;
parm = 1 ;
row = _N_ ;
run;
Proc mixed data=Mix METHOD=REML noprofile covtest itdetails ;
class treeID ;
model trait1 = snp1 snp2 snp3 snp4 snp5 /solution ddfm=KR chisq;
random treeID / type=lin(1) ldata=work.kin ;
parms/ lowerb= 0 0;
ods output covParms=myParms;
run;
```

I do however not have the expected results. I can't see what I did wrong.

Despite the use of Higham's algorithm, I get the notes:

An infinite likelihood is assumed in iteration 1 because of a nonpositive definite

estimated V matrix for Subject 1.

NOTE: Convergence criteria met.

NOTE: Estimated G matrix is not positive definite.

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

When you are developing code, don't begin by using the full size problem (100 dimensions). Use a smaller example.

In your program:

1. You don't need NearestCorr (Higham's algorithm) because the AR1Corr function returns a positive definite kinship matrix.

2. I think your main problem is that you are assigning snp1 = snp2 = ... = snp5 = GT[,1]. If these are supposed to be covariates in the mixed model, they can't be equal or you have a singular design matrix. I think the ERROR is coming from this mistake. Also, shouldn't they be CLASS variables in PROC MIXED?

3. You are currently generating one random variate from MVN(0, varcov. I assume this should be 100 variates:

eps = RandNormal(100, zero, varCov).

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Thanks a lot Rick, this was really helpful, I found my mistakes.

In your program:

1. You don't need NearestCorr (Higham's algorithm) because the AR1Corr function returns a positive definite kinship matrix.

**Here it is not needed, but I might need it when I am using the true similarity matrix**

2. I think your main problem is that you are assigning snp1 = snp2 = ... = snp5 = GT[,1]. If these are supposed to be covariates in the mixed model, they can't be equal or you have a singular design matrix. I think the ERROR is coming from this mistake. Also, shouldn't they be CLASS variables in PROC MIXED?

**I overlooked this, it should be snp2=GT[,2], snp3=GT[,3],...Now, I get the expected results **

**I assume a dosage effect of the alleles, snp is either 0 (no allele effect, snp=1 effect of 1 allele, snp=2, effect of 2 alleles), hence they are not class variables.**

3. You are currently generating one random variate from MVN(0, varcov. I assume this should be 100 variates:eps = RandNormal(100, zero, varCov).

**I think eps = RandNormal(1, zero, R) is ok. Trait has dimension 100x1, X(100x6), beta(6x1), and thus eps(100x1). This is similar to the example 12.3.2.2. in your book on simulation.**

Are you ready for the spotlight? We're accepting content ideas for **SAS Innovate 2025** to be held May 6-9 in Orlando, FL. The call is **open **until September 25. Read more here about **why** you should contribute and **what is in it** for you!

What is ANOVA?

ANOVA, or Analysis Of Variance, is used to compare the averages or means of two or more populations to better understand how they differ. Watch this tutorial for more.

Find more tutorials on the SAS Users YouTube channel.