## Covariance vs. Correlation matrices for Simulations with RandNormal in PROC IML

My goal has been to take the correlation matrix from an existing (empirical) multivariate dataset and use this to generate a centered and standardized (mean=0, SD=1) simulated dataset. The code I use to do so is copied below.

I have been using the correlation matrix of my real data as the input to RANDNORMAL and when I do so my output dataset looks exactly as one would imagine, i.e. means around 0 and SD of 1, with same correlation structure as the original dataset.

However, I realize RandNormal is intended to accept a covariance matrix, not a correlation matrix, as its input. When I use the covariance matrix as input to RandNormal, I get some unexpected results: the standard deviations of my simulated variables now vary quite a bit, from 0.39 to 1.09, though my means still hover around 0 and the simulated correlation matrix is as expected.

My question is: why does the variability in my simulated data seem to increase when I use the covariance matrix, and how can I account for this? I am concerned that the data generated with the correlation matrix may yield unexpected linear dependencies.

Here is the code I use, which I obtained both from this forum and from The DO Loop blog (http://blogs.sas.com/content/iml/):

proc iml;
call randseed(4321);
/* specify population mean and covariance */
use simfin.covmat;          *  <------here I either use the correlation or covariance matrix. The cov matrix is poorly standardized.;
read all var _num_ into Cov[c=varNames]; /* save var names */
close simfin.covmat;        /* close the same data set that was opened by USE */
Mean = j(nrow(Cov),1,0); /* zero vector */

N = 500; /* sample size */
NumSamples = 1; /* number of samples/replicates */

X = RandNormal(N*NumSamples, Mean, Cov);
ID = colvec(repeat(T(1:NumSamples), 1, N)); /* 1,1,1,...,2,2,2,...,3,3,3,... */
Z = ID || X;
varNames = "ID" || varNames; /* concatenate "ID" to var names */
create MVN from Z[c=varNames];
append from Z;
close MVN;
quit;

1 ACCEPTED SOLUTION

## Re: Covariance vs. Correlation matrices for Simulations with RandNormal in PROC IML

> My question is why does variability in my simulated data seem to increase with the use of the covariance matrix, and how can I account for this? I am concerned that the data generated with the correlation matrix may yield unexpected linear dependencies.

The variability increases if the variances of the variables are greater than one (equivalently, when the diagonal elements of the covariance matrix are greater than one), and it decreases if they are less than one. Whether you get the same or different results in your simulation depends on the statistic that you are computing. Some statistics are "scale invariant," which means that the computation involves dividing by the variance of the variables. The sampling distribution for these statistics is the same no matter how you scale the data. Other statistics will change when the scale of the problem changes.

As an example, the following statements run a linear regression on the weight and height of students. If you measure weight in kilograms instead of pounds, you get different values for many statistics: sums of squares, MSE, and parameter estimates and their standard errors. However, there are other statistics that do not change: R-square, coefficient of variation, F statistics, t statistics, and p-values.

```sas
data class2;
set sashelp.class;
kilos = 0.45359237*weight;  /* convert from pounds to kilos */
run;

ods graphics off;
proc reg data=class2;
var weight height kilos;
pounds: model weight = height / covb;  /* response in pounds */
kilos:  model kilos = height / covb;   /* same response in kilograms */
run;
quit;                                  /* end the interactive PROC REG session */
```

So this fact doesn't really have anything to do with simulation. It is just a consequence of the fact that some statistics are scale-invariant whereas others aren't.
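To see why the regression example behaves this way, a short derivation may help. The notation here is not from the original answer: assume a simple regression of a response $y$ on $x$, and a rescaled response $y^{*} = c\,y$ with $c > 0$ (in the example, $c = 0.45359237$). The scale-dependent quantities pick up a factor of $c$ or $c^{2}$, and that factor cancels in the ratios:

$$
\hat\beta_1^{*} = c\,\hat\beta_1, \qquad
\mathrm{MSE}^{*} = c^{2}\,\mathrm{MSE}, \qquad
\mathrm{SE}(\hat\beta_1^{*}) = c\,\mathrm{SE}(\hat\beta_1)
$$

$$
t^{*} = \frac{\hat\beta_1^{*}}{\mathrm{SE}(\hat\beta_1^{*})} = \frac{c\,\hat\beta_1}{c\,\mathrm{SE}(\hat\beta_1)} = t, \qquad
R^{2*} = 1 - \frac{c^{2}\,\mathrm{SSE}}{c^{2}\,\mathrm{SST}} = R^{2}
$$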

Can using a covariance matrix affect the probability of generating data that are nearly linearly dependent? Mathematically, the answer is no, but numerically the answer is yes. This can occur if there are extreme scalings (several orders of magnitude) in certain directions. For example, in the following code the covariance matrix is scaled so that the random vectors it generates are almost in a 2-D linear subspace.

```sas
proc iml;
/* correlation matrix */
R = {1.00 0.25 0.90,
     0.25 1.00 0.50,
     0.90 0.50 1.00 };

/* convert to covariance:
   http://blogs.sas.com/content/iml/2010/12/10/converting-between-correlation-and-covariance-matrices.html
   Specify the standard deviation of each variable */
sd = {0.01  100  100};  /* standard deviations, since S = D*R*D requires SDs on the diagonal of D */
D = diag(sd);
S = D*R*D;              /* covariance matrix */

call randseed(4321);
mean = j(1, ncol(R), 0);
X = randnormal(5, mean, R);  /* use corr */
Y = randnormal(5, mean, S);  /* use cov */
print X, Y;
quit;
```

The correlation matrix is what is called "well-conditioned." The covariance matrix is "ill-conditioned," which means that it is numerically close to being a rank-2 matrix.
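One common way to quantify conditioning is the ratio of the largest to the smallest eigenvalue of a symmetric positive definite matrix. The following is a minimal sketch, not part of the original answer. It assumes the three scale values from the example are used as standard deviations, as the `D*R*D` formula requires, and the names `sd`, `evR`, `evS`, `condR`, and `condS` are introduced here for illustration only.

```sas
proc iml;
/* same correlation matrix as in the example */
R = {1.00 0.25 0.90,
     0.25 1.00 0.50,
     0.90 0.50 1.00};
sd = {0.01 100 100};          /* assumed standard deviations */
D = diag(sd);
S = D*R*D;                    /* covariance matrix */
S = (S + T(S)) / 2;           /* force exact symmetry before the eigenvalue call */

evR = eigval(R);              /* eigenvalues of the correlation matrix */
evS = eigval(S);              /* eigenvalues of the covariance matrix */
condR = max(evR) / min(evR);  /* condition number of R */
condS = max(evS) / min(evS);  /* condition number of S */
print condR condS;
quit;
```

A much larger value for `condS` than for `condR` would indicate that the covariance matrix is far closer to singular than the correlation matrix, even though both describe the same correlation structure.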

How can you account for this? The only way to make this issue vanish is to standardize the data, which is equivalent to using a correlation matrix to generate the data.
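As a rough illustration of that equivalence, the sketch below rescales a covariance matrix to a correlation matrix before simulating. The 2 x 2 matrix `S` is hypothetical and chosen only for illustration, and the names `sd` and `sampleSD` are not from the original post.

```sas
proc iml;
call randseed(4321);
/* hypothetical covariance matrix, for illustration only */
S = {4.0 1.2,
     1.2 1.0};
sd = sqrt(vecdiag(S));         /* column vector of standard deviations */
R = S / (sd * T(sd));          /* elementwise division gives the correlation matrix */

mean = j(1, ncol(R), 0);       /* zero mean vector */
X = randnormal(500, mean, R);  /* simulate on the standardized scale */
sampleSD = std(X);             /* sample SD of each column; should be near 1 */
print R, sampleSD;
quit;
```

Simulating from `S` instead of `R` would give columns whose standard deviations are near the square roots of the diagonal of `S`, with the same correlation between the columns.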

## Re: Covariance vs. Correlation matrices for Simulations with RandNormal in PROC IML

I'm not sure that this is a scaling issue per se, as the dissimilar properties of the simulated datasets are reproduced even if I standardize my dataset before calculating the correlation and covariance matrices from which I simulate.
