pcur
Fluorite | Level 6

My goal has been to take the correlation matrix from an existing (empirical) multivariate dataset and use this to generate a centered and standardized (mean=0, SD=1) simulated dataset. The code I use to do so is copied below.

 

I have been using the correlation matrix of my real data as the input to RANDNORMAL, and when I do so my output dataset looks exactly as one would expect, i.e., means around 0 and SDs of 1, with the same correlation structure as the original dataset.

 

However, I realize RANDNORMAL was originally intended to accept the covariance matrix, not the correlation matrix, as its input. When I use the covariance matrix as input to RANDNORMAL, I find some unexpected results: the standard deviations of my simulated variables now vary quite a bit, from 0.39 to 1.09, though my means still hover around 0 and the simulated correlation matrix is as expected.

 

My question is: why does the variability in my simulated data seem to increase when I use the covariance matrix, and how can I account for this? I am concerned that the data generated with the correlation matrix may yield unexpected linear dependencies.

 

 

 

Here is the code I use, which I obtained both from this forum and from The DO Loop blog (http://blogs.sas.com/content/iml/):

 

proc iml;
call randseed(4321);
/* specify population mean and covariance */
use simfin.covmat;          *  <------here I either use the correlation or covariance matrix. The cov matrix is poorly standardized.;
read all var _num_ into Cov[c=varNames]; /* save var names */
close simfin.covmat;
Mean = j(nrow(Cov),1,0); /* zero vector */

N = 500; /* sample size */
NumSamples = 1; /* number of samples/replicates */

X = RandNormal(N*NumSamples, Mean, Cov);
ID = colvec(repeat(T(1:NumSamples), 1, N)); /* 1,1,1,...,2,2,2,...,3,3,3,... */
Z = ID || X;
varNames = "ID" || varNames; /* comncatenate "ID" to var names */
create MVN from Z[c=varNames];
append from Z;
close MVN;
quit;
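
For completeness, a quick way to verify that the simulated dataset has the intended means, SDs, and correlation structure is a PROC CORR step like the following (a minimal sketch; MVN is the dataset created above, and you can list the simulated variables in a VAR statement to exclude the ID column):

proc corr data=MVN noprob;  /* simple statistics (mean, SD) plus the correlation matrix */
run;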

 

1 ACCEPTED SOLUTION

Rick_SAS
SAS Super FREQ

> My question is why does variability in my simulated data seem to increase with the use of the covariance matrix,

> and how can I account for this? I am concerned that the data generated with the correlation matrix  may

> yield unexpected linear dependencies. 

 

 

The variability increases if the variances of the variables are greater than one. (Equivalently, when the diagonal elements of the covariance matrix are greater than one.) Whether you get the same or different results in your simulation depends on the statistic that you are computing. Some statistics are "scale invariant," which means that the computation involves dividing by the variance of the variables. The sampling distribution for these statistics is the same no matter how you scale the data. Other statistics will change when the scale of the problem changes.

 

As an example, the following statements run a linear regression on the weight and height of students. If you measure weight in kilograms instead of pounds, you get different values for many statistics: sums of squares, MSE, and parameter estimates and their standard errors. However, there are other statistics that do not change: R-square, the coefficient of variation, F statistics, t statistics, and p-values.

 

data class2;
set sashelp.class;
kilos = 0.45359237*weight;  /* convert from pounds to kilos */
run;

ods graphics off;
proc reg data=class2;
var weight height kilos;
pounds: model weight = height / covb;
kilos:  model kilos = height / covb;
run;
quit;

So this fact doesn't really have anything to do with simulation; it's just a result of the fact that some statistics are scale-invariant whereas others aren't.
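
To see the same effect directly on RANDNORMAL output, here is a minimal sketch with an arbitrary two-variable covariance matrix (variances 0.16 and 1.21, chosen only for illustration). The sample standard deviations track the square roots of the diagonal elements of the matrix you pass in; with a correlation matrix as input, both would be close to 1:

proc iml;
call randseed(4321);
Cov = {0.16 0.10,                  /* population variances 0.16 and 1.21, */
       0.10 1.21};                 /* so population SDs are 0.4 and 1.1   */
Mean = {0 0};
X = randnormal(500, Mean, Cov);    /* simulate 500 observations */
sampleSD = sqrt(vecdiag(cov(X)));  /* sample standard deviations */
popSD    = sqrt(vecdiag(Cov));     /* population standard deviations */
print (sampleSD`)[label="Sample SD"] (popSD`)[label="Population SD"];
quit;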

 

Can using a covariance matrix affect the probability of generating data that is nearly linearly dependent? Mathematically, the answer is no, but numerically the answer is yes. This can occur if there are extreme scalings (several orders of magnitude) in certain directions. For example, in the following code the covariance matrix is scaled so that the random vectors it generates lie almost in a 2-D linear subspace.

 

proc iml;
/* correlation */
R = {1.00 0.25 0.90,
     0.25 1.00 0.50,
     0.90 0.50 1.00 };

/* convert to covariance:
   http://blogs.sas.com/content/iml/2010/12/10/converting-between-correlation-and-covariance-matrices.html
   Specify the standard deviations of each variable */
sd = {0.01  100  100};  /* standard deviations */
D = diag(sd);
S = D*R*D;              /* covariance matrix */

call randseed(4321);
mean = j(1, ncol(R), 0); 
X = randnormal(5, mean, R);  /* use corr */
Y = randnormal(5, mean, S);  /* use cov */
print X, Y;

The correlation matrix is what we call "well-conditioned." The covariance matrix is "ill-conditioned," which means that it is numerically close to being a rank-2 matrix.
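
Continuing in the same PROC IML session, one way to quantify the conditioning is to compare the eigenvalues of the two matrices (a sketch; for a symmetric positive-definite matrix the condition number is the ratio of the largest to the smallest eigenvalue):

/* compare the conditioning of the correlation and covariance matrices */
evalR = eigval(R);
evalS = eigval(S);
condR = max(evalR) / min(evalR);   /* condition number of R (small) */
condS = max(evalS) / min(evalS);   /* condition number of S (huge)  */
print condR condS;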

 

How can you account for this? The only way to make this issue vanish is to standardize the data, which is equivalent to using a correlation matrix to generate the data.
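
For the code in the question, that amounts to rescaling the covariance matrix to a correlation matrix before calling RANDNORMAL. Here is a minimal sketch (it assumes the COV2CORR function is available in your SAS/IML release; the manual rescaling shown in the comment is equivalent, and the output dataset name MVN2 is arbitrary):

proc iml;
call randseed(4321);
use simfin.covmat;
read all var _num_ into Cov[c=varNames];
close simfin.covmat;

R = cov2corr(Cov);              /* rescale the covariance matrix to a correlation matrix */
/* equivalent manual rescaling:
   sd = sqrt(vecdiag(Cov));  R = Cov / (sd*sd`); */
Mean = j(1, ncol(R), 0);        /* zero mean vector */
X = randnormal(500, Mean, R);   /* simulated data: means near 0, SDs near 1 */

create MVN2 from X[c=varNames];
append from X;
close MVN2;
quit;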



pcur
Fluorite | Level 6

Thanks, I appreciate your helpful comments. 

 

I'm not sure that this is a scaling issue per se, as the dissimilar properties of the simulated datasets are reproduced if I standardize my dataset before calculating the correlation and covariance matrices from which I simulate.

 

 

 
