## Dataset creation

I tried to create a dataset of 200 individuals with four variables: y, x1, x2, and x3. The value of y is predicted from x1, x2, and x3, and x1, x2, and x3 are correlated with one another. I wrote the following SAS code, but I got error messages and could not create the dataset. Could you please help me see what went wrong? Thank you in advance!

```
/* Set the number of individuals */
%let num_individuals = 200;

/* Set the correlation matrix */
%let correlation_matrix = 1, 0.5, 0.3,
0.5, 1, 0.2,
0.3, 0.2, 1;

/* Create the dataset */
data my_dataset;
array x x1-x3;
call streaminit(12345); /* Set the seed for random number generation */

/* Generate correlated values for x1, x2, and x3 */

do i = 1 to &num_individuals;
x = rand("Multinormal", 0, &_correlation_matrix); /* Generate correlated values */
x1 = x;
x2 = x;
x3 = x;

/* Calculate the value of y using x1, x2, and x3 */
y = 2 * x1 + 3 * x2 - 4 * x3 + rand("Normal", 0, 0.5); /* Add some random noise to the prediction */

output; /* Output the current observation */
end;

keep y x1 x2 x3; /* Keep only the specified variables */
run;

/* Print the dataset */
proc print data=my_dataset;
run;
```

I got the following error message:

```
214 %let num_individuals = 200;
215
216 /* Set the correlation matrix */
217 %let correlation_matrix = 1, 0.5, 0.3,
218 0.5, 1, 0.2,
219 0.3, 0.2, 1;
220
221 /* Create the dataset */
222 data my_dataset;
223 array x x1-x3;
224 call streaminit(12345); /* Set the seed for random number generation */
225
226 /* Generate correlated values for x1, x2, and x3 */
227 do i = 1 to &num_individuals;
228 x = rand("Multinormal", 0, &_correlation_matrix); /* Generate correlated values */
-
22
WARNING: Apparent symbolic reference _CORRELATION_MATRIX not resolved.
ERROR: Illegal reference to the array x.
ERROR 22-322: Syntax error, expecting one of the following: a name, a quoted string,
a numeric constant, a datetime constant, a missing value, INPUT, PUT.

229 x1 = x;
230 x2 = x;
231 x3 = x;
232
233 /* Calculate the value of y using x1, x2, and x3 */
234 y = 2 * x1 + 3 * x2 - 4 * x3 + rand("Normal", 0, 0.5); /* Add some random noise to the
234! prediction */
235
236 output; /* Output the current observation */
237 end;
238 keep y x1 x2 x3; /* Keep only the specified variables */
239 run;

NOTE: The SAS System stopped processing this step because of errors.
WARNING: The data set WORK.MY_DATASET may be incomplete. When this step was stopped there were
0 observations and 4 variables.
WARNING: Data set WORK.MY_DATASET was not replaced because this step was stopped.
```

Accepted Solutions

## Re: Dataset creation

If you don't have access to IML, you can use this technique (Simulate multivariate normal data in SAS by using PROC SIMNORMAL) described by Rick Wicklin. It uses a DATA step plus PROC SIMNORMAL to generate a multivariate normal distribution for the independent variables. In your case it would be something like:

```
/* TYPE=CORR data set: means, standard deviations, N, and the correlation matrix */
data havecorr (type=CORR);
input _TYPE_ $4.  @7 _NAME_ $4.  @10 x1 x2 x3 ;
datalines;
MEAN       0    0    0
STD        1    1    1
N          200 200 200
CORR   X1  1    0.5  0.3
CORR   X2  0.5  1    0.2
CORR   X3  0.3  0.2  1
;
run;

proc simnormal data=havecorr outsim=SimMVN
numreal = 200           /* number of realizations = size of sample */
seed = 12345  ;         /* random number seed */
var x1-x3;
run;
```

Then, from dataset SimMVN, you can simulate Y from the generated X values.
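As a minimal sketch of that last step, assuming the SimMVN dataset produced by PROC SIMNORMAL above, the coefficients and noise standard deviation (0.5) are taken from the original question, and the output dataset name `my_dataset` is reused from there:

```sas
/* Sketch: add y to the simulated x1-x3 values in SimMVN.          */
/* Coefficients and noise SD come from the question's DATA step.   */
data my_dataset;
   /* Seed the RAND stream once, before the first RAND call */
   if _n_ = 1 then call streaminit(12345);
   set SimMVN;
   /* Linear predictor plus normally distributed noise (SD = 0.5) */
   y = 2 * x1 + 3 * x2 - 4 * x3 + rand("Normal", 0, 0.5);
   keep y x1 x2 x3;
run;

proc print data=my_dataset (obs=10);
run;
```

The KEEP statement drops the realization counter that PROC SIMNORMAL writes to the OUTSIM= dataset, so only the four requested variables remain.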

Note that, per Rick's comment, you can also generate the HAVECORR dataset directly from existing correlated data (using, say, PROC CORR).
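For illustration only, here is roughly what that could look like. The input dataset name `have` is hypothetical; it stands for any existing dataset that already contains correlated x1, x2, and x3:

```sas
/* Sketch: build the TYPE=CORR dataset from existing data.          */
/* "have" is a hypothetical dataset containing x1, x2, and x3.      */
proc corr data=have outp=havecorr noprint;
   var x1-x3;
run;

/* Inspect the MEAN, STD, N, and CORR rows that were written */
proc print data=havecorr;
run;
```

The OUTP= dataset has the same layout as the hand-typed HAVECORR above, so it can be passed to PROC SIMNORMAL in the same way.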

## Re: Dataset creation

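Hi, I've not debugged the rest of the code, but it looks like you've got a typo when referencing the correlation matrix: the %let statement defines correlation_matrix, while the DATA step references &_correlation_matrix. The two names need to match, for example by adding an underscore in your %let statement.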
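A quick way to check for this kind of mismatch is to write the macro variable to the log with %PUT before using it. This is only a small sketch using the macro variable name from the question:

```sas
/* Define the macro variable, then confirm in the log how it resolves */
%let correlation_matrix = 1, 0.5, 0.3, 0.5, 1, 0.2, 0.3, 0.2, 1;
%put correlation_matrix resolves to: &correlation_matrix;

/* List every user-defined macro variable and its value */
%put _user_;
```

If the name in the %PUT statement does not match a defined macro variable, the log shows the same "Apparent symbolic reference ... not resolved" warning as in the question.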

## Re: Dataset creation

Thanks! After I changed the let statement to '%let _correlation_matrix ', I still got the error message:

"ERROR: Illegal reference to the array x."

```
1 /* Set the number of individuals */
2 %let num_individuals = 200;
3
4
5
6 /* Set the correlation matrix */
7 %let _correlation_matrix = 1, 0.5, 0.3,
8 0.5, 1, 0.2,
9 0.3, 0.2, 1;
10
11 /* Create the dataset */
12 data my_dataset;
13 array x x1-x3;
14 call streaminit(12345); /* Set the seed for random number generation */
15
16 /* Generate correlated values for x1, x2, and x3 */
17
18
19 do i = 1 to &num_individuals;
20 x = rand("Multinormal", 0, &_correlation_matrix); /* Generate correlated values */
ERROR: Illegal reference to the array x.
21 x1 = x;
22 x2 = x;
23 x3 = x;
24
25 /* Calculate the value of y using x1, x2, and x3 */
26 y = 2 * x1 + 3 * x2 - 4 * x3 + rand("Normal", 0, 0.5); /* Add some random noise to the
26 ! prediction */
27
28 output; /* Output the current observation */
29 end;
30
31
32 keep y x1 x2 x3; /* Keep only the specified variables */
33 run;

NOTE: The SAS System stopped processing this step because of errors.
WARNING: The data set WORK.MY_DATASET may be incomplete. When this step was stopped there were
0 observations and 4 variables.
```

## Re: Dataset creation

```20 x = rand("Multinormal", 0, &_correlation_matrix); /* Generate correlated values */
ERROR: Illegal reference to the array x.```

I'm not sure where you got this syntax from, but a search of the documentation for SAS does not turn up a random number generator that has the distribution "Multinormal". There is the RANDNORMAL function in PROC IML, if that would be of help to you.

--
Paige Miller
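If SAS/IML is available, a rough sketch of the RANDNORMAL approach might look like the following. The correlation matrix, coefficients, and noise standard deviation are taken from the question; the matrix names (`mean`, `corr`, `eps`, `out`) are just illustrative choices:

```sas
proc iml;
   call randseed(12345);                 /* seed for the IML random number functions */

   mean = {0 0 0};                       /* means of x1, x2, x3 */
   corr = {1   0.5 0.3,
           0.5 1   0.2,
           0.3 0.2 1};                   /* unit variances, so this is also the covariance */

   x = randnormal(200, mean, corr);      /* 200 x 3 matrix of correlated normal values */

   eps = j(200, 1);                      /* allocate a 200 x 1 vector for the noise */
   call randgen(eps, "Normal", 0, 0.5);  /* fill it with N(0, SD = 0.5) draws */

   y = 2 * x[,1] + 3 * x[,2] - 4 * x[,3] + eps;

   /* Write y and the three x columns to a SAS dataset */
   out = y || x;
   create my_dataset from out[colname={"y" "x1" "x2" "x3"}];
   append from out;
   close my_dataset;
quit;
```

Because the standard deviations are all 1 here, the correlation matrix can be passed directly as the covariance argument; with other standard deviations it would need to be converted to a covariance matrix first.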

## Re: Dataset creation

I think you're mixing IML and data step code.

Also, you have an array named X and you assign to X as if it were an ordinary variable, which isn't going to work.
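To illustrate the array point: in a DATA step, an array element has to be referenced with a subscript, one element at a time. The sketch below uses a hypothetical dataset name `demo` and fills x1 to x3 with independent (not correlated) normal values, purely to show the syntax:

```sas
/* Sketch: legal array references in a DATA step.                  */
/* Note: these draws are independent, so x1-x3 are NOT correlated. */
data demo;
   array x[3] x1-x3;
   call streaminit(12345);
   do i = 1 to 5;
      do j = 1 to dim(x);
         x[j] = rand("Normal");   /* subscripted reference to one element */
      end;
      output;
   end;
   drop i j;
run;
```

Writing `x = ...` with no subscript is what triggers "ERROR: Illegal reference to the array x."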

## Re: Dataset creation

Thank you all for your help. The problem has been solved.
