Solved: Missing values produced in simulation

tumul · Posted 07-08-2021 03:09 AM

Hello guys ! I am new to SAS, please help me figure this out.

I have a dataset that looks like this (name : simulation_data) :-

ID col_A col_B col_C
101 2 80 20
102 4 40 22
103 5 20 25
104 3 90 20
105 8 10 22
.
.
.
.
100 data points

Now, On this data I have to do a simulation kind of thing where for each entry in 'col_c' (lets say first entry : 20) I have to select 20 random numbers from col_B and take the sum of those 20 selected enties.

So for all the 100 observations in 'col_C', col_C amount of entries have to be selected from 'col_B', sum of all those selected entries have to be taken and the number obtained in each interation has to be appended in a seperate dataset (or in this dataset itself).

I have tried the first step of the process (i.e. to take col_C number of elements from the simulation_data at random) :

data col_c_data;
set simulation_data nobs = n;

retain k col_C;

if rand ("uniform") < k/n then do;
output;
k = k-1;
end;
n = n-1;

run;

This is a standard code to take "k" random entries from a variable with "n" observation
(code obtained from : https://sasnrd.com/sas-random-sampling-without-replacement/)
but col_c_data produced is an empty one and LOG says that "Missing values were generated as a result of performing an operation on missing values".

apart from the method I am familiar with "proc survey select"

FreelanceReinh · Posted 07-08-2021 04:55 AM

Hello @tumul,

The issue with your DATA step is that you replaced the constant initial value for variable k in the RETAIN statement with the variable name col_C. This changes the meaning of the statement. Now, variables k and col_C are retained (which is redundant for the latter because col_C is read by the SET statement and therefore automatically retained) and k does not receive an initial value, hence it's missing and all calculations involving k are bound to fail with the notorious "Missing values were generated ..." note.

To avoid this problem, consider this variant of your DATA step:

data samples(drop = k n);
call streaminit(27182818);
i = _n_;
set simulation_data nobs = m;
n = m;
k = col_C;
do p=1 to n;
  set simulation_data point = p;
  if rand ("uniform") < k/n then do;
    output;
    k = k-1;
  end;
  n = n-1;
end;
run;

It creates the random samples, identified by variable i=1, 2, ...

A slight modification of the above DATA step creates the sums s of col_B values by i directly:

data sums(keep = i s);
call streaminit(27182818);
i = _n_;
set simulation_data(keep = col_C) nobs = m;
n = m;
k = col_C;
do p=1 to n;
  set simulation_data(keep = col_B) point = p;
  if rand ("uniform") < k/n then do;
    s = sum(s, col_B);
    k = k-1;
  end;
  n = n-1;
end;
run;

(The KEEP= dataset options are just to improve performance.)

View solution in original post

FreelanceReinh · Posted 07-08-2021 04:55 AM

Hello @tumul,

The issue with your DATA step is that you replaced the constant initial value for variable k in the RETAIN statement with the variable name col_C. This changes the meaning of the statement. Now, variables k and col_C are retained (which is redundant for the latter because col_C is read by the SET statement and therefore automatically retained) and k does not receive an initial value, hence it's missing and all calculations involving k are bound to fail with the notorious "Missing values were generated ..." note.

To avoid this problem, consider this variant of your DATA step:

data samples(drop = k n);
call streaminit(27182818);
i = _n_;
set simulation_data nobs = m;
n = m;
k = col_C;
do p=1 to n;
  set simulation_data point = p;
  if rand ("uniform") < k/n then do;
    output;
    k = k-1;
  end;
  n = n-1;
end;
run;

It creates the random samples, identified by variable i=1, 2, ...

A slight modification of the above DATA step creates the sums s of col_B values by i directly:

data sums(keep = i s);
call streaminit(27182818);
i = _n_;
set simulation_data(keep = col_C) nobs = m;
n = m;
k = col_C;
do p=1 to n;
  set simulation_data(keep = col_B) point = p;
  if rand ("uniform") < k/n then do;
    s = sum(s, col_B);
    k = k-1;
  end;
  n = n-1;
end;
run;

(The KEEP= dataset options are just to improve performance.)

tumul · Posted 07-08-2021 05:15 AM

It works ! now i understood my mistake...thank you for helping me out 😀

Missing values produced in simulation

Re: Missing values produced in simulation

Re: Missing values produced in simulation

Re: Missing values produced in simulation

Missing values produced in simulation

Re: Missing values produced in simulation

Re: Missing values produced in simulation

Re: Missing values produced in simulation

Ready to join fellow brilliant minds for the SAS Hackathon?