BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
tumul
Calcite | Level 5

Hello guys ! I am new to SAS, please help me figure this out.

I have a dataset that looks like this (name : simulation_data) :-

ID   col_A   col_B   col_C
101    2        80        20
102    4       40         22
103    5       20         25
104    3       90         20
105    8       10         22
.
.
.
.
100 data points


Now, On this data I have to do a simulation kind of thing where for each entry in 'col_c' (lets say first entry : 20) I have to select 20 random numbers from col_B and take the sum of those 20 selected enties.

 

So for all the 100 observations in 'col_C', col_C amount of entries have to be selected from 'col_B', sum of all those selected entries have to be taken and the number obtained in each interation has to be appended in a seperate dataset (or in this dataset itself).

 

I have tried the first step of the process (i.e. to take col_C number of elements from the simulation_data at random) :

 

data col_c_data;
    set simulation_data nobs = n;

 

    retain k col_C;

    if rand ("uniform") < k/n then do;
        output;
        k = k-1;
    end;
    n = n-1;


run;

 

This is a standard code to take "k" random entries from a variable with "n" observation
(code obtained from : https://sasnrd.com/sas-random-sampling-without-replacement/)
but col_c_data produced is an empty one and LOG says that "Missing values were generated as a result of performing an operation on missing values".

apart from the method I am familiar with "proc survey select"

1 ACCEPTED SOLUTION

Accepted Solutions
FreelanceReinh
Jade | Level 19

Hello @tumul,

 

The issue with your DATA step is that you replaced the constant initial value for variable k in the RETAIN statement with the variable name col_C. This changes the meaning of the statement. Now, variables k and col_C are retained (which is redundant for the latter because col_C is read by the SET statement and therefore automatically retained) and k does not receive an initial value, hence it's missing and all calculations involving k are bound to fail with the notorious "Missing values were generated ..." note.

 

To avoid this problem, consider this variant of your DATA step:

data samples(drop = k n);
call streaminit(27182818);
i = _n_;
set simulation_data nobs = m;
n = m;
k = col_C;
do p=1 to n;
  set simulation_data point = p;
  if rand ("uniform") < k/n then do;
    output;
    k = k-1;
  end;
  n = n-1;
end;
run;

It creates the random samples, identified by variable i=1, 2, ... 

 

A slight modification of the above DATA step creates the sums s of col_B values by i directly:

data sums(keep = i s);
call streaminit(27182818);
i = _n_;
set simulation_data(keep = col_C) nobs = m;
n = m;
k = col_C;
do p=1 to n;
  set simulation_data(keep = col_B) point = p;
  if rand ("uniform") < k/n then do;
    s = sum(s, col_B);
    k = k-1;
  end;
  n = n-1;
end;
run;

(The KEEP= dataset options are just to improve performance.)

 

View solution in original post

2 REPLIES 2
FreelanceReinh
Jade | Level 19

Hello @tumul,

 

The issue with your DATA step is that you replaced the constant initial value for variable k in the RETAIN statement with the variable name col_C. This changes the meaning of the statement. Now, variables k and col_C are retained (which is redundant for the latter because col_C is read by the SET statement and therefore automatically retained) and k does not receive an initial value, hence it's missing and all calculations involving k are bound to fail with the notorious "Missing values were generated ..." note.

 

To avoid this problem, consider this variant of your DATA step:

data samples(drop = k n);
call streaminit(27182818);
i = _n_;
set simulation_data nobs = m;
n = m;
k = col_C;
do p=1 to n;
  set simulation_data point = p;
  if rand ("uniform") < k/n then do;
    output;
    k = k-1;
  end;
  n = n-1;
end;
run;

It creates the random samples, identified by variable i=1, 2, ... 

 

A slight modification of the above DATA step creates the sums s of col_B values by i directly:

data sums(keep = i s);
call streaminit(27182818);
i = _n_;
set simulation_data(keep = col_C) nobs = m;
n = m;
k = col_C;
do p=1 to n;
  set simulation_data(keep = col_B) point = p;
  if rand ("uniform") < k/n then do;
    s = sum(s, col_B);
    k = k-1;
  end;
  n = n-1;
end;
run;

(The KEEP= dataset options are just to improve performance.)

 

tumul
Calcite | Level 5

It works ! now i understood my mistake...thank you for helping me out 😀