Hello guys ! I am new to SAS, please help me figure this out.
I have a dataset that looks like this (name : simulation_data) :-
ID col_A col_B col_C
101 2 80 20
102 4 40 22
103 5 20 25
104 3 90 20
105 8 10 22
.
.
.
.
100 data points
Now, On this data I have to do a simulation kind of thing where for each entry in 'col_c' (lets say first entry : 20) I have to select 20 random numbers from col_B and take the sum of those 20 selected enties.
So for all the 100 observations in 'col_C', col_C amount of entries have to be selected from 'col_B', sum of all those selected entries have to be taken and the number obtained in each interation has to be appended in a seperate dataset (or in this dataset itself).
I have tried the first step of the process (i.e. to take col_C number of elements from the simulation_data at random) :
data col_c_data;
set simulation_data nobs = n;
retain k col_C;
if rand ("uniform") < k/n then do;
output;
k = k-1;
end;
n = n-1;
run;
This is a standard code to take "k" random entries from a variable with "n" observation
(code obtained from : https://sasnrd.com/sas-random-sampling-without-replacement/)
but col_c_data produced is an empty one and LOG says that "Missing values were generated as a result of performing an operation on missing values".
apart from the method I am familiar with "proc survey select"
Hello @tumul,
The issue with your DATA step is that you replaced the constant initial value for variable k in the RETAIN statement with the variable name col_C. This changes the meaning of the statement. Now, variables k and col_C are retained (which is redundant for the latter because col_C is read by the SET statement and therefore automatically retained) and k does not receive an initial value, hence it's missing and all calculations involving k are bound to fail with the notorious "Missing values were generated ..." note.
To avoid this problem, consider this variant of your DATA step:
data samples(drop = k n);
call streaminit(27182818);
i = _n_;
set simulation_data nobs = m;
n = m;
k = col_C;
do p=1 to n;
set simulation_data point = p;
if rand ("uniform") < k/n then do;
output;
k = k-1;
end;
n = n-1;
end;
run;
It creates the random samples, identified by variable i=1, 2, ...
A slight modification of the above DATA step creates the sums s of col_B values by i directly:
data sums(keep = i s);
call streaminit(27182818);
i = _n_;
set simulation_data(keep = col_C) nobs = m;
n = m;
k = col_C;
do p=1 to n;
set simulation_data(keep = col_B) point = p;
if rand ("uniform") < k/n then do;
s = sum(s, col_B);
k = k-1;
end;
n = n-1;
end;
run;
(The KEEP= dataset options are just to improve performance.)
Hello @tumul,
The issue with your DATA step is that you replaced the constant initial value for variable k in the RETAIN statement with the variable name col_C. This changes the meaning of the statement. Now, variables k and col_C are retained (which is redundant for the latter because col_C is read by the SET statement and therefore automatically retained) and k does not receive an initial value, hence it's missing and all calculations involving k are bound to fail with the notorious "Missing values were generated ..." note.
To avoid this problem, consider this variant of your DATA step:
data samples(drop = k n);
call streaminit(27182818);
i = _n_;
set simulation_data nobs = m;
n = m;
k = col_C;
do p=1 to n;
set simulation_data point = p;
if rand ("uniform") < k/n then do;
output;
k = k-1;
end;
n = n-1;
end;
run;
It creates the random samples, identified by variable i=1, 2, ...
A slight modification of the above DATA step creates the sums s of col_B values by i directly:
data sums(keep = i s);
call streaminit(27182818);
i = _n_;
set simulation_data(keep = col_C) nobs = m;
n = m;
k = col_C;
do p=1 to n;
set simulation_data(keep = col_B) point = p;
if rand ("uniform") < k/n then do;
s = sum(s, col_B);
k = k-1;
end;
n = n-1;
end;
run;
(The KEEP= dataset options are just to improve performance.)
It works ! now i understood my mistake...thank you for helping me out 😀
Nearly 200 sessions are now available on demand in the Innovate Hub.
Watch Now →SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.
Find more tutorials on the SAS Users YouTube channel.