BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
tumul
Calcite | Level 5

Hello guys ! I am new to SAS, please help me figure this out.

I have a dataset that looks like this (name : simulation_data) :-

ID   col_A   col_B   col_C
101    2        80        20
102    4       40         22
103    5       20         25
104    3       90         20
105    8       10         22
.
.
.
.
100 data points


Now, On this data I have to do a simulation kind of thing where for each entry in 'col_c' (lets say first entry : 20) I have to select 20 random numbers from col_B and take the sum of those 20 selected enties.

 

So for all the 100 observations in 'col_C', col_C amount of entries have to be selected from 'col_B', sum of all those selected entries have to be taken and the number obtained in each interation has to be appended in a seperate dataset (or in this dataset itself).

 

I have tried the first step of the process (i.e. to take col_C number of elements from the simulation_data at random) :

 

data col_c_data;
    set simulation_data nobs = n;

 

    retain k col_C;

    if rand ("uniform") < k/n then do;
        output;
        k = k-1;
    end;
    n = n-1;


run;

 

This is a standard code to take "k" random entries from a variable with "n" observation
(code obtained from : https://sasnrd.com/sas-random-sampling-without-replacement/)
but col_c_data produced is an empty one and LOG says that "Missing values were generated as a result of performing an operation on missing values".

apart from the method I am familiar with "proc survey select"

1 ACCEPTED SOLUTION

Accepted Solutions
FreelanceReinh
Jade | Level 19

Hello @tumul,

 

The issue with your DATA step is that you replaced the constant initial value for variable k in the RETAIN statement with the variable name col_C. This changes the meaning of the statement. Now, variables k and col_C are retained (which is redundant for the latter because col_C is read by the SET statement and therefore automatically retained) and k does not receive an initial value, hence it's missing and all calculations involving k are bound to fail with the notorious "Missing values were generated ..." note.

 

To avoid this problem, consider this variant of your DATA step:

data samples(drop = k n);
call streaminit(27182818);
i = _n_;
set simulation_data nobs = m;
n = m;
k = col_C;
do p=1 to n;
  set simulation_data point = p;
  if rand ("uniform") < k/n then do;
    output;
    k = k-1;
  end;
  n = n-1;
end;
run;

It creates the random samples, identified by variable i=1, 2, ... 

 

A slight modification of the above DATA step creates the sums s of col_B values by i directly:

data sums(keep = i s);
call streaminit(27182818);
i = _n_;
set simulation_data(keep = col_C) nobs = m;
n = m;
k = col_C;
do p=1 to n;
  set simulation_data(keep = col_B) point = p;
  if rand ("uniform") < k/n then do;
    s = sum(s, col_B);
    k = k-1;
  end;
  n = n-1;
end;
run;

(The KEEP= dataset options are just to improve performance.)

 

View solution in original post

2 REPLIES 2
FreelanceReinh
Jade | Level 19

Hello @tumul,

 

The issue with your DATA step is that you replaced the constant initial value for variable k in the RETAIN statement with the variable name col_C. This changes the meaning of the statement. Now, variables k and col_C are retained (which is redundant for the latter because col_C is read by the SET statement and therefore automatically retained) and k does not receive an initial value, hence it's missing and all calculations involving k are bound to fail with the notorious "Missing values were generated ..." note.

 

To avoid this problem, consider this variant of your DATA step:

data samples(drop = k n);
call streaminit(27182818);
i = _n_;
set simulation_data nobs = m;
n = m;
k = col_C;
do p=1 to n;
  set simulation_data point = p;
  if rand ("uniform") < k/n then do;
    output;
    k = k-1;
  end;
  n = n-1;
end;
run;

It creates the random samples, identified by variable i=1, 2, ... 

 

A slight modification of the above DATA step creates the sums s of col_B values by i directly:

data sums(keep = i s);
call streaminit(27182818);
i = _n_;
set simulation_data(keep = col_C) nobs = m;
n = m;
k = col_C;
do p=1 to n;
  set simulation_data(keep = col_B) point = p;
  if rand ("uniform") < k/n then do;
    s = sum(s, col_B);
    k = k-1;
  end;
  n = n-1;
end;
run;

(The KEEP= dataset options are just to improve performance.)

 

tumul
Calcite | Level 5

It works ! now i understood my mistake...thank you for helping me out 😀

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
Mastering the WHERE Clause in PROC SQL

SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 2 replies
  • 748 views
  • 0 likes
  • 2 in conversation