I want to simulate student attrition. I have a bunch of student currently enrolled, assume starting the 2nd semester, 10% of those who are still enrolled drop out each semester. So with 100 students, the number of students dropping out is 10, 9, 8, 7...
I want to create a variable (2,3,4...) indicating which semester the students drop out. This is the dataset I start with
data records;
do ID=1 to 100;
output;
end;
run;
And I want to get results like this below (no randomness at all)
I also need to apply this program to other scenarios where the attrition rate and number of students may differ. This is what I am doing and it works.
%let attr_rate=0.10;
data proj3;
set proj2 nobs=_nobs_; *_nobs_=number of rows;
if _N_<=round(_nobs_*&attr_rate) then last_sem=2;
else if _N_<=round(_nobs_*&attr_rate)+round(_nobs_*(1-&attr_rate)*&attr_rate) then last_sem=3;
else if _N_<=round(_nobs_*&attr_rate)+round(_nobs_*(1-&attr_rate)*&attr_rate)+round(_nobs_*(1-&attr_rate)**2*&attr_rate) then last_sem=4;
run;
But as you can see, the code gets really long as I cover more semesters. And I have up to 13 semesters to cover.
Is there a better and/or smarter way to achieve this?
Thank you very much!
Hello @TC_,
I would use the cumulative number of dropped out students:
%let r=0.1; /* attrition rate */
%let n=100; /* initial number of students */
%let sem1=2; /* first semester considered */
%let semL=13; /* last semester considered */
data want;
ID=1;
do sem=&sem1 to &semL;
do ID=ID to round((1-(1-&r)**(sem-&sem1+1))*&n);
output;
end;
end;
sem=.;
do ID=ID to &n; /* these never drop out */
output;
end;
run;
Very interesting question.
If I understood what you mean.
data records;
do ID=1 to 100;
output;
end;
run;
data temp;
drop=2;n=10; /*initial value*/
do until(n=0);
do i=1 to n;
output;
end;
n+(-1);
drop+1;
end;
drop i;
run;
data want;
merge records temp;
run;
proc freq data=want;
table drop;
run;
Thank you for responding, @Ksharp.
I guess I should have highlighted how I wanted to use the program at the beginning. I need to use the program in other scenarios where the attrition rate might differ. The example of 100 students with 10% dropout was given as an example to make things easier to follow. I might have, say, 57 students with 5% attrition rate; or 165 students with 7% attrition rate. What you provided is more like a way to generate a sequence of numbers with a specific frequency.
OK. You need to change the initial value. Here is an example:
/*165 students with 7% attrition rate*/
data records;
do ID=1 to 165; /*165 students */
output;
end;
run;
data temp;
drop=2;n=12; /* 12=ceil(165*0.07) */
do until(n=0);
do i=1 to n;
output;
end;
n+(-1);
drop+1;
end;
drop i;
run;
data want;
merge records temp;
run;
proc freq data=want;
table drop;
run;
Hello @TC_,
I would use the cumulative number of dropped out students:
%let r=0.1; /* attrition rate */
%let n=100; /* initial number of students */
%let sem1=2; /* first semester considered */
%let semL=13; /* last semester considered */
data want;
ID=1;
do sem=&sem1 to &semL;
do ID=ID to round((1-(1-&r)**(sem-&sem1+1))*&n);
output;
end;
end;
sem=.;
do ID=ID to &n; /* these never drop out */
output;
end;
run;
Thank you very much, @FreelanceReinh. This solution is neat!
A follow-up question. If I want to generate the IDs first in one Data step, and assign sem in another Data step, is there a way to do so with a DO Loop? In other words, if I already have a dataset of students' info and want to assign a dropout semester to them, what's the best way to do so with a DO Loop?
A workaround I have is that I can use your code to create a new dataset, randomly sort the students' info dataset with rng, and merge the two without a BY variable.
Thank you!
@TC_ wrote:
Thank you very much, @FreelanceReinh. This solution is neat!
A follow-up question. If I want to generate the IDs first in one Data step, and assign sem in another Data step, is there a way to do so with a DO Loop? In other words, if I already have a dataset of students' info and want to assign a dropout semester to them, what's the best way to do so with a DO Loop?
A workaround I have is that I can use your code to create a new dataset, randomly sort the students' info dataset with rng, and merge the two without a BY variable.
Thank you!
You're welcome. One possible solution is to adapt the code as follows:
/* Create sample data for demonstration */ data have; id=5; do _n_=1 to 100; id=mod(67*id, 101); other_info=ranuni(1); output; end; run; /* Assign hypothetical dropout semester */ %let r=0.1; /* attrition rate */ %let sem1=2; /* first semester considered */ %let semL=13; /* last semester considered */ data want; if 0 then set have nobs=n; do sem=&sem1 to &semL; do _n_=_n_ to round((1-(1-&r)**(sem-&sem1+1))*n); link rw; end; end; sem=.; do _n_=_n_ to n; /* these never drop out */ link rw; end; rw: set have; output; run;
I interpreted your question a bit differently than @Ksharp and @FreelanceReinh (...which is daring...).
I believe the formula to calculate the survival rate for a student after n semesters with a drop out probability p of 0.1 per semesters would be: S(n) = (1-p)n
I couldn't figure out a way that doesn't require a loop for calculating the drop out semester. Soo.... here I go:
%let n_semesters=13;
%let drop_probability=0.1;
data have;
do semester=1 to &n_semesters;
do i=1 to 10000;
student_id=cats(semester,'_',i);
output;
end;
end;
drop i;
run;
data want;
set have;
if _n_=1 then call streaminit(1230497);
drop_out_flg=0;
do i=1 to semester;
if rand('uniform')<=&drop_probability then
do;
drop_out_flg=1;
drop_out_semester=i;
leave;
end;
end;
run;
%let sv_missing=%sysfunc(getoption(missing,keyword));
options missing=' ';
title 'drop/no drop per semester';
proc tabulate data=want noseps missing;
keylabel n=' ';
class drop_out_flg drop_out_semester;
table drop_out_semester all='Total', drop_out_flg;
run;
title;
options &sv_missing;
/* as a check: calculate survival for a student over all remaining semesters */
%let sv_missing=%sysfunc(getoption(missing,keyword));
options missing=' ';
data survival;
set have;
if _n_=1 then call streaminit(1230497);
drop_out_flg=0;
if rand('uniform')>(1-&drop_probability)**semester then drop_out_flg=1;
run;
title 'survival over all semesters';
proc tabulate data=survival noseps missing;
keylabel n=' ';
class drop_out_flg;
table drop_out_flg;
run;
title;
Thank you, @Patrick. I wanted to use random assignment as well, but my supervisor explicitly said not to do so at this point. Anyway, I like your random assignment. At this point, this is not what I wanted, but it will surely be helpful to keep in mind for my future work.
Thank you!
Should be able to use the NOBS= option of the SET statement to know how many to put into each group.
%let rate=0.10 ;
data want;
retain obs 1 semester 2;
do obs=obs to obs+int(&rate*(nobs-obs));
set records nobs=nobs;
output;
end;
semester+1;
run;
Results:
Thank you, @Tom! This solution also gets the job done and is quite concise. I really wish I could accept two solutions.
Read the entire dataset, using the NOBS= parameter with the drop rate to generate the target drop count for the initial LAST_SEM value. Then as the target for each semester is reached, update the target through the final eligible LAST_SEM, after which LAST_SEM is set to missing.
data have;
do id=1 to 100; output; end;
run;
%let rate=.1;
%let beg_drop=2;
%let end_drop=10;
data want;
retain last_sem &beg_drop _target 0 ;
set have nobs=nrecs;
if _n_=1 then _target=round(&rate * nrecs);
output;
if _n_= _target then do;
last_sem+1;
_target + round(&rate * (nrecs-_n_));
if last_sem>&end_drop then call missing(last_sem,_target);
end;
run;
Of course, this assumes that the nobs of HAVE and the dropout rate will generate at least one dropout in the first eligible time period. If that is a possible issue, then:
data want;
retain last_sem &beg_drop _target 0 ;
set have nobs=nrecs;
if _n_=1 then do;
_target=round(&rate * nrecs);
if _target=0 then call missing(last_sem,_target);
end;
output;
if _n_= _target then do;
last_sem+1;
_target + round(&rate * (nrecs-_n_));
if last_sem>&end_drop then call missing(last_sem,_target);
end;
run;
Catch the best of SAS Innovate 2025 — anytime, anywhere. Stream powerful keynotes, real-world demos, and game-changing insights from the world’s leading data and AI minds.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.