I want to simulate student attrition. I have a bunch of student currently enrolled, assume starting the 2nd semester, 10% of those who are still enrolled drop out each semester. So with 100 students, the number of students dropping out is 10, 9, 8, 7...
I want to create a variable (2,3,4...) indicating which semester the students drop out. This is the dataset I start with
data records;
do ID=1 to 100;
output;
end;
run;
And I want to get results like this below (no randomness at all)
I also need to apply this program to other scenarios where the attrition rate and number of students may differ. This is what I am doing and it works.
%let attr_rate=0.10;
data proj3;
set proj2 nobs=_nobs_; *_nobs_=number of rows;
if _N_<=round(_nobs_*&attr_rate) then last_sem=2;
else if _N_<=round(_nobs_*&attr_rate)+round(_nobs_*(1-&attr_rate)*&attr_rate) then last_sem=3;
else if _N_<=round(_nobs_*&attr_rate)+round(_nobs_*(1-&attr_rate)*&attr_rate)+round(_nobs_*(1-&attr_rate)**2*&attr_rate) then last_sem=4;
run;
But as you can see, the code gets really long as I cover more semesters. And I have up to 13 semesters to cover.
Is there a better and/or smarter way to achieve this?
Thank you very much!
Hello @TC_,
I would use the cumulative number of dropped out students:
%let r=0.1; /* attrition rate */
%let n=100; /* initial number of students */
%let sem1=2; /* first semester considered */
%let semL=13; /* last semester considered */
data want;
ID=1;
do sem=&sem1 to &semL;
do ID=ID to round((1-(1-&r)**(sem-&sem1+1))*&n);
output;
end;
end;
sem=.;
do ID=ID to &n; /* these never drop out */
output;
end;
run;
Very interesting question.
If I understood what you mean.
data records;
do ID=1 to 100;
output;
end;
run;
data temp;
drop=2;n=10; /*initial value*/
do until(n=0);
do i=1 to n;
output;
end;
n+(-1);
drop+1;
end;
drop i;
run;
data want;
merge records temp;
run;
proc freq data=want;
table drop;
run;
Thank you for responding, @Ksharp.
I guess I should have highlighted how I wanted to use the program at the beginning. I need to use the program in other scenarios where the attrition rate might differ. The example of 100 students with 10% dropout was given as an example to make things easier to follow. I might have, say, 57 students with 5% attrition rate; or 165 students with 7% attrition rate. What you provided is more like a way to generate a sequence of numbers with a specific frequency.
OK. You need to change the initial value. Here is an example:
/*165 students with 7% attrition rate*/
data records;
do ID=1 to 165; /*165 students */
output;
end;
run;
data temp;
drop=2;n=12; /* 12=ceil(165*0.07) */
do until(n=0);
do i=1 to n;
output;
end;
n+(-1);
drop+1;
end;
drop i;
run;
data want;
merge records temp;
run;
proc freq data=want;
table drop;
run;
Hello @TC_,
I would use the cumulative number of dropped out students:
%let r=0.1; /* attrition rate */
%let n=100; /* initial number of students */
%let sem1=2; /* first semester considered */
%let semL=13; /* last semester considered */
data want;
ID=1;
do sem=&sem1 to &semL;
do ID=ID to round((1-(1-&r)**(sem-&sem1+1))*&n);
output;
end;
end;
sem=.;
do ID=ID to &n; /* these never drop out */
output;
end;
run;
Thank you very much, @FreelanceReinh. This solution is neat!
A follow-up question. If I want to generate the IDs first in one Data step, and assign sem in another Data step, is there a way to do so with a DO Loop? In other words, if I already have a dataset of students' info and want to assign a dropout semester to them, what's the best way to do so with a DO Loop?
A workaround I have is that I can use your code to create a new dataset, randomly sort the students' info dataset with rng, and merge the two without a BY variable.
Thank you!
@TC_ wrote:
Thank you very much, @FreelanceReinh. This solution is neat!
A follow-up question. If I want to generate the IDs first in one Data step, and assign sem in another Data step, is there a way to do so with a DO Loop? In other words, if I already have a dataset of students' info and want to assign a dropout semester to them, what's the best way to do so with a DO Loop?
A workaround I have is that I can use your code to create a new dataset, randomly sort the students' info dataset with rng, and merge the two without a BY variable.
Thank you!
You're welcome. One possible solution is to adapt the code as follows:
/* Create sample data for demonstration */ data have; id=5; do _n_=1 to 100; id=mod(67*id, 101); other_info=ranuni(1); output; end; run; /* Assign hypothetical dropout semester */ %let r=0.1; /* attrition rate */ %let sem1=2; /* first semester considered */ %let semL=13; /* last semester considered */ data want; if 0 then set have nobs=n; do sem=&sem1 to &semL; do _n_=_n_ to round((1-(1-&r)**(sem-&sem1+1))*n); link rw; end; end; sem=.; do _n_=_n_ to n; /* these never drop out */ link rw; end; rw: set have; output; run;
I interpreted your question a bit differently than @Ksharp and @FreelanceReinh (...which is daring...).
I believe the formula to calculate the survival rate for a student after n semesters with a drop out probability p of 0.1 per semesters would be: S(n) = (1-p)n
I couldn't figure out a way that doesn't require a loop for calculating the drop out semester. Soo.... here I go:
%let n_semesters=13;
%let drop_probability=0.1;
data have;
do semester=1 to &n_semesters;
do i=1 to 10000;
student_id=cats(semester,'_',i);
output;
end;
end;
drop i;
run;
data want;
set have;
if _n_=1 then call streaminit(1230497);
drop_out_flg=0;
do i=1 to semester;
if rand('uniform')<=&drop_probability then
do;
drop_out_flg=1;
drop_out_semester=i;
leave;
end;
end;
run;
%let sv_missing=%sysfunc(getoption(missing,keyword));
options missing=' ';
title 'drop/no drop per semester';
proc tabulate data=want noseps missing;
keylabel n=' ';
class drop_out_flg drop_out_semester;
table drop_out_semester all='Total', drop_out_flg;
run;
title;
options &sv_missing;
/* as a check: calculate survival for a student over all remaining semesters */
%let sv_missing=%sysfunc(getoption(missing,keyword));
options missing=' ';
data survival;
set have;
if _n_=1 then call streaminit(1230497);
drop_out_flg=0;
if rand('uniform')>(1-&drop_probability)**semester then drop_out_flg=1;
run;
title 'survival over all semesters';
proc tabulate data=survival noseps missing;
keylabel n=' ';
class drop_out_flg;
table drop_out_flg;
run;
title;
Thank you, @Patrick. I wanted to use random assignment as well, but my supervisor explicitly said not to do so at this point. Anyway, I like your random assignment. At this point, this is not what I wanted, but it will surely be helpful to keep in mind for my future work.
Thank you!
Should be able to use the NOBS= option of the SET statement to know how many to put into each group.
%let rate=0.10 ;
data want;
retain obs 1 semester 2;
do obs=obs to obs+int(&rate*(nobs-obs));
set records nobs=nobs;
output;
end;
semester+1;
run;
Results:
Thank you, @Tom! This solution also gets the job done and is quite concise. I really wish I could accept two solutions.
Read the entire dataset, using the NOBS= parameter with the drop rate to generate the target drop count for the initial LAST_SEM value. Then as the target for each semester is reached, update the target through the final eligible LAST_SEM, after which LAST_SEM is set to missing.
data have;
do id=1 to 100; output; end;
run;
%let rate=.1;
%let beg_drop=2;
%let end_drop=10;
data want;
retain last_sem &beg_drop _target 0 ;
set have nobs=nrecs;
if _n_=1 then _target=round(&rate * nrecs);
output;
if _n_= _target then do;
last_sem+1;
_target + round(&rate * (nrecs-_n_));
if last_sem>&end_drop then call missing(last_sem,_target);
end;
run;
Of course, this assumes that the nobs of HAVE and the dropout rate will generate at least one dropout in the first eligible time period. If that is a possible issue, then:
data want;
retain last_sem &beg_drop _target 0 ;
set have nobs=nrecs;
if _n_=1 then do;
_target=round(&rate * nrecs);
if _target=0 then call missing(last_sem,_target);
end;
output;
if _n_= _target then do;
last_sem+1;
_target + round(&rate * (nrecs-_n_));
if last_sem>&end_drop then call missing(last_sem,_target);
end;
run;
It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.