Solved: How to assign different values by row number or proportion

TC_ · Posted 10-24-2024 02:28 AM

I want to simulate student attrition. I have a bunch of student currently enrolled, assume starting the 2nd semester, 10% of those who are still enrolled drop out each semester. So with 100 students, the number of students dropping out is 10, 9, 8, 7...

I want to create a variable (2,3,4...) indicating which semester the students drop out. This is the dataset I start with

data records;
	do ID=1 to 100;
	output;
	end;
run;

And I want to get results like this below (no randomness at all)

I also need to apply this program to other scenarios where the attrition rate and number of students may differ. This is what I am doing and it works.

%let attr_rate=0.10;
data proj3;
	set proj2 nobs=_nobs_; *_nobs_=number of rows;

	if _N_<=round(_nobs_*&attr_rate) then last_sem=2;
	else if _N_<=round(_nobs_*&attr_rate)+round(_nobs_*(1-&attr_rate)*&attr_rate) then last_sem=3;
	else if _N_<=round(_nobs_*&attr_rate)+round(_nobs_*(1-&attr_rate)*&attr_rate)+round(_nobs_*(1-&attr_rate)**2*&attr_rate) then last_sem=4;

run;

But as you can see, the code gets really long as I cover more semesters. And I have up to 13 semesters to cover.

Is there a better and/or smarter way to achieve this?

Thank you very much!

FreelanceReinh · Posted 10-24-2024 06:07 AM

Hello @TC_,

I would use the cumulative number of dropped out students:

%let r=0.1;   /* attrition rate */
%let n=100;   /* initial number of students */
%let sem1=2;  /* first semester considered */
%let semL=13; /* last semester considered */

data want;
ID=1;
do sem=&sem1 to &semL;
  do ID=ID to round((1-(1-&r)**(sem-&sem1+1))*&n);
    output;
  end;
end;
sem=.;
do ID=ID to &n; /* these never drop out */
  output;
end;
run;

View solution in original post

Ksharp · Posted 10-24-2024 04:49 AM

Very interesting question.

If I understood what you mean.




data records;
	do ID=1 to 100;
	output;
	end;
run;

data temp;
drop=2;n=10;  /*initial value*/
do until(n=0);
do i=1 to n;
 output;
end;
n+(-1);
drop+1;
end;
drop i;
run;

data want;
 merge records temp;
run;

proc freq data=want;
table drop;
run;

TC_ · Posted 10-24-2024 11:15 AM

Thank you for responding, @Ksharp.

I guess I should have highlighted how I wanted to use the program at the beginning. I need to use the program in other scenarios where the attrition rate might differ. The example of 100 students with 10% dropout was given as an example to make things easier to follow. I might have, say, 57 students with 5% attrition rate; or 165 students with 7% attrition rate. What you provided is more like a way to generate a sequence of numbers with a specific frequency.

Ksharp · Posted 10-24-2024 09:02 PM

OK. You need to change the initial value. Here is an example:

/*165 students with 7% attrition rate*/

data records;
	do ID=1 to 165;  /*165 students */
	output;
	end;
run;

data temp;
drop=2;n=12;  /* 12=ceil(165*0.07) */
do until(n=0);
do i=1 to n;
 output;
end;
n+(-1);
drop+1;
end;
drop i;
run;

data want;
 merge records temp;
run;

proc freq data=want;
table drop;
run;

FreelanceReinh · Posted 10-24-2024 06:07 AM

Hello @TC_,

I would use the cumulative number of dropped out students:

%let r=0.1;   /* attrition rate */
%let n=100;   /* initial number of students */
%let sem1=2;  /* first semester considered */
%let semL=13; /* last semester considered */

data want;
ID=1;
do sem=&sem1 to &semL;
  do ID=ID to round((1-(1-&r)**(sem-&sem1+1))*&n);
    output;
  end;
end;
sem=.;
do ID=ID to &n; /* these never drop out */
  output;
end;
run;

TC_ · Posted 10-24-2024 11:44 AM

Thank you very much, @FreelanceReinh. This solution is neat!

A follow-up question. If I want to generate the IDs first in one Data step, and assign sem in another Data step, is there a way to do so with a DO Loop? In other words, if I already have a dataset of students' info and want to assign a dropout semester to them, what's the best way to do so with a DO Loop?

A workaround I have is that I can use your code to create a new dataset, randomly sort the students' info dataset with rng, and merge the two without a BY variable.

Thank you!

FreelanceReinh · Posted 10-24-2024 12:21 PM

@TC_ wrote:

Thank you very much, @FreelanceReinh. This solution is neat!

A follow-up question. If I want to generate the IDs first in one Data step, and assign sem in another Data step, is there a way to do so with a DO Loop? In other words, if I already have a dataset of students' info and want to assign a dropout semester to them, what's the best way to do so with a DO Loop?

A workaround I have is that I can use your code to create a new dataset, randomly sort the students' info dataset with rng, and merge the two without a BY variable.

Thank you!

You're welcome. One possible solution is to adapt the code as follows:

/* Create sample data for demonstration */

data have;
id=5;
do _n_=1 to 100;
  id=mod(67*id, 101);
  other_info=ranuni(1);
  output;
end;
run;

/* Assign hypothetical dropout semester */

%let r=0.1;   /* attrition rate */
%let sem1=2;  /* first semester considered */
%let semL=13; /* last semester considered */

data want;
if 0 then set have nobs=n;
do sem=&sem1 to &semL;
  do _n_=_n_ to round((1-(1-&r)**(sem-&sem1+1))*n);
    link rw;
  end;
end;
sem=.;
do _n_=_n_ to n; /* these never drop out */
  link rw;
end;
rw:
  set have;
  output;
run;

TC_ · Posted 10-25-2024 02:11 AM

Awesome, thank you!

Patrick · Posted 10-24-2024 08:52 AM

I interpreted your question a bit differently than @Ksharp and @FreelanceReinh (...which is daring...).

I believe the formula to calculate the survival rate for a student after n semesters with a drop out probability p of 0.1 per semesters would be: S(n) = (1-p)ⁿ

I couldn't figure out a way that doesn't require a loop for calculating the drop out semester. Soo.... here I go:

%let n_semesters=13;
%let drop_probability=0.1;
data have;
  do semester=1 to &n_semesters;
    do i=1 to 10000;
      student_id=cats(semester,'_',i);
      output;
    end;
  end;
  drop i;
run;

data want;
  set have;
  if _n_=1 then call streaminit(1230497);

  drop_out_flg=0;
  do i=1 to semester;
    if rand('uniform')<=&drop_probability then
      do;
        drop_out_flg=1;
        drop_out_semester=i;
        leave;
      end;
  end;
run;

%let sv_missing=%sysfunc(getoption(missing,keyword));
options missing=' ';
title 'drop/no drop per semester';
proc tabulate data=want noseps missing;
  keylabel n=' ';
  class drop_out_flg drop_out_semester;
  table drop_out_semester all='Total', drop_out_flg;
run;
title;
options &sv_missing;

/* as a check: calculate survival for a student over all remaining semesters */
%let sv_missing=%sysfunc(getoption(missing,keyword));
options missing=' ';
data survival;
  set have;
  if _n_=1 then call streaminit(1230497);

  drop_out_flg=0;
  if rand('uniform')>(1-&drop_probability)**semester then drop_out_flg=1;
run;

title 'survival over all semesters';
proc tabulate data=survival noseps missing;
  keylabel n=' ';
  class drop_out_flg;
  table drop_out_flg;
run;
title;

TC_ · Posted 10-24-2024 11:54 AM

Thank you, @Patrick. I wanted to use random assignment as well, but my supervisor explicitly said not to do so at this point. Anyway, I like your random assignment. At this point, this is not what I wanted, but it will surely be helpful to keep in mind for my future work.

Thank you!

Tom · Posted 10-24-2024 12:14 PM

Should be able to use the NOBS= option of the SET statement to know how many to put into each group.

%let rate=0.10 ;

data want;
  retain obs 1 semester 2;
  do obs=obs to obs+int(&rate*(nobs-obs));
    set records nobs=nobs;
    output;
  end;
  semester+1;
run;

Results:

TC_ · Posted 10-25-2024 02:15 AM

Thank you, @Tom! This solution also gets the job done and is quite concise. I really wish I could accept two solutions.

mkeintz · Posted 10-24-2024 11:42 PM

Read the entire dataset, using the NOBS= parameter with the drop rate to generate the target drop count for the initial LAST_SEM value. Then as the target for each semester is reached, update the target through the final eligible LAST_SEM, after which LAST_SEM is set to missing.

data have;
  do id=1 to 100; output; end;
run;

%let rate=.1;
%let beg_drop=2;
%let end_drop=10;

data want;
  retain last_sem &beg_drop _target 0 ;

  set have nobs=nrecs;
  if _n_=1 then _target=round(&rate * nrecs);
  output;

  if _n_= _target then do;
    last_sem+1;
    _target + round(&rate * (nrecs-_n_));
    if last_sem>&end_drop then call missing(last_sem,_target);
  end;
run;

Of course, this assumes that the nobs of HAVE and the dropout rate will generate at least one dropout in the first eligible time period. If that is a possible issue, then:


data want;
  retain last_sem &beg_drop _target 0 ;

  set have nobs=nrecs;
  if _n_=1 then do;
    _target=round(&rate * nrecs);
    if _target=0 then call missing(last_sem,_target);
  end;
  output;

  if _n_= _target then do;
    last_sem+1;
    _target + round(&rate * (nrecs-_n_));
    if last_sem>&end_drop then call missing(last_sem,_target);
  end;
run;

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

TC_ · Posted 10-25-2024 03:04 AM

Thank you for your response and the explanation!

How to assign different values by row number or proportion

Re: How to assign different values by row number or proportion

Re: How to assign different values by row number or proportion

Re: How to assign different values by row number or proportion

Re: How to assign different values by row number or proportion

Re: How to assign different values by row number or proportion

Re: How to assign different values by row number or proportion

Re: How to assign different values by row number or proportion

Re: How to assign different values by row number or proportion

Re: How to assign different values by row number or proportion

Re: How to assign different values by row number or proportion

Re: How to assign different values by row number or proportion

Re: How to assign different values by row number or proportion

Re: How to assign different values by row number or proportion

Re: How to assign different values by row number or proportion

Registration is open

SAS Training: Just a Click Away