Hello All, I am working on a dataset that has year and month of enrollment of clients into a program (from March 2018 to February 2022). The dataset does not have information on when the clients exited the program. I have been trying to figure out:
1. the number of clients newly enrolled from one year to the next (for e.g., present in 2019 but not in 2018)
2. the number of clients who exited the program in the next year (for e.g., present in 2018 but not in in 2019)
3. number of clients retained (for e.g., present in both 2018 and 2019).
I would appreciate any help! Thanks!
The data looks like:
data enrollment;
input id$ year month;
datalines;
1 2018 10
1 2018 11
1 2018 12
1 2019 1
1 2019 2
2 2019 10
2 2019 11
2 2019 12
3 2018 3
3 2018 4
3 2018 5
3 2019 1
;
run;
As the question evolved here, the name of your input data set keeps changing as do the input data set vales. It makes it hard to refined the answer from previous work. So I'm going to provide both sample data and commented code here to help you apply this to your actual data.
This code creates the input data set:
data prepost;
input id$ year month;
datalines;
1 2018 2
1 2018 3
1 2018 4
1 2019 2
1 2019 3
1 2019 4
2 2018 8
2 2018 9
2 2018 10
3 2018 3
3 2018 4
3 2018 5
3 2018 8
3 2018 9
3 2018 10
;
Which looks like this:
| id | year | month |
|---|---|---|
| 1 | 2018 | 2 |
| 1 | 2018 | 3 |
| 1 | 2018 | 4 |
| 1 | 2019 | 2 |
| 1 | 2019 | 3 |
| 1 | 2019 | 4 |
| 2 | 2018 | 8 |
| 2 | 2018 | 9 |
| 2 | 2018 | 10 |
| 3 | 2018 | 3 |
| 3 | 2018 | 4 |
| 3 | 2018 | 5 |
| 3 | 2018 | 8 |
| 3 | 2018 | 9 |
| 3 | 2018 | 10 |
This code does the summarization:
proc sql;
create table prepost_q1 as
select id
/* Each logical expression produces a 1 if true, 0 if false */
,sum( month in (3,5))=0 /* Not enrolled first */
AND
sum( month in (6,8))>0 /* Enrolled second */ as preq2_newenrol
,sum( month in (3,5))>0 /* Enrolled first */
AND
sum( month in (6,8))=0 /* Not enrolled second */ as preq2_exited
,sum( month in (3,5))>0 /* Enrolled first */
AND
sum( month in (6,8))>0 /* Enrolled second */ as preq2_retained
from work.prepost
where year=2018
group by id
;
quit;
Producing this result:
| id | preq2_newenrol | preq2_exited | preq2_retained |
|---|---|---|---|
| 1 | 0 | 1 | 0 |
| 2 | 1 | 0 | 0 |
| 3 | 0 | 0 | 1 |
@SAS_SB wrote:
1. the number of clients newly enrolled from one year to the next (for e.g., present in 2019 but not in 2018)
2. the number of clients who exited the program in the next year (for e.g., present in 2018 but not in in 2019)
3. number of clients retained (for e.g., present in both 2018 and 2019).
So based on your description, month is irrelevant?
Is this data representative of the real problem? In other words, the only two years we are discussing is 2018 and 2019, and so the only code we have to write is to check these two years? Or does the real problem contain more than two years? (If the latter, provide data that spans more than two years)
Thank you @PaigeMiller !
Months could be useful if we want to have a different/specific time period. For e.g., March 2018 to Feb 2019, March 2019 to Feb 2020, etc. In addition, clients might not be in the program for all months of the year.
The example provided was just a truncated dataset. I have other years as well. Please find the revised example dataset as follows. Please let me know if this example is sufficient.
Please note that I have put dots (.) to indicate continuation of the months enrolled within the given year not the missing values.
data work.enrollment;
input id$ year month;
datalines;
1 2018 10
1 2018 11
1 2018 12
1 2019 1
1 2019 2
2 2019 10
2 2019 11
2 2019 12
3 2018 3
3 2018 4
3 2018 5
3 2019 1
3 2019 12
3 2020 1
3 2020 2
4 2018 9
4 2018 10
4 2018 11
4 2018 12
4 2019 1
4 2019 2
5 2018 12
5 2019 1
5 2019 2
. .... .
5 2019 12
5 2020 1
6 2019 1
. .... .
6 2019 12
6 2020 1
. .... .
6 2020 12
6 2021 1
. .... .
6 2021 12
6 2022 1
6 2022 2
;
run;
So continuing on ... I'm still not sure I understand what you want.
If a client is enrolled in January, 2018 but not the rest of the year, and is enrolled in December, 2019 but not the rest of the year ... how is that treated. Is it new enrollment, is it disenrollment, is it retention?
Also, I don't like this:
Months could be useful...
Either we are supposed to be using months in the logic and code, or we are not.
Sorry for being unclear.
So I would like to have the number of individuals with new enrollment, retention, and disenrollment for the periods: from period 1 (March 2018 to Feb 2019) to period 2 (March 2019 to Feb 2020) and from period 2 to period 3 (March 2020 to Feb 2021). So, how many were retained in period 2 from period 1? How many were disenrolled when moving from period 1 to period 2? etc.
"If a client is enrolled in January, 2018 but not the rest of the year, and is enrolled in December, 2019 but not the rest of the year ... how is that treated. Is it new enrollment, is it disenrollment, is it retention?"
For my case it would be retention.
Yes we will use months in the code.
Thanks, again!
I think I understand now, but again I ask for a more representative data set. In particular, I cannot work with data that has dots in it like this:
5 2020 1 6 2019 1 . .... . 6 2019 12 6 2020 1
Right now, there are still only two time periods represented in the part of the data I can use, IDs 1 through 4, those two time periods being March 2018 to Feb 2019 and March 2019 to Feb 2020.
Here is the revised dataset.
Many thanks in advance!
data work.enrollment;
input id$ year month;
datalines;
1 2018 10
1 2018 11
1 2018 12
1 2019 1
1 2019 2
2 2019 10
2 2019 11
2 2019 12
3 2018 3
3 2018 4
3 2018 5
3 2019 1
3 2019 12
3 2020 1
3 2020 2
4 2018 9
4 2018 10
4 2018 11
4 2018 12
4 2019 1
4 2019 2
5 2018 12
5 2019 1
5 2019 2
5 2019 3
5 2019 4
5 2019 5
5 2019 6
5 2019 7
5 2019 8
5 2019 9
5 2019 10
5 2019 11
5 2019 12
5 2020 1
5 2020 2
6 2019 1
6 2019 2
6 2019 3
6 2019 4
6 2019 5
6 2019 6
6 2019 7
6 2019 8
6 2019 9
6 2019 10
6 2019 11
6 2019 12
6 2020 1
6 2020 2
6 2020 3
6 2020 4
6 2020 5
6 2020 6
6 2020 7
6 2020 8
6 2020 9
6 2020 10
6 2020 11
6 2020 12
6 2021 1
6 2021 2
6 2021 3
6 2021 4
6 2021 5
6 2021 6
6 2021 7
6 2021 8
6 2021 19
6 2021 10
6 2021 11
6 2021 12
6 2022 1
6 2022 2
;
run;
Thanks. This is what I have, it works for this data
data enrollment1;
set enrollment;
yymm=mdy(month,1,year);
last_month_of_time_period=intnx('year.3',yymm,0,'e');
year_of_enrollment=year(last_month_of_time_period)-1;
enrolled=1;
format yymm last_month_of_time_period yymmn6.;
run;
data enrollment2;
set enrollment1;
by id year_of_enrollment;
if last.year_of_enrollment;
run;
proc transpose data=enrollment2 out=enrollment3 prefix=yr_;
by id;
var enrolled;
id year_of_enrollment;
run;
data want;
set enrollment3(drop=_name_);
array yr yr_2018-yr_2020;
array new newenrollment_2018-newenrollment_2020;
array dis disenrollment_2018-disenrollment_2020;
array ret retention_2018-retention_2020;
do i=1 to dim(yr);
ret(i)=0;
dis(i)=0;
new(i)=0;
if i=1 and yr(i)=1 then new(i)=1;
if i>1 then do;
if yr(i)=1 and yr(i-1)=. then new(i)=1;
else if yr(i)=. and yr(i-1)=1 then dis(i)=1;
else if yr(i)=1 and yr(i-1)=1 then ret(i)=1;
end;
end;
drop i;
run;
proc summary data=want;
var new: dis: ret:;
output out=percents mean=;
run;
Thank you @PaigeMiller !
I am a beginner learning SAS. So, it will take some time for me to understand the code. I am learning the different SAS statements and options. Meanwhile, if there is any simpler (for learners like me!), it would be super helpful.
Many thanks!
There may be simpler ways, but I have written it relatively step by step, which might make it easier to understand.
The first data set named ENROLLMENT1 computes the last month of the time period (which I think is always Feb) and then finds the corresponding year (so Feb 2019 is really part of the 2018 year)
The second data set named ENROLLMENT2 winnows the data set down to just one record per time period
The third data set ENROLLMENT3 is transpose of ENROLLMENT2, you should look at it to see what I have done. Now that the data is in this form, a series of IF statement in WANT computes wherther or not the record was new enrollment, disenrollment or retention for each year. And then finally PROC SUMMARY computes to percent of each, by year.
Thank you @PaigeMiller. The explanation is helpful.
One quick question, what would be the codes if the end month is other than February? I would like to have several quarters as time periods. For example, March to May 2018 as quarter 1 (q1), June to August 2018 as q2, September to November 2018 as q3, and so on. What is the number of unique clients newly enrolled, exited, and retained from q1 to q2, q2 to q3, and so on?
Thank you, again!
This would require somewhat of a re-write of the code, and I don't have the time right now to do this.
No worries! When you get time, please write down the code. I will be keeping an eye for that.
Thank you!
Hello All,
I am working on a dataset that has year and month of enrollment of clients into a program (from March 2018 to February 2022). The dataset does not have information on when the clients exited the program. I have been trying to figure out:
1. the number of clients newly enrolled from one year to the next (for e.g., present in 2019 but not in 2018)
2. the number of clients who exited the program in the next year (for e.g., present in 2018 but not in in 2019)
3. number of clients retained (for e.g., present in both 2018 and 2019).
I would appreciate any help!
Thanks!
The data looks like:
data enrollment;
input id$ year month;
datalines;
1 2018 10
1 2018 11
1 2018 12
1 2019 1
1 2019 2
2 2019 10
2 2019 11
2 2019 12
3 2018 3
3 2018 4
3 2018 5
3 2019 1
;
run;
With SQL, a little boolean logic can create numeric flags for each ID:
proc sql;
select ID
, SUM(year=2018)=0 and SUM(year=2019)>0 as NewEnrollment
, SUM(year=2018)>0 and SUM(year=2019)=0 as Exited
, SUM(year=2018)>0 and SUM(year=2019)>0 as Retained
from enrollment
group by ID
;
quit;
Producing a report like this:
| id | NewEnrollment | Exited | Retained |
|---|---|---|---|
| 1 | 0 | 1 | 0 |
| 2 | 1 | 0 | 0 |
| 3 | 0 | 0 | 1 |
Which can be used as an inline view to get the overall report:
proc sql;
select sum(NewEnrollment) as TotalNew
,sum(Exited) as TotalExited
,sum(Retained) as TotalRetained
from (select ID
, SUM(year=2018)=0 and SUM(year=2019)>0 as NewEnrollment
, SUM(year=2018)>0 and SUM(year=2019)=0 as Exited
, SUM(year=2018)>0 and SUM(year=2019)>0 as Retained
from enrollment
group by ID)
;
quit;
and, violá!
| TotalNew | TotalExited | TotalRetained |
|---|---|---|
| 1 | 1 | 1 |
It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.