Hi SAS Users,
I have a question regarding how to set-up a sum function based on the dataset below. I think I need to use an array as well, but I am not certain. If you could offer some input, that'd be great.
My objective is to sum up the number of days the student is enrolled (aka "sum_enroll") using variables "enroll_start" to "enroll_end" to determine the start and end points for summing. I have included a description of the variables along with the desired dataset including the new variable "sum_enroll".
Thank you!
ID = student
enroll_start = day student was enrolled
enroll_end = last day student was enrolled
day1-day10 = a binary variable (1, 0) indicated if they were in attendance on that day
sum_enroll = sum of the number of days the student is enrolled
Note: Variables enroll_start and enroll_end do not necessarily match up the with the first "1" and the last "1" value written for variables day1-day10.
data want;
input ID enroll_start enroll_end sum_enroll day1 day2 day3 day4 day5 day6 day7 day8 day9 day10;
AA day1 day10 10 1 1 1 1 1 1 1 1 1 1
BB day1 day 7 4 1 0 0 0 1 1 1 0 1 0
CC day4 day8 5 0 1 1 1 1 1 1 1 1 1
DD day5 day8 3 0 0 0 1 1 1 1 0 1 0
EE day3 day10 8 0 0 1 1 1 1 1 1 1 1
FF day1 day7 3 0 0 0 0 1 1 1 0 0 0
;;
Simplistic approach would be to subtract enroll_start from enroll_end and add 1 for the end date (presuming the values are actual dates)...
As an aside, I was confused by the value sum_enroll=4 for the second row. Start day1 and end day7 means 7 days, right?
Thanks for your response. The objective isn't quite as simple as subtracting enroll_start from enroll_end and then + 1. The 2nd row is an example of why. Even though the enroll_start = 1 and enroll_end = 7, the answer is 4 as they were only in attendance on day 1, 4, 5 and 6. I think PGstats has answered it though!
data have;
input ID $ enroll_start $ enroll_end $ day1 - day10;
datalines;
AA day1 day10 1 1 1 1 1 1 1 1 1 1
BB day1 day7 1 0 0 0 1 1 1 0 1 0
CC day4 day8 0 1 1 1 1 1 1 1 1 1
DD day5 day8 0 0 0 1 1 1 1 0 1 0
EE day3 day10 0 0 1 1 1 1 1 1 1 1
FF day1 day7 0 0 0 0 1 1 1 0 0 0
;
data want;
array d day1-day10;
set have;
do _n_ = input(substr(enroll_start,4),3.) to input(substr(enroll_end,4),3.);
sum_enroll = sum( sum_enroll, d{_n_});
end;
run;
PG
PGStats,
So I believe your code produces the desired outcome. One question, when I adapt your code for my own dataset, I receive an error message from SAS indicating that my do loop is invalid (i.e. "Invalid Do loop control information . . .). This makes sense as I do have some day1-day10 values that begin with zero or missing values. What do you advise I do to your code to fix this error?
Thank you again for your help!
I suspect the problem is with the data. On your second line of your data there is "day 7" with a space between day and 7. When you read that line, enroll_end is read as "day": it does not contain a number. You can either fix the data or count on the fact that there are always more than one space after enroll_start and enroll_end and change your input statement to:
input ID $ enroll_start &$ enroll_end &$ day1 - day10;
PG
Hi PG,
I think the modified code may work if what you suspect is true.
data bad_data;
input ID $ enroll_start $ enroll_end $ day1 - day10;
datalines;
AA day1 day10 1 1 1 1 1 1 1 1 1 1
BB day1 day7 1 0 0 0 1 1 1 0 1 0
CC day4 day8 0 1 1 1 1 1 1 1 1 1
DD day5 day11 0 0 0 1 1 1 1 0 1 0
EE day3 day10 0 0 1 1 1 1 1 1 1 1
FF day8 day7 0 0 0 0 1 1 1 0 0 0
;
data want problem;
array d day1-day10;
set bad_data;
start=input(compress(enroll_start,,'kd'),2.);
end=input(compress(enroll_end,,'kd'),2.);
if missing(start) or missing(end) or end <start or end >10 then output problem;
else do;
do _n_ = start to end;
sum_enroll = sum( sum_enroll, d{_n_});
end;
output want;
end;
run;
Message was edited by: Linlin
Thanks PG and Linlin for your responses. My variable names are correct (i.e. no spaces between day and 7 - it was just typo in the sample datatset I provided.)
However, you are correct there is a problem with my data. The problem relates to the enroll_start and enroll_end variables. I think the issue is that sometimes my enroll_start and enroll_end may exceed the number of days I have in my dataset.
How do I specify in the below formula that I want (for example) enroll_start = enroll_day+1 or until enroll_start=day10. (Note: the +1 and +6 are just examples.)
enroll_start = enroll_day+1;
enroll_end = enroll_day+6;
So an example of that would look like this:
ID enroll_day enroll_start enroll_end day1-day10
GG day7 day8 day10 0 0 0 0 0 0 0 0 1 0
;;
Using my current formula, I would get enroll_end = day13 but I believe this is the problem with the code as I only have day1-day10.
Does this make sense?
Yes it makes sense. You can simply limit the range of the loop like this:
do _n_ = input(substr(enroll_start,4),3.) to min(input(substr(enroll_end,4),3.), hbound(d));
PG
Another option is to create more days :
data have;
infile datalines missover;
input ID $ enroll_start $ enroll_end $ day1 - day30;
datalines;
...
Extra days will remain missing and be treated as zeros in the statement
sum_enroll = sum( sum_enroll, d{_n_});
PG
Thanks for your help. I did figure it out. I used a combination of your code along with Linlin and I got it to work perfectly!
data want;
set have;
days_enrolled=input(compress(enroll_end,,'kd'),2.)-input(compress(enroll_start,,'kd'),2.)+1;
array _days(*) day1-day10;
do i=1 to dim(_days);
days_attended=sum(days_attended,_days(i));
end;
drop i;
run;
Thanks for your response, Linlin. Your code is pretty close though not exactly what I was looking for. For example in row 2, your code returns a value of 5 for variable 'days_attended' when actually the value should be 4. As I only want to count the days of attendence (i.e. Day 1, 4, 5, 6) within the set range (i.e. enroll_start - enroll_end). Does that make sense?
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.