Hi SAS Users,
I have a question regarding how to set-up a sum function based on the dataset below. I think I need to use an array as well, but I am not certain. If you could offer some input, that'd be great.
My objective is to sum up the number of days the student is enrolled (aka "sum_enroll") using variables "enroll_start" to "enroll_end" to determine the start and end points for summing. I have included a description of the variables along with the desired dataset including the new variable "sum_enroll".
Thank you!
ID = student
enroll_start = day student was enrolled
enroll_end = last day student was enrolled
day1-day10 = a binary variable (1, 0) indicated if they were in attendance on that day
sum_enroll = sum of the number of days the student is enrolled
Note: Variables enroll_start and enroll_end do not necessarily match up the with the first "1" and the last "1" value written for variables day1-day10.
data want;
input ID enroll_start enroll_end sum_enroll day1 day2 day3 day4 day5 day6 day7 day8 day9 day10;
AA day1 day10 10 1 1 1 1 1 1 1 1 1 1
BB day1 day 7 4 1 0 0 0 1 1 1 0 1 0
CC day4 day8 5 0 1 1 1 1 1 1 1 1 1
DD day5 day8 3 0 0 0 1 1 1 1 0 1 0
EE day3 day10 8 0 0 1 1 1 1 1 1 1 1
FF day1 day7 3 0 0 0 0 1 1 1 0 0 0
;;
Simplistic approach would be to subtract enroll_start from enroll_end and add 1 for the end date (presuming the values are actual dates)...
As an aside, I was confused by the value sum_enroll=4 for the second row. Start day1 and end day7 means 7 days, right?
Thanks for your response. The objective isn't quite as simple as subtracting enroll_start from enroll_end and then + 1. The 2nd row is an example of why. Even though the enroll_start = 1 and enroll_end = 7, the answer is 4 as they were only in attendance on day 1, 4, 5 and 6. I think PGstats has answered it though!
data have;
input ID $ enroll_start $ enroll_end $ day1 - day10;
datalines;
AA day1 day10 1 1 1 1 1 1 1 1 1 1
BB day1 day7 1 0 0 0 1 1 1 0 1 0
CC day4 day8 0 1 1 1 1 1 1 1 1 1
DD day5 day8 0 0 0 1 1 1 1 0 1 0
EE day3 day10 0 0 1 1 1 1 1 1 1 1
FF day1 day7 0 0 0 0 1 1 1 0 0 0
;
data want;
array d day1-day10;
set have;
do _n_ = input(substr(enroll_start,4),3.) to input(substr(enroll_end,4),3.);
sum_enroll = sum( sum_enroll, d{_n_});
end;
run;
PG
PGStats,
So I believe your code produces the desired outcome. One question, when I adapt your code for my own dataset, I receive an error message from SAS indicating that my do loop is invalid (i.e. "Invalid Do loop control information . . .). This makes sense as I do have some day1-day10 values that begin with zero or missing values. What do you advise I do to your code to fix this error?
Thank you again for your help!
I suspect the problem is with the data. On your second line of your data there is "day 7" with a space between day and 7. When you read that line, enroll_end is read as "day": it does not contain a number. You can either fix the data or count on the fact that there are always more than one space after enroll_start and enroll_end and change your input statement to:
input ID $ enroll_start &$ enroll_end &$ day1 - day10;
PG
Hi PG,
I think the modified code may work if what you suspect is true.
data bad_data;
input ID $ enroll_start $ enroll_end $ day1 - day10;
datalines;
AA day1 day10 1 1 1 1 1 1 1 1 1 1
BB day1 day7 1 0 0 0 1 1 1 0 1 0
CC day4 day8 0 1 1 1 1 1 1 1 1 1
DD day5 day11 0 0 0 1 1 1 1 0 1 0
EE day3 day10 0 0 1 1 1 1 1 1 1 1
FF day8 day7 0 0 0 0 1 1 1 0 0 0
;
data want problem;
array d day1-day10;
set bad_data;
start=input(compress(enroll_start,,'kd'),2.);
end=input(compress(enroll_end,,'kd'),2.);
if missing(start) or missing(end) or end <start or end >10 then output problem;
else do;
do _n_ = start to end;
sum_enroll = sum( sum_enroll, d{_n_});
end;
output want;
end;
run;
Message was edited by: Linlin
Thanks PG and Linlin for your responses. My variable names are correct (i.e. no spaces between day and 7 - it was just typo in the sample datatset I provided.)
However, you are correct there is a problem with my data. The problem relates to the enroll_start and enroll_end variables. I think the issue is that sometimes my enroll_start and enroll_end may exceed the number of days I have in my dataset.
How do I specify in the below formula that I want (for example) enroll_start = enroll_day+1 or until enroll_start=day10. (Note: the +1 and +6 are just examples.)
enroll_start = enroll_day+1;
enroll_end = enroll_day+6;
So an example of that would look like this:
ID enroll_day enroll_start enroll_end day1-day10
GG day7 day8 day10 0 0 0 0 0 0 0 0 1 0
;;
Using my current formula, I would get enroll_end = day13 but I believe this is the problem with the code as I only have day1-day10.
Does this make sense?
Yes it makes sense. You can simply limit the range of the loop like this:
do _n_ = input(substr(enroll_start,4),3.) to min(input(substr(enroll_end,4),3.), hbound(d));
PG
Another option is to create more days :
data have;
infile datalines missover;
input ID $ enroll_start $ enroll_end $ day1 - day30;
datalines;
...
Extra days will remain missing and be treated as zeros in the statement
sum_enroll = sum( sum_enroll, d{_n_});
PG
Thanks for your help. I did figure it out. I used a combination of your code along with Linlin and I got it to work perfectly!
data want;
set have;
days_enrolled=input(compress(enroll_end,,'kd'),2.)-input(compress(enroll_start,,'kd'),2.)+1;
array _days(*) day1-day10;
do i=1 to dim(_days);
days_attended=sum(days_attended,_days(i));
end;
drop i;
run;
Thanks for your response, Linlin. Your code is pretty close though not exactly what I was looking for. For example in row 2, your code returns a value of 5 for variable 'days_attended' when actually the value should be 4. As I only want to count the days of attendence (i.e. Day 1, 4, 5, 6) within the set range (i.e. enroll_start - enroll_end). Does that make sense?
Don't miss out on SAS Innovate - Register now for the FREE Livestream!
Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.