BookmarkSubscribeRSS Feed
sophia_SAS
Obsidian | Level 7

Hi SAS Users,

I have a question regarding how to set-up a sum function based on the dataset below. I think I need to use an array as well, but I am not certain. If you could offer some input, that'd be great.

My objective is to sum up the number of days the student is enrolled (aka "sum_enroll") using variables "enroll_start" to "enroll_end" to determine the start and end points for summing. I have included a description of the variables along with the desired dataset including the new variable "sum_enroll".

Thank you!

ID               = student

enroll_start  = day student was enrolled

enroll_end    = last day student was enrolled

day1-day10  = a binary variable (1, 0) indicated if they were in attendance on that day

sum_enroll   = sum of the number of days the student is enrolled

Note:  Variables enroll_start and enroll_end do not necessarily match up the with the first "1" and the last "1" value written for variables day1-day10.

data want;

input  ID   enroll_start enroll_end sum_enroll day1 day2 day3 day4 day5 day6 day7 day8 day9 day10;

         AA  day1          day10         10             1      1       1      1      1       1     1      1      1       1

         BB  day1          day 7            4             1      0       0      0      1       1     1      0      1       0

         CC  day4          day8             5             0      1       1      1      1      1      1       1      1      1

         DD   day5         day8             3             0      0       0      1      1       1     1       0      1      0

         EE   day3         day10            8            0      0       1      1      1       1     1      1       1      1

         FF   day1         day7              3             0      0       0      0      1       1      1      0       0      0

;;

12 REPLIES 12
DBailey
Lapis Lazuli | Level 10

Simplistic approach would be to subtract enroll_start from enroll_end and add 1 for the end date (presuming the values are actual dates)...

As an aside, I was confused by the value sum_enroll=4 for the second row.  Start day1 and end day7 means 7 days, right?

sophia_SAS
Obsidian | Level 7

Thanks for your response.  The objective isn't quite as simple as subtracting enroll_start from enroll_end and then + 1.  The 2nd row is an example of why.  Even though the enroll_start = 1 and enroll_end = 7, the answer is 4 as they were only in attendance on day 1, 4, 5 and 6.  I think PGstats has answered it though!

PGStats
Opal | Level 21

data have;
input ID $ enroll_start $ enroll_end $ day1 - day10;
datalines;
AA day1 day10 1 1 1 1 1 1 1 1 1 1
BB day1 day7 1 0 0 0 1 1 1 0 1 0
CC day4 day8 0 1 1 1 1 1 1 1 1 1
DD day5 day8 0 0 0 1 1 1 1 0 1 0
EE day3 day10 0 0 1 1 1 1 1 1 1 1
FF day1 day7 0 0 0 0 1 1 1 0 0 0
;

data want;
array d day1-day10;
set have;
do _n_ = input(substr(enroll_start,4),3.) to input(substr(enroll_end,4),3.);
     sum_enroll = sum( sum_enroll, d{_n_});
     end;
run;

PG

PG
sophia_SAS
Obsidian | Level 7

PGStats,

So I believe your code produces the desired outcome.  One question, when I adapt your code for my own dataset, I receive an error message from SAS indicating that my do loop is invalid (i.e. "Invalid Do loop control information . . .). This makes sense as I do have some day1-day10 values that begin with zero or missing values.  What do you advise I do to your code to fix this error?


Thank you again for your help!

PGStats
Opal | Level 21

I suspect the problem is with the data. On your second line of your data there is "day 7" with a space between day and 7. When you read that line, enroll_end is read as "day": it does not contain a number. You can either fix the data or count on the fact that there are always more than one space after enroll_start and enroll_end and change your input statement to:

input ID $ enroll_start &$ enroll_end &$ day1 - day10;

PG

PG
Linlin
Lapis Lazuli | Level 10

Hi PG,

I think the modified code may work if what you suspect is true.

data bad_data;
input ID $ enroll_start $ enroll_end $ day1 - day10;
datalines;
AA day1 day10 1 1 1 1 1 1 1 1 1 1
BB day1 day7 1 0 0 0 1 1 1 0 1 0
CC day4 day8 0 1 1 1 1 1 1 1 1 1
DD day5 day11 0 0 0 1 1 1 1 0 1 0
EE day3 day10 0 0 1 1 1 1 1 1 1 1
FF day8 day7 0 0 0 0 1 1 1 0 0 0
;

data want problem;
array d day1-day10;
set bad_data;
start=input(compress(enroll_start,,'kd'),2.);
end=input(compress(enroll_end,,'kd'),2.);
if missing(start) or missing(end) or end <start or end >10 then output problem;
else do;
do _n_ = start  to  end;
     sum_enroll = sum( sum_enroll, d{_n_});
     end;
  output want;
end;
run;

Message was edited by: Linlin

sophia_SAS
Obsidian | Level 7

Thanks PG and Linlin for your responses.  My variable names are correct (i.e. no spaces between day and 7 - it was just typo in the sample datatset I provided.)

However, you are correct there is a problem with my data.  The problem relates to the enroll_start and enroll_end variables.  I think the issue is that sometimes my enroll_start and enroll_end may exceed the number of days I have in my dataset.

How do I specify in the below formula that I want (for example)  enroll_start  = enroll_day+1 or until enroll_start=day10. (Note: the +1 and +6 are just examples.)

enroll_start = enroll_day+1;

enroll_end = enroll_day+6;

So an example of that would look like this:

ID   enroll_day enroll_start enroll_end day1-day10

GG  day7 day8 day10    0 0 0 0 0 0 0 0 1 0

;;

Using my current formula, I would get enroll_end = day13 but I believe this is the problem with the code as I only have day1-day10.

Does this make sense?

PGStats
Opal | Level 21

Yes it makes sense.  You can simply limit the range of the loop like this:

do _n_ = input(substr(enroll_start,4),3.) to min(input(substr(enroll_end,4),3.), hbound(d));

PG

PG
PGStats
Opal | Level 21

Another option is to create more days :

data have;

infile datalines missover;

input ID $ enroll_start $ enroll_end $ day1 - day30;

datalines;

...

Extra days will remain missing and be treated as zeros in the statement

  sum_enroll = sum( sum_enroll, d{_n_});

 

PG

PG
sophia_SAS
Obsidian | Level 7

Thanks for your help.  I did figure it out. I used a combination of your code along with Linlin and I got it to work perfectly!

Linlin
Lapis Lazuli | Level 10

data want;

  set have;

  days_enrolled=input(compress(enroll_end,,'kd'),2.)-input(compress(enroll_start,,'kd'),2.)+1;

  array _days(*) day1-day10;

  do i=1 to dim(_days);

  days_attended=sum(days_attended,_days(i));

  end;

  drop i;

run;

sophia_SAS
Obsidian | Level 7

Thanks for your response, Linlin.  Your code is pretty close though not exactly what I was looking for.  For example in row 2, your code returns a value of 5 for variable 'days_attended' when actually the value should be 4.  As I only want to count the days of attendence (i.e. Day 1, 4, 5, 6) within the set range (i.e. enroll_start - enroll_end).  Does that make sense?

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 12 replies
  • 1629 views
  • 0 likes
  • 4 in conversation