DATA Step, Macro, Functions and more

Summing columns based on start and stop codes

Reply
Frequent Contributor
Posts: 78

Summing columns based on start and stop codes

I have data in the following format:

id date code amt1 amt2
1 6/16/2004 100 90 63
1 6/16/2004 200 350 245
1 6/23/2004 200 350 245
1 6/23/2004 100 90 63
1 6/26/2004 100 90 63
1 6/28/2004 100 90 63
1 6/28/2004 200 350 245
1 6/30/2004 200 350 245
1 6/30/2004 100 90 63
1 7/1/2004 200 350 245
1 7/1/2004 100 95 66.5
1 7/3/2004 300 1275 892.5
1 7/3/2004 300 130 91
1 7/6/2004 500 625 437.5
1 7/6/2004 300 210 94.7
1 7/21/2004 300 60 31.83
1 7/28/2004 300 60 31.83
1 8/10/2004 300 85 30
1 8/10/2004 300 85 55.26
1 8/10/2004 300 350 138.04
2 7/25/2001 300 20 14.25
2 7/25/2001 200 350 175
2 7/25/2001 200 350 15
2 7/31/2001 300 20 0
2 7/31/2001 300 20 14.25
2 8/2/2001 300 20 0
2 8/2/2001 300 20 14.25
2 8/2/2001 100 90 59.1
2 8/2/2001 100 90 0.75
2 8/2/2001 200 350 190
2 8/4/2001 300 20 0
2 8/4/2001 300 20 14.25
2 8/4/2001 200 350 189.25
2 8/4/2001 200 350 0.75
2 8/6/2001 300 20 0
2 8/6/2001 300 20 14.25
2 8/6/2001 200 350 175
2 8/6/2001 200 350 15
2 8/8/2001 300 1275 1020
2 8/8/2001 300 130 125.4
2 8/11/2001 500 600 480
2 12/4/2001 300 20 14.25
2 12/4/2001 100 90 44.85
2 12/4/2001 100 90 15
2 12/8/2001 300 20 0
2 12/8/2001 300 20 14.25
2 12/8/2001 200 350 175
2 12/8/2001 200 350 15
2 12/7/2001 200 350 175
2 12/7/2001 200 350 15
2 12/7/2001 300 20 14.25
2 12/10/2001 300 20 0
2 12/10/2001 300 20 14.25
2 12/10/2001 200 350 189.25
2 12/10/2001 200 350 0.75
2 12/12/2001 300 1275 892.5
2 12/12/2001 300 130 125.4
2 12/15/2001 500 600 420

I need to do something with the data in this format that I only know how to do using arrays.

I want to be able to sum the two amt columns based on anything between codes 100 and 500 including any code that might fall on the same date as code 100 and 500. There can be more than one event for and id, so we could have more than one 100 to 500 total per id. Any codes before or after 100 and 500 can be ignored.

I started using this code, but my program just runs forever.

PROC SORT DATA=TEST; BY ID DESCENDING DATE;

DATA ADD (KEEP = ID DATE AMT1 AMT2) ;
SET TEST;
BY ID;

RETAIN AMT1 AMT2;

IF FIRST.ID THEN DO;
TOTALAMT1=0;
TOTALAMT2=0;
END;

IF CODE = 500 THEN DO UNTIL (CODE=100);
TOTALAMT1=AMT1+TOTALAMT1;
TOTALAMT2=AMT2+TOTALAMT2;
END;
IF LAST.ID THEN OUTPUT;
RUN;

Thanks in advance.
Super Contributor
Super Contributor
Posts: 3,174

Re: Summing columns based on start and stop codes

Your SET statement must be inside the DO loop, as well as other statements around your SET.

Suggest you can debug by adding PUTLOG _ALL_; statements at various locations in the program to better understand what's going on and where/when.

Scott Barry
SBBWorks, Inc. Message was edited by: sbb
SAS Super FREQ
Posts: 8,743

Re: Summing columns based on start and stop codes

Hi:
I am confused by your requirements. You said that
"I want to be able to sum the two amt columns based on anything between codes 100 and 500 including any code that might fall on the same date as code 100 and 500. There can be more than one event for and id, so we could have more than one 100 to 500 total per id. Any codes before or after 100 and 500 can be ignored."


Let's take a concrete example. What if you have a code of 700 that falls on 6/30/2004, as shown below:
[pre]
SCENARIO 1
id date code amt1 amt2
1 6/16/2004 100 90 63
1 6/16/2004 200 350 245
1 6/23/2004 200 350 245
1 6/23/2004 100 90 63
1 6/26/2004 100 90 63
1 6/28/2004 100 90 63
1 6/28/2004 200 350 245
1 6/30/2004 200 350 245
1 6/30/2004 100 90 63
1 6/30/2004 700 77 77 <---same date as code 100 and 200 but code=700
[/pre]

Because it falls on the same day as a code 100 and a code 200, would you want the code 700 added in or not??? You said "including any code that might fall on the same date as code 100 and 500" -- so would this row with the code=700 be in or out??? You also said that "Any codes before or after 100 and 500 can be ignored." And I don't know whether you mean BEFORE or AFTER the earliest date for a code 100 through 500, for example (different data scenario):
[pre]
SCENARIO 2
id date code amt1 amt2
1 6/12/2004 800 88 88 <---date before 6/16 and code=800
1 6/16/2004 100 90 63
1 6/16/2004 200 350 245
1 6/23/2004 200 350 245
1 6/23/2004 100 90 63
1 6/26/2004 100 90 63
1 6/28/2004 100 90 63
1 6/28/2004 200 350 245
1 6/30/2004 200 350 245
1 6/30/2004 100 90 63
[/pre]

Can you clarify???

cynthia
Frequent Contributor
Posts: 78

Re: Summing columns based on start and stop codes

So, I want the start date to be the first code 100 and the stop date to be code 500. So any code in between and including those two dates will be summed. Any code that falls on a date prior to code 100 or any code that falls on a date after 500 will not be included.

Does this make more sense?

Thanks!
SAS Super FREQ
Posts: 8,743

Re: Summing columns based on start and stop codes

Hi:
Will ALL the codes be between the range 100 to 500???? You did not respond to my question about either of the data scenarios. Either you know something about the data that you are not sharing (there could never be a code=700 or a code=800) or ???

My interpretation of what you say, given my 2 data scenarios above is that code=800 (for data scenario 2) would be excluded because it falls on a date BEFORE the first 100 code; but then for data scenario #1, I am uncertain. Even though code=700, the code falls on the same date as the "OK" codes. So the way I interpret your logic, the code=700 row should be included in the total.

Can you address the question I asked about the 2 possible data scenarios? Will the codes ALWAYS be between 100 and 500?????

Here's a different scenario...more like scenario 2, but with a different value for CODE. So in addition to answering my specific questions for Scenario 1 and 2, what would you do with this data scenario (#3):
[pre]
SCENARIO 3

id date code amt1 amt2
1 6/06/2004 333 33 33 <-- code falls between 100 and 500, but date is before first 100 code
1 6/16/2004 100 90 63
1 6/16/2004 200 350 245
1 6/23/2004 200 350 245
1 6/23/2004 100 90 63
1 6/26/2004 100 90 63
1 6/28/2004 100 90 63
1 6/28/2004 200 350 245
1 6/30/2004 200 350 245
1 6/30/2004 100 90 63
[/pre]

cynthia
Frequent Contributor
Posts: 78

Re: Summing columns based on start and stop codes

The code numbers do not matter other than 100 and 500. I think I did address this already. Code 333 would NOT be included because the date is before the 100 code. The code number does not matter if the date falls between 100 and 500. Any code between these two dates should be included.

You are correct in you interpretation for Scenario 2, if the date falls before the code 100 date, it will not be included. If the date of code 700 falls between the dates of codes 100 and 500 then it will be included.

The codes DO NOT MATTER, they will not always fall between 100 and 500, these aren't even real code numbers, I made them up. Sorry for the confusion.
Super User
Posts: 10,500

Re: Summing columns based on start and stop codes

Your code is running forever because of this line:
IF CODE = 500 THEN DO UNTIL (CODE=100);

Since you have nothing changing the value of CODE the first time CODE=500 it goes into the loop and never exits because CODE is always 500.
Frequent Contributor
Posts: 78

Re: Summing columns based on start and stop codes

Thanks, I see now why the code never stops.
Super Contributor
Super Contributor
Posts: 3,174

Re: Summing columns based on start and stop codes

Consider using PROC SUMMARY and have a numeric format to "group" (summarize) related observations (code values), using a FORMAT statement with a CLASS variable. No DATA step required, I would estimate.

Scott Barry
SBBWorks, Inc.
Regular Contributor
Posts: 241

Re: Summing columns based on start and stop codes

This seems to be straight-forward to me. Below sums amounts starting from the obs with code=100 until an obs with code=500 within an id, reading observations in the chronological order within each id group.



   /* test data */


   data one;


     input id date :anydtdte. code amt1 amt2;


   cards;


   1 6/16/2004 100 90 63


   1 6/16/2004 200 350 245


   1 6/23/2004 200 350 245


   1 6/23/2004 100 90 63


   1 6/26/2004 100 90 63


   1 6/28/2004 100 90 63


   1 6/28/2004 200 350 245


   1 6/30/2004 200 350 245


   1 6/30/2004 100 90 63


   1 7/1/2004 200 350 245


   1 7/1/2004 100 95 66.5


   1 7/3/2004 300 1275 892.5


   1 7/3/2004 300 130 91


   1 7/6/2004 500 625 437.5


   1 7/6/2004 300 210 94.7


   1 7/21/2004 300 60 31.83


   1 7/28/2004 300 60 31.83


   1 8/10/2004 300 85 30


   1 8/10/2004 300 85 55.26


   1 8/10/2004 300 350 138.04


   2 7/25/2001 300 20 14.25


   2 7/25/2001 200 350 175


   2 7/25/2001 200 350 15


   2 7/31/2001 300 20 0


   2 7/31/2001 300 20 14.25


   2 8/2/2001 300 20 0


   2 8/2/2001 300 20 14.25


   2 8/2/2001 100 90 59.1


   2 8/2/2001 100 90 0.75


   2 8/2/2001 200 350 190


   2 8/4/2001 300 20 0


   2 8/4/2001 300 20 14.25


   2 8/4/2001 200 350 189.25


   2 8/4/2001 200 350 0.75


   2 8/6/2001 300 20 0


   2 8/6/2001 300 20 14.25


   2 8/6/2001 200 350 175


   2 8/6/2001 200 350 15


   2 8/8/2001 300 1275 1020


   2 8/8/2001 300 130 125.4


   2 8/11/2001 500 600 480


   2 12/4/2001 300 20 14.25


   2 12/4/2001 100 90 44.85


   2 12/4/2001 100 90 15


   2 12/8/2001 300 20 0


   2 12/8/2001 300 20 14.25


   2 12/8/2001 200 350 175


   2 12/8/2001 200 350 15


   2 12/7/2001 200 350 175


   2 12/7/2001 200 350 15


   2 12/7/2001 300 20 14.25


   2 12/10/2001 300 20 0


   2 12/10/2001 300 20 14.25


   2 12/10/2001 200 350 189.25


   2 12/10/2001 200 350 0.75


   2 12/12/2001 300 1275 892.5


   2 12/12/2001 300 130 125.4


   2 12/15/2001 500 600 420


   ;


   run;


 


   proc sort data=one;


     by id date;


   run;


 


   data two;


 


      length id start finish tot1 tot2 8;


      format start finish mmddyy10.;


      keep id--tot2;


 


      do until (last.id);


         link setOne;


         if code = 100 then do


            link init;


            do while (1);


               link doSum;


               if code = 500 | last.id then do;


                  link doOut;


                  leave;


               end;


               link setOne;


            end;


         end;


      end;


      return;


 


      setOne:


        set one;


        by id date;


      return;


 


      init:


        call missing(tot1, tot2);


        start = date;


        finish = .;


      return;


 


      doSum:


        tot1 + amt1;


        tot2 + amt2;


      return;  


 


      doOut:


        finish = date;


        output;


      return;


   run;


 


   /* check */


   proc print data=two noobs;


   run;


   /* on lst


   id         start        finish    tot1      tot2


    1    06/16/2004    07/06/2004    4325    3027.50


    2    08/02/2001    08/11/2001    4015    2283.75


    2    12/04/2001    12/15/2001    4385    2110.50


   */

Frequent Contributor
Posts: 78

Re: Summing columns based on start and stop codes

Yes, this is doing exactly what I need. Thanks so much for your help on this. The link statement is new to me, but definitely something I can use in the future.

I appreciate everyone's help on this.
Super Contributor
Super Contributor
Posts: 3,174

Re: Summing columns based on start and stop codes

Yes, I now see that the accumulation is not based on value-ranges but a specific "condition" occurrence that warrants a "break" to start a new observation.

So, an approach that still leverages simplified summarization logic would be to increment a count variable when a new observation-group is warranted, then use PROC SUMMARY (instead of DATA step accumulation logic) to collapse/summarize.

Scott Barry
SBBWorks, Inc. Message was edited by: sbb
Frequent Contributor
Posts: 78

Re: Summing columns based on start and stop codes

I have run into one more unanticipated problem to deal with. There are some instances with one particular code that I only want the amt1 to be summed one time, but want all of amt2 to be summed when an even occurs on the same date. Here is an example using the same data of one id:

id date code amt1 amt2
2 12/4/2001 300 20 14.25
2 12/4/2001 100 90 44.85
2 12/4/2001 100 90 15
2 12/8/2001 300 20 0
2 12/8/2001 300 20 14.25
2 12/8/2001 200 350 175
2 12/8/2001 200 350 15
2 12/7/2001 200 350 175
2 12/7/2001 200 350 15
2 12/7/2001 300 20 14.25
2 12/10/2001 300 20 0
2 12/10/2001 300 20 14.25
2 12/10/2001 200 350 189.25
2 12/10/2001 200 350 0.75

2 12/12/2001 300 1275 892.5
2 12/12/2001 300 130 125.4
2 12/15/2001 500 600 420

for the example in bold, I only want to sum one instance of amt1 that falls on the same date. So I would need the raw data to look like this:

id date code amt1 amt2
2 12/4/2001 300 20 14.25
2 12/4/2001 100 90 44.85
2 12/4/2001 100 90 15
2 12/8/2001 300 20 0
2 12/8/2001 300 20 14.25
2 12/8/2001 200 350 175
2 12/8/2001 200 350 15
2 12/7/2001 200 350 175
2 12/7/2001 200 350 15
2 12/7/2001 300 20 14.25
2 12/10/2001 300 20 0
2 12/10/2001 300 20 14.25
2 12/10/2001 200 350 189.25
2 12/10/2001 200 0 0.75

2 12/12/2001 300 1275 892.5
2 12/12/2001 300 130 125.4
2 12/15/2001 500 600 420

I only need to do this for one code and I this is something I should do prior to creating the totals.
Ask a Question
Discussion stats
  • 12 replies
  • 335 views
  • 0 likes
  • 5 in conversation