DATA Step, Macro, Functions and more

Sum - only last three dates by ID

Reply
New Contributor
Posts: 3

Sum - only last three dates by ID

[ Edited ]

hi Guyz

 

I need solution to this problem. I have this dataset below,I only want to calculate the total points for the last three dates for each of the ID variables. I have tried what i know but still giving me wrong values. Not to summerise all by the ID but only to summerise the last three dates when sorted in ascending date.

 

DATA DATES;
INPUT ID : $8. VISIT : DATE9. POINT : 8.;
FORMAT VISIT DATE9.;
DATALINES;
600 20MAY2016 20
600 21JUN2016 30
600 23JUL2016 50
600 30AUG2016 50
600 21SEP2016 30
600 31OCT2016 12
600 21NOV2016 23
600 22DEC2016 22
600 20JAN2017 30
100 20MAY2016 78
100 21JUN2016 54
100 23JUL2016 56
100 30AUG2016 56
100 21SEP2016 89
100 31OCT2016 46
100 21NOV2016 78
100 22DEC2016 78
100 20JAN2017 34
100 22FEB2017 65
100 24MAR2017 45
100 30APR2017 45
200 20MAY2016 90
200 21JUN2016 78
200 23JUL2016 45
200 30AUG2016 45
200 21SEP2016 54
200 31OCT2016 67
200 21NOV2016 78
200 22DEC2016 67
200 20JAN2017 56
200 22FEB2017 43
200 24MAR2017 66
200 30APR2017 65
300 20MAY2016 78
300 21JUN2016 66
300 23JUL2016 55
300 30AUG2016 44
300 21SEP2016 33
300 31OCT2016 11
300 21NOV2016 56
300 22DEC2016 43
300 20JAN2017 78
300 22FEB2017 55
300 24MAR2017 55
300 30APR2017 33
400 20MAY2016 87
400 21JUN2016 67
400 23JUL2016 68
400 30AUG2016 43
400 21SEP2016 46
400 31OCT2016 32
400 21NOV2016 32
400 22DEC2016 44
400 20JAN2017 44
400 22FEB2017 43
400 24MAR2017 54
400 30APR2017 66
500 20MAY2016 69
500 21JUN2016 54
500 23JUL2016 67
500 30AUG2016 67
500 21SEP2016 54
500 31OCT2016 43
500 21NOV2016 43
500 22DEC2016 67
500 20JAN2017 56
500 22FEB2017 56
500 24MAR2017 56
500 30APR2017 43
;
RUN;
PROC Star
Posts: 7,364

Re: SUM

[ Edited ]

Here is one way:

data want (drop=count count2);
  do until (last.id);
    set dates;
    by id notsorted;
    if first.id then count=1;
    else count+1;
  end;
  do until (last.id);
    set dates;
    by id notsorted;
    if first.id then do;
      count2=1;
      sum=0;
    end;
    else count2+1;
    if count2 ge count-2 then sum+point;
    if last.id then output;
  end;
run;

Art, CEO, AnalystFinder.com

New Contributor
Posts: 3

Re: SUM

Thanks guys,do you guyz know how i can automate the solution such that,it gives the user flexibility to select number of months he or she wants to sum up.If the user wants last 3 or 4 or5 month,he or she can do that without having to change the whole code.Kind of macro variable

DATA DATES;
INPUT id:$8. VISITSmiley Very HappyATE9. POINT:8.;
FORMAT VISIT DATE9.;
DATALINES;
600 20MAY2016 20
600 21JUN2016 30
600 23JUL2016 50
600 30AUG2016 50
600 21SEP2016 30
600 31OCT2016 12
600 21NOV2016 23
600 22DEC2016 22
600 20JAN2017 30
100 20MAY2016 78
100 21JUN2016 54
100 23JUL2016 56
100 30AUG2016 56
100 21SEP2016 89
100 31OCT2016 46
100 21NOV2016 78
100 22DEC2016 78
100 20JAN2017 34
100 22FEB2017 65
100 24MAR2017 45
100 30APR2017 45
200 20MAY2016 90
200 21JUN2016 78
200 23JUL2016 45
200 30AUG2016 45
200 21SEP2016 54
200 31OCT2016 67
200 21NOV2016 78
200 22DEC2016 67
200 20JAN2017 56
200 22FEB2017 43
200 24MAR2017 66
200 30APR2017 65
300 20MAY2016 78
300 21JUN2016 66
300 23JUL2016 55
300 30AUG2016 44
300 21SEP2016 33
300 31OCT2016 11
300 21NOV2016 56
300 22DEC2016 43
300 20JAN2017 78
300 22FEB2017 55
300 24MAR2017 55
300 30APR2017 33
400 20MAY2016 87
400 21JUN2016 67
400 23JUL2016 68
400 30AUG2016 43
400 21SEP2016 46
400 31OCT2016 32
400 21NOV2016 32
400 22DEC2016 44
400 20JAN2017 44
400 22FEB2017 43
400 24MAR2017 54
400 30APR2017 66
500 20MAY2016 69
500 21JUN2016 54
500 23JUL2016 67
500 30AUG2016 67
500 21SEP2016 54
500 31OCT2016 43
500 21NOV2016 43
500 22DEC2016 67
500 20JAN2017 56
500 22FEB2017 56
500 24MAR2017 56
500 30APR2017 43
;
RUN;

PROC Star
Posts: 7,364

Re: SUM

Using the method I suggested only one number would have to be represented as a macro variable, namely:

   if count2 ge count-2 then sum+point;

The -2 represents the number of months minus one.

 

Art, CEO, AnalystFinder.com

 

PROC Star
Posts: 63

Re: SUM


abasiamaokonna0 wrote:

Thanks guys,do you guyz know how i can automate the solution such that,it gives the user flexibility to select number of months he or she wants to sum up.If the user wants last 3 or 4 or5 month,he or she can do that without having to change the whole code.Kind of macro variable

 


I have this suggestion - I put in a check to see if there actually were data points enough for a given ID, the actual number of data points used is in the variable COUNT. I was too lazy to sort your data, so I put in a NOTSORTED in the BY statement. You will see that if you use a high value (e.g. 30) for &N, you will get the same output as from PROC SUMMARY.

%let n=3;
data want;
  do _N_=1 by 1 until(last.id);
    set dates;
    by id notsorted;
    array points(0:%eval(&n-1)) _temporary_;
    points(mod(_N_,&n))=point;
    end;
  count=min(_N_,&n); /* the number of data points actually available for this ID, if less than &n */
  /* null the datapoints from previous IDs, if any */ 
do _N_=_N_+1 to &n; points(mod(_N_,&n))=.; end; sum=sum(of points(*)); run;
Super User
Posts: 10,538

Re: SUM


abasiamaokonna0 wrote:

Thanks guys,do you guyz know how i can automate the solution such that,it gives the user flexibility to select number of months he or she wants to sum up.If the user wants last 3 or 4 or5 month,he or she can do that without having to change the whole code.Kind of macro variable

 


How will you collect the desired interval data from the user? Will they supply "number of last months" or an interval (start date and end date)?

What is the desired result if there are fewer than the requested "months" of data available?

What is the desired result if the requested Interval has one or both endpoints outside of the available data?

What is the desired result if the request is for a period where there are no records for one of the ID values?

 

Also with your data intervals more than a little irregular you might even have to define "month"

Valued Guide
Posts: 797

Re: SUM

[ Edited ]

Assuming the data are grouped by ID (and sorted by date within ID), there is no need for do loops, or extra proc's.  Just maintain a temporary array carrying the most recent 3 (or some other number) of POINTS values, as in:

 

DATA DATES;
INPUT ID : $8. VISIT : DATE9. POINT : 8.;
FORMAT VISIT DATE9.;
DATALINES;
600 20MAY2016 20
600 21JUN2016 30
600 23JUL2016 50
600 30AUG2016 50
600 21SEP2016 30
600 31OCT2016 12
600 21NOV2016 23
600 22DEC2016 22
600 20JAN2017 30
100 20MAY2016 78
100 21JUN2016 54
100 23JUL2016 56
100 30AUG2016 56
100 21SEP2016 89
100 31OCT2016 46
100 21NOV2016 78
100 22DEC2016 78
100 20JAN2017 34
100 22FEB2017 65
100 24MAR2017 45
100 30APR2017 45
200 20MAY2016 90
200 21JUN2016 78
200 23JUL2016 45
200 30AUG2016 45
200 21SEP2016 54
200 31OCT2016 67
200 21NOV2016 78
200 22DEC2016 67
200 20JAN2017 56
200 22FEB2017 43
200 24MAR2017 66
200 30APR2017 65
300 20MAY2016 78
300 21JUN2016 66
300 23JUL2016 55
300 30AUG2016 44
300 21SEP2016 33
300 31OCT2016 11
300 21NOV2016 56
300 22DEC2016 43
300 20JAN2017 78
300 22FEB2017 55
300 24MAR2017 55
300 30APR2017 33
400 20MAY2016 87
400 21JUN2016 67
400 23JUL2016 68
400 30AUG2016 43
400 21SEP2016 46
400 31OCT2016 32
400 21NOV2016 32
400 22DEC2016 44
400 20JAN2017 44
400 22FEB2017 43
400 24MAR2017 54
400 30APR2017 66
500 20MAY2016 69
500 21JUN2016 54
500 23JUL2016 67
500 30AUG2016 67
500 21SEP2016 54
500 31OCT2016 43
500 21NOV2016 43
500 22DEC2016 67
500 20JAN2017 56
500 22FEB2017 56
500 24MAR2017 56
500 30APR2017 43
;
RUN;


%let ndate=3;
data want (keep=id total_points);
  set dates;
  by id notsorted;

  array pts {&ndate} _temporary_;
  if first.id then call missing(of pts{*});
  pts{mod(_N_,&ndate)+1}=point;
  if last.id;
  total_points=sum(of pts{*});
run;

 

This program does the following:

  1.  Assumes the data are grouped by ID (but the NOTSORTED parameter in the BY statement allows the ID groups to be in any order.
  2. The TEMPORARY array
    1. Is the same size as the number of trailing dates wanted.  Just change the macrovar NDATES to get different trailing window sizes.
    2. has its values automatically retained from observation to observation.
  3. The mod(_N_,3) function is the remainder of the observation number divided by 3 (for NDATES=3), resulting in a 0, 1, or 2.  Adding 1 yields a way to put POINT in elements 1, 2, or 3 of the PTS 3-element array.
PROC Star
Posts: 63

Re: SUM

mkeintz,

Just a couple of points:

  • The loop in my suggestion automatically gives the _N_ value the number of data points read from that ID, meaning that you will not have to calculate that otherwise in case the user entered a numeric value greater than the smallest number of observations in a by group.
  • by zero-basing the array (array pts (0:%eval(&ndate-1)) _temporary_Smiley Wink you do not have to subtract 1 from the output of the MOD() function every time. I think this saves CPU, and makes the program simpler. 

 

mkeintz wrote:

Assuming the data are grouped by ID (and sorted by date within ID), there is no need for do loops, or extra proc's.  Just maintain a temporary array carrying the most recent 3 (or some other number) of POINTS values, as in:

 

DATA DATES;
INPUT ID : $8. VISIT : DATE9. POINT : 8.;
FORMAT VISIT DATE9.;
DATALINES;
600 20MAY2016 20
600 21JUN2016 30
600 23JUL2016 50
600 30AUG2016 50
600 21SEP2016 30
600 31OCT2016 12
600 21NOV2016 23
600 22DEC2016 22
600 20JAN2017 30
100 20MAY2016 78
100 21JUN2016 54
100 23JUL2016 56
100 30AUG2016 56
100 21SEP2016 89
100 31OCT2016 46
100 21NOV2016 78
100 22DEC2016 78
100 20JAN2017 34
100 22FEB2017 65
100 24MAR2017 45
100 30APR2017 45
200 20MAY2016 90
200 21JUN2016 78
200 23JUL2016 45
200 30AUG2016 45
200 21SEP2016 54
200 31OCT2016 67
200 21NOV2016 78
200 22DEC2016 67
200 20JAN2017 56
200 22FEB2017 43
200 24MAR2017 66
200 30APR2017 65
300 20MAY2016 78
300 21JUN2016 66
300 23JUL2016 55
300 30AUG2016 44
300 21SEP2016 33
300 31OCT2016 11
300 21NOV2016 56
300 22DEC2016 43
300 20JAN2017 78
300 22FEB2017 55
300 24MAR2017 55
300 30APR2017 33
400 20MAY2016 87
400 21JUN2016 67
400 23JUL2016 68
400 30AUG2016 43
400 21SEP2016 46
400 31OCT2016 32
400 21NOV2016 32
400 22DEC2016 44
400 20JAN2017 44
400 22FEB2017 43
400 24MAR2017 54
400 30APR2017 66
500 20MAY2016 69
500 21JUN2016 54
500 23JUL2016 67
500 30AUG2016 67
500 21SEP2016 54
500 31OCT2016 43
500 21NOV2016 43
500 22DEC2016 67
500 20JAN2017 56
500 22FEB2017 56
500 24MAR2017 56
500 30APR2017 43
;
RUN;


%let ndate=3;
data want (keep=id total_points);
  set dates;
  by id notsorted;

  array pts {&ndate} _temporary_;
  if first.id then call missing(of pts{*});
  pts{mod(_N_,&ndate)+1}=point;
  if last.id;
  total_points=sum(of pts{*});
run;

 

This program does the following:

  1.  Assumes the data are grouped by ID (but the NOTSORTED parameter in the BY statement allows the ID groups to be in any order.
  2. The TEMPORARY array
    1. Is the same size as the number of trailing dates wanted.  Just change the macrovar NDATES to get different trailing window sizes.
    2. has its values automatically retained from observation to observation.
  3. The mod(_N_,3) function is the remainder of the observation number divided by 3 (for NDATES=3), resulting in a 0, 1, or 2.  Adding 1 yields a way to put POINT in elements 1, 2, or 3 of the PTS 3-element array.

 

Valued Guide
Posts: 797

Re: SUM

@s_lassen  - Somehow I overlooked your response, which preceded mine substantially.  Wouldn't have been so expansive in my explanation if I had seen it.

 

Yes, by all means let array have an index range of 0 to ndate-1 instead of 1 to ndate.  I didn't want to overburden the OP with what I suspected would be more new knowledge than wanted.

 

However, I'm not sure  your program as presented does what is intended.  In particular the "sum=sum(of points{*})" follows (rather than precedes) resetting of the array to missing.

PROC Star
Posts: 1,566

Re: SUM

[ Edited ]

A running total can be maintained without an array.

This will work even if there are fewer records than need to be summed.


%let n=3;
data WANT(keep= GRP_ID SUM);
  set DATES end=LASTOBS ;
  GRP_ID=lag(ID);
  if _N_>1 & ID ne GRP_ID then do;
    output;  
    call missing(SUM);
  end;
  SUM + sum(POINT, - (lag&n(ID)=ID) * lag&n(POINT) );
  if LASTOBS then output;  
run;

 

PROC Star
Posts: 63

Re: SUM


mkeintz wrote:

@s_lassen  - Somehow I overlooked your response, which preceded mine substantially.  Wouldn't have been so expansive in my explanation if I had seen it.

 

Yes, by all means let array have an index range of 0 to ndate-1 instead of 1 to ndate.  I didn't want to overburden the OP with what I suspected would be more new knowledge than wanted.

 

However, I'm not sure  your program as presented does what is intended.  In particular the "sum=sum(of points{*})" follows (rather than precedes) resetting of the array to missing.


Actually, the "setting of the array to missing" only concerns the elements that were not set by this ID's observations, in case there were fewer observations than wanted.

Super User
Posts: 17,912

Re: Sum - only last three dates by ID

@abasiamaokonna0 I edited your post for some small insignificant grammatical errors, but also to format your code. 

Please use the insert code (6/7th) icons in the Rich Text Editor to insert your code, otherwise the forum can garble your code, especially if it has colons. I also changed your subject line to be more descriptive. 

PROC Star
Posts: 1,566

Re: Sum - only last three dates by ID

[ Edited ]

Here is another way:


data WANT;
  merge DATES 
        DATES (firstobs=2 rename=( ID=ID2 POINT=POINT2))
        DATES (firstobs=3 rename=( ID=ID3 POINT=POINT3))
        DATES (firstobs=4 rename=( ID=ID4 POINT=POINT4));
  if ID3 ne ID4; %* Check that ID3 is the last ID, i.e. ID4 is a new ID;
  SUM=sum(POINT, POINT2, POINT3);
  output;
  keep ID SUM;
run;

This assumes there are always at least 3 dates per ID, and no check is made to ensure this.

 

ID SUM
600 75
100 155
200 174
300 143
400 163
500 155
PROC Star
Posts: 172

Re: Sum - only last three dates by ID

[ Edited ]

Another fun way:

/*Assuming the number of records in each group is divisible by 3 as it is in your sample to conveniently make it sets of three*/

data want;

     do until(last.id);

           do _n_=1 by 1 until(mod(_n_,3)=0);

                set dates;

                by id notsorted;

                array temp{3} _temporary_;

                temp(_n_)=point;

 

                if last.id then

                     do;

                           sum=sum(of temp(*));

                           output;

                     end;

           end;

     end;

run;

 

Regards,

Naveen Srinivasan

Super User
Posts: 9,687

Re: Sum - only last three dates by ID

DATA DATES;
INPUT ID : $8. VISIT : DATE9. POINT : 8.;
FORMAT VISIT DATE9.;
DATALINES;
600 20MAY2016 20
600 21JUN2016 30
600 23JUL2016 50
600 30AUG2016 50
600 21SEP2016 30
600 31OCT2016 12
600 21NOV2016 23
600 22DEC2016 22
600 20JAN2017 30
100 20MAY2016 78
100 21JUN2016 54
100 23JUL2016 56
100 30AUG2016 56
100 21SEP2016 89
100 31OCT2016 46
100 21NOV2016 78
100 22DEC2016 78
100 20JAN2017 34
100 22FEB2017 65
100 24MAR2017 45
100 30APR2017 45
200 20MAY2016 90
200 21JUN2016 78
200 23JUL2016 45
200 30AUG2016 45
200 21SEP2016 54
200 31OCT2016 67
200 21NOV2016 78
200 22DEC2016 67
200 20JAN2017 56
200 22FEB2017 43
200 24MAR2017 66
200 30APR2017 65
300 20MAY2016 78
300 21JUN2016 66
300 23JUL2016 55
300 30AUG2016 44
300 21SEP2016 33
300 31OCT2016 11
300 21NOV2016 56
300 22DEC2016 43
300 20JAN2017 78
300 22FEB2017 55
300 24MAR2017 55
300 30APR2017 33
400 20MAY2016 87
400 21JUN2016 67
400 23JUL2016 68
400 30AUG2016 43
400 21SEP2016 46
400 31OCT2016 32
400 21NOV2016 32
400 22DEC2016 44
400 20JAN2017 44
400 22FEB2017 43
400 24MAR2017 54
400 30APR2017 66
500 20MAY2016 69
500 21JUN2016 54
500 23JUL2016 67
500 30AUG2016 67
500 21SEP2016 54
500 31OCT2016 43
500 21NOV2016 43
500 22DEC2016 67
500 20JAN2017 56
500 22FEB2017 56
500 24MAR2017 56
500 30APR2017 43
;
RUN;
data temp;
 merge dates dates(keep=id rename=(id=_id) firstobs=4);
 if id ne _id;
run;
proc summary data=temp nway;
class id;
var point;
output out=want sum=;
run;
Ask a Question
Discussion stats
  • 15 replies
  • 278 views
  • 1 like
  • 10 in conversation