Solved: Select maximum date among the subject

KalaBhairava · Posted 08-10-2022 07:45 AM

I have data like table-1 and I want it like table-2

I want maximum date for all the records among the one subject.

Table-1

subject	date
1	3/8/2022	0
2	3/9/2022	0
2	3/10/2022	1
2	3/11/2022	1
3	3/12/2022	0

Table-2

subject	date
1	3/8/2022	0
2	3/11/2022	0
2	3/11/2022	1
2	3/11/2022	1
3	3/12/2022	0

PeterClemmensen · Posted 08-10-2022 08:11 AM

Or

data have;
input subject date :mmddyy10. dummy;
format date mmddyy10.;
datalines;
1 3/8/2022  0 
2 3/9/2022  0 
2 3/10/2022 1 
2 3/11/2022 1 
3 3/12/2022 0 
;

data want(drop = d);
   do _N_ = 1 by 1 until (last.subject);
      set have;
      by subject;
      if date > d then d = date;
   end;
   do _N_ = 1 to _N_;
      set have;
      date = d;
      output;
   end;
run;

The DATA to DATA Step Macro
Blog: SASnrd

View solution in original post

PeterClemmensen · Posted 08-10-2022 08:02 AM

data have;
input subject date :mmddyy10. dummy;
format date mmddyy10.;
datalines;
1 3/8/2022  0 
2 3/9/2022  0 
2 3/10/2022 1 
2 3/11/2022 1 
3 3/12/2022 0 
;

proc sql;
   create table want as
   select subject
        , max(date) as date format = mmddyy10.
        , dummy
   from have
   group by subject
   ;
quit;

Result

subject  date        dummy
1        03/08/2022  0 
2        03/11/2022  1 
2        03/11/2022  0 
2        03/11/2022  1 
3        03/12/2022  0

The DATA to DATA Step Macro
Blog: SASnrd

Anne_A · Posted 08-10-2022 08:07 AM

Hi KalaBhairava,

Here's a quick solution:

proc sort data = table1 out=temp;
	by subject descending date;
run;

data table2 (drop = h_d);
	set temp;
	by subject;
	retain h_d;
	if first.subject then h_d = date;
	date = h_d;
run;

HTH,

Cheers.

PeterClemmensen · Posted 08-10-2022 08:11 AM

Or

data have;
input subject date :mmddyy10. dummy;
format date mmddyy10.;
datalines;
1 3/8/2022  0 
2 3/9/2022  0 
2 3/10/2022 1 
2 3/11/2022 1 
3 3/12/2022 0 
;

data want(drop = d);
   do _N_ = 1 by 1 until (last.subject);
      set have;
      by subject;
      if date > d then d = date;
   end;
   do _N_ = 1 to _N_;
      set have;
      date = d;
      output;
   end;
run;

The DATA to DATA Step Macro
Blog: SASnrd

mkeintz · Posted 08-10-2022 09:59 AM

If the data are sorted by subject:

data want;
  set have (in=firstpass) have (in=secondpass);
  by subject;
  retain maxdate;
  if first.subject then maxdate=date;
  else if firstpass then maxdate=max(date,maxdate);
  if secondpass;
run;

But if the data are not sorted, and sorting is too expensive then you should use @FreelanceReinh 's suggestion. (I've struck out my original content below - it was wrong).

data want;
  set have;
  if _n_=1 then do;
    if 0 then set have (rename=(date=maxdate));
    declare hash h (dataset:'have (keep=subject date rename=(date=maxdate))',multidata:'Y',ordered:'D');
      h.definekey('subject');
      h.definedata('maxdate');
      h.definedone();
  end;
  h.find();
run;

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

FreelanceReinh · Posted 08-10-2022 02:24 PM

Thanks, @mkeintz. I always like the elegance of your trademark "firstpass-secondpass" solutions.

I think this technique could even help in the "not sorted" case to support the hash object approach. Here's my attempt to combine the two:

data want(drop=rc);
  set have (in=firstpass) have (in=secondpass);
  if _n_=1 then do;
    dcl hash h();
    h.definekey('subject');
    h.definedata('maxdate');
    h.definedone();
    maxdate=.;
  end;
  rc=h.find();
  if firstpass & date>maxdate then h.replace(key: subject, data: date);
  if secondpass;
  format maxdate mmddyy10.;
run;

mkeintz · Posted 08-10-2022 07:13 PM

@FreelanceReinh

I appreciate your positive comment about the firstpass/secondpass paradigm.

But if the data is sorted, there is likely an efficiency difference compared to your suggestion. When you use it with a BY statement, as in

data want;
    set have (in=firstpass) have (in=secondpass);
    by subject;
    .....

the data are interleaved, meaning a given ID group is read twice before proceeding to the next ID group. It'll be more efficient because that BY group data is likely sitting in memory allocated as a disk buffer after the firstpass. So the secondpass likely reads from memory rather than over the disk channel from external storage.

But if you have to do it without the BY statement, as in:

data want;
    set have (in=firstpass) have (in=secondpass);
    .... other code ....

it reads the ENTIRE dataset for firstpass before rereading during secondpass. You'll almost certainly be forced to implement a lot more disk activity.

Of course, my solution for unsorted data using the hash object statement declare hash h (dataset:'have'.....) would have the same disadvantage.

In that case, your solution would likely be a bit better because your hash object would take a lot less memory, having only one "row" per SUBJECT.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

FreelanceReinh · Posted 08-11-2022 03:41 AM

Thanks. Very good point about reading data from memory vs. disk. Yes, for sorted data I wouldn't have suggested this approach.

Spoiler

Actually, my main motivation for creating this solution was to "rescue" your hash object idea, which I think is unlikely to work in the original form because the "ordered:'D'" argument tag refers to the key values only, i.e., won't sort the dates (or has this changed in a SAS release newer than 9.4M5?). Adding 'maxdate' to the key portion of h would hamper the intended lookup based only on subject.

Actually, my main motivation for creating this solution was to "rescue" your hash object idea, which I think is unlikely to work in the original form because the "ordered:'D'" argument tag refers to the key values only, i.e., won't sort the dates (or has this changed in a SAS release newer than 9.4M5?). Adding 'maxdate' to the key portion of h would hamper the intended lookup based only on subject.

Select maximum date among the subject

Re: Select maximum date among the subject

Re: Select maximum date among the subject

Re: Select maximum date among the subject

Re: Select maximum date among the subject

Re: Select maximum date among the subject

Re: Select maximum date among the subject

Re: Select maximum date among the subject

Re: Select maximum date among the subject

Ready to join fellow brilliant minds for the SAS Hackathon?