Re: Select Row with the max Value per group-Find efficient program

Ronein · Posted 03-20-2023 05:04 PM

Hello

I have a very big data set which contain 502,899,724 rows.

In the data set there are 90 columns.

Each customer ID (Field ID) appear in multiple rows.

Only 3 fields are interesting for me- ID, date,status.

The data set is not sorted.

The task is to create a new data set with one row per customer ID.

The criteria is to choose the row with the maximum date value.

I know 4 ways to do this task but the problem is that when I run it in real life with the big data set then the process run long time and cannot finish.

I would like to ask please If anyone can show me a clever way to deal with this problem.

I will appreciate If you can show a code that solve this issue.

Please note that the raw data set doesn't have Index and is not sorted

I also want to ask- from the 4 ways I showed here which way is most efficient with related to run time?


data have;
Input ID date : date9. status;
format date date9.;
cards;
1 19MAR2023 1
1 17MAR2023 0
1 18MAR2023 0
1 15MAR2023 1
1 14MAR2023 0
1 12MAR2023 0
2 11MAR2023 0
2 13MAR2023 0
2 15MAR2023 0
2 14MAR2023 0
2 12MAR2023 0
3 18MAR2023 1
3 17MAR2023 0
;
Run;



proc sort data=have;
by ID date;
Run;
data want_Way1;
set want_Way3;
by ID;
IF last.ID;
Run;


proc sql;
create table want_Way2 as
select a.*
from have as a
where  date=(select max(date)  FROM have as b Where b.ID=a.ID)
;
quit;


proc sql;
create table want_Way3 as
select a.*
from have as a
inner join
(select ID,
        max(date) as date
		From have 
		group by ID) as b
on a.ID=b.ID  and a.date=b.date
;
quit;


proc sql;
create table want_Way4 as
select a.*
from have  as a
LEFT  Join have as b
on a.ID=b.ID and a.date<b.date
where b.ID is null
;
quit;

have index.

Tom · Posted 03-20-2023 06:02 PM

Why do you have such a large dataset that is not sorted? It will be pretty hard to use for any analysis.

Why are you taking ALL of the variables if you only want 3 of them? That makes any of those programs take much longer than you need.

I would suggest first trying PROC SUMMARY and see if that works.

proc summary nway data=have(keep=id date status);
  class id; 
  output out=want(drop=_type_ _freq_) idgroup(max(date) out[1] (date status)=);
run;

It might complain if there are too many levels of ID to fit into memory.

If so then try a divide and conquer approach using OBS= and FIRSTOBS= dataset options.

proc summary nway data=have(keep=id date status firstobs=1 obs=1000000);
  class id; 
  output part1=want(drop=_type_ _freq_) idgroup(max(date) out[1] (date status)=);
run;
proc summary nway data=have(keep=id date status firstobs=1000001 obs=2000000);
  class id; 
  output part2=want(drop=_type_ _freq_) idgroup(max(date) out[1] (date status)=);
run;
...
data want;
  set part1-part503 ;
  by id date ;
  if last.id;
run;

ballardw · Posted 03-20-2023 06:03 PM

If you want the maximum date per id with that many observations I wouldn't touch anything with SQL.

You say "The task is to create a new data set with one row per customer ID." But do not describe anything about what is required for the third variable of interest status.

You do not say anything about how many unique values of ID you have but I suspect that you may have more than the Class statement in Proc Summary would accept.

You don't say a bit about where the thing failed. You may be running out of work space with the Sort step as unless you use some options you need about 3 times as much disk space as the set occupies to process and may well exceed available or allotted work space. If your job fails at the the Proc Sort then TAGSORT may help.

My approach would be :

Proc sort data=have tagsort;
   by id;
run;

proc summary data=have ;
   by id;
   var date;
   output out=want (drop=_:) max= ;
run;

If you want the STATUS variable value that is associated with the max date, not stated anywhere in your description then

proc summary data=have ;
   by id;
   var date;
   output out=want (drop=_:) max=
        maxid(date(status))=status;
run;

Ronein · Posted 03-20-2023 08:40 PM

The task is to find the most recent customer status so technically for each customer ID need to select the row with maximum date.(Please note that it cannot happen that for same customer ID have multiple rows with same date)

Ronein · Posted 03-20-2023 08:45 PM

Please note that I MUST use out statement in sort procedure because I am not able to sort the raw data set,

Will it also add running time?

ballardw · Posted 03-20-2023 09:39 PM

@Ronein wrote:

Please note that I MUST use out statement in sort procedure because I am not able to sort the raw data set,

Will it also add running time?

Shouldn't, but if you don't need all the variables just keep the ones you want in the output set.

Patrick · Posted 03-20-2023 10:26 PM

I'd expect the proc sort/data step to perform best because it only requires one sort of the data. Make sure to only keep the variables of interest. ...and I assume that compression is set as a default else add it as a dataset option.

proc sort data=have(keep=id date status) out=inter(compress=binary);
  by id date;
run;

IF you've got sufficient memory to hold one row per ID in-memory then using a data step hash will likely outperform any other approach as it doesn't require any physical sort of the data.

data _null_;
  if _n_=1 then
    do;
      if 0 then set have(keep=date rename=(date=_date));
      length _status 3;
      dcl hash h1();
      h1.defineKey('id');
      h1.defineData('id','_date','_status');
      h1.defineDone();
    end;

  set have end=_last;

  if h1.find() ne 0 then h1.add(key:id, data:id, data:date, data:status);
  else
    do;
      if _date<date then h1.replace(key:id, data:id, data:date, data:status);
    end;
  
  if _last then h1.output(dataset:'want(rename=(_date=date _status=status) compress=binary)');

run;

Ksharp · Posted 03-21-2023 07:43 AM

Same code with Patrick.

/*
Since you only care about three variables,
hope you have memory big enough to run Hash Table.
*/
data have;
Input ID date : date9. status;
format date date9.;
cards;
1 19MAR2023 1
1 17MAR2023 0
1 18MAR2023 0
1 15MAR2023 1
1 14MAR2023 0
1 12MAR2023 0
2 11MAR2023 0
2 13MAR2023 0
2 15MAR2023 0
2 14MAR2023 0
2 12MAR2023 0
3 18MAR2023 1
3 17MAR2023 0
;
Run;

data _null_;
if _n_=1 then do;
 if 0 then set have;
 declare hash h(hashexp:20);
 h.definekey('id');
 h.definedata('id','date','status');
 h.definedone();
end;
set have end=last;
_date=date;_status=status;
if h.find()=0 then do;if _date>date then h.replace(key:id,data:id,data:_date,data:_status);end;
 else h.add();
if last then h.output(dataset:'want');
run;

Registration is open

SAS Training: Just a Click Away