Solved: Group / clustering observations by date

cchubbard1963 · Posted 05-22-2020 03:28 PM

I have a large data set with this structure:

ID	DATE
16602	07/20/2015
16602	07/25/2015
16602	07/28/2015
20302	03/16/2016
20302	03/18/2016
20302	03/25/2016
20302	02/18/2015

I would like to define a new "clustering" variable, such that observations with the same ID value that occur within 0 to 7 days of each other have the same value of the CLUSTER variable. If the difference is greater than 7 days, the CLUSTER variable should increase by 1, starting a new cluster and resetting the day count. The result I want for the above would look like this:

ID	DATE	CLUSTER
16602	07/20/2015	1
16602	07/25/2015	1
16602	07/28/2015	2
20302	03/16/2016	3
20302	03/18/2016	3
20302	03/25/2016	4
20302	02/18/2017	5

Any help would be greatly appreciated. Thanks.

LeonidBatkhan · Posted 05-22-2020 10:41 PM

I would do it similarly to what mkeintz suggested, with a couple of small but important enhancements: your data set needs to be sorted by ID and DATE, and the data step should also have a by ID DATE; statement:

data HAVE;
   input ID $1-5 @7 DATE mmddyy10.;
   format DATE mmddyy10.;
   lines;
16602	07/20/2015
16602	07/25/2015
16602	07/28/2015
20302	03/16/2016
20302	03/18/2016
20302	03/25/2016
20302	02/18/2015
;

proc sort data=HAVE;
   by ID DATE;
run;

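data WANT (drop=FIRSTDATE);
   set HAVE;
   by ID DATE;
   retain FIRSTDATE;   /* first date of the current cluster, kept across rows */
   /* start a new cluster on a new ID or when more than 7 days past FIRSTDATE */
   if first.ID or DATE-FIRSTDATE>7 then
   do;
      FIRSTDATE = DATE;
      CLUSTER+1;       /* sum statement: CLUSTER starts at 0 and is retained */
   end;
run;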

Hope this helps.
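
As a quick sanity check, here is a minimal sketch that summarizes each cluster and its span in days. To keep it self-contained it reads the expected result from the question into a stand-in data set; the names WANT_DEMO, CLUSTER_SUMMARY, FIRST_DATE, LAST_DATE, N_OBS and SPAN_DAYS are made up for illustration and are not part of the original code. In practice you would point the query at WANT instead.

```sas
/* Stand-in data: the expected result copied from the question */
data WANT_DEMO;
   input ID $ DATE :mmddyy10. CLUSTER;
   format DATE mmddyy10.;
   datalines;
16602 07/20/2015 1
16602 07/25/2015 1
16602 07/28/2015 2
20302 03/16/2016 3
20302 03/18/2016 3
20302 03/25/2016 4
20302 02/18/2017 5
;

/* One row per cluster: first and last date, row count, span in days */
proc sql;
   create table CLUSTER_SUMMARY as
   select ID,
          CLUSTER,
          min(DATE) as FIRST_DATE format=mmddyy10.,
          max(DATE) as LAST_DATE  format=mmddyy10.,
          count(*)  as N_OBS,
          max(DATE) - min(DATE) as SPAN_DAYS
   from WANT_DEMO
   group by ID, CLUSTER
   order by CLUSTER;
quit;

proc print data=CLUSTER_SUMMARY noobs;
run;
```

If the clustering worked, SPAN_DAYS should never exceed 7 for any cluster, and no CLUSTER value should appear under more than one ID.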


mkeintz · Posted 05-22-2020 08:47 PM

This is a common request. You want to increment the cluster number whenever either of these happens:

1. You begin a new id.
2. You encounter a date more than 7 days after the starting date of the current cluster.

To do this in a SAS DATA step, you have to keep (i.e. "retain") the starting date of the current cluster, so it can be compared to each incoming date:

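data want (drop=startdate);
  set have;              /* assumed already sorted by id and date */
  by id;
  retain startdate;      /* starting date of the current cluster */
  if first.id=1 or date-7 > startdate then do;
    cluster+1;           /* new id, or more than 7 days past startdate */
    startdate=date;
  end;
run;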
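
To illustrate why the retained value has to be the cluster's starting date rather than the previous row's date, here is a small sketch that computes both numberings side by side on the first ID from the question. The names demo, compare, prevdate, cluster_start and cluster_gap are made up for this illustration. The second rule (comparing with the previous observation) is a different technique from the one requested and is shown only for contrast.

```sas
data demo;
   input id $ date :mmddyy10.;
   format date mmddyy10.;
   datalines;
16602 07/20/2015
16602 07/25/2015
16602 07/28/2015
;

data compare (drop=startdate prevdate);
   set demo;
   by id;
   retain startdate prevdate;
   /* Rule from the question: compare with the start date of the current cluster */
   if first.id or date - startdate > 7 then do;
      cluster_start + 1;
      startdate = date;
   end;
   /* Different rule, for contrast only: compare with the previous row's date */
   if first.id or date - prevdate > 7 then cluster_gap + 1;
   prevdate = date;
run;

proc print data=compare noobs;
run;
```

On these three rows cluster_start comes out as 1, 1, 2 (07/28/2015 is 8 days after 07/20/2015), which matches the requested result, while cluster_gap stays at 1, 1, 1 because no two consecutive dates are more than 7 days apart.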


