Group / clustering observations by date

cchubbard1963 — Fri, 22 May 2020 19:28:41 GMT

I have a large data set with this structure:

ID	DATE
16602	07/20/2015
16602	07/25/2015
16602	07/28/2015
20302	03/16/2016
20302	03/18/2016
20302	03/25/2016
20302	02/18/2015

I would like to define a new "clustering" variable, such that observations with the same ID value and which occur within 0 to 7 days of each other have the same value of the CLUSTER variable. If the difference is greater than 7 days, then the CLUSTER variable should increase by 1, a new cluster created, and the day count resets. The result I want for the above would look like this:

ID	DATE	CLUSTER
16602	07/20/2015	1
16602	07/25/2015	1
16602	07/28/2015	2
20302	03/16/2016	3
20302	03/18/2016	3
20302	03/25/2016	4
20302	02/18/2017	5

Any help would be greatly appreciated. Thanks.

Re: Group / clustering observations by date

mkeintz — Sat, 23 May 2020 00:49:17 GMT

This is a common request. You want to increment the cluster number whenever either of the following happens:

You begin a new ID.
You encounter a date more than 7 days after the starting date of the current cluster.

To do this in a SAS DATA step, you have to keep (i.e. "retain") the starting date of the current cluster so that it can be compared with each incoming date:

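data want (drop=startdate);
  set have;                /* have must already be sorted by id */
  by id;
  retain startdate;        /* carry the cluster start date across observations */
  if first.id=1 or date-7 > startdate then do;
    cluster+1;             /* sum statement: cluster is retained automatically */
    startdate=date;        /* this observation opens the new cluster */
  end;
run;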
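
To see why an observation lands in a given cluster, it can help to print the retained start date next to each row. The following is only a sketch built on the step above: it assumes the same have data set, sorted by id, and days_from_start is a name made up here for illustration. It writes to the log and creates no data set.

```sas
/* Diagnostic sketch: trace the clustering rule in the SAS log.  */
/* Assumes have is sorted by id. days_from_start is a            */
/* hypothetical helper variable, not part of the original step.  */
data _null_;
  set have;
  by id;
  retain startdate;
  if first.id=1 or date-7 > startdate then do;
    cluster+1;
    startdate=date;
  end;
  days_from_start = date - startdate;   /* 0 on the row that opens a cluster */
  put id= date= mmddyy10. startdate= mmddyy10. days_from_start= cluster=;
run;
```

For ID 16602, 07/28/2015 is only 3 days after 07/25/2015 but 8 days after the cluster start of 07/20/2015, so it opens a new cluster. The comparison is always against the first date of the cluster, not against the previous observation.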

Re: Group / clustering observations by date

LeonidBatkhan — Sat, 23 May 2020 02:43:57 GMT

I would do it similarly to what mkeintz suggested, with a couple of small but important enhancements: the data set needs to be sorted by ID and DATE, and the DATA step needs a matching by ID DATE; statement:

data HAVE;
   input ID $1-5 @7 DATE mmddyy10.;
   format DATE mmddyy10.;
   datalines;
16602 07/20/2015
16602 07/25/2015
16602 07/28/2015
20302 03/16/2016
20302 03/18/2016
20302 03/25/2016
20302 02/18/2015
;

proc sort data=HAVE;
   by ID DATE;
run;

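data WANT (drop=FIRSTDATE);
   set HAVE;
   by ID DATE;                /* stops with an error if HAVE is not sorted by ID and DATE */
   retain FIRSTDATE;          /* first date of the current cluster */
   if first.ID or DATE-FIRSTDATE>7 then
   do;
      FIRSTDATE = DATE;       /* start a new cluster at this observation */
      CLUSTER+1;              /* sum statement keeps CLUSTER across observations */
   end;
run;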
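
One way to sanity-check the result is to summarize each cluster and confirm that none spans more than 7 days. This is a sketch that assumes the WANT data set created above; the column names START_DATE, END_DATE, SPAN_DAYS and N_OBS are chosen here purely for illustration.

```sas
/* Sanity-check sketch: one summary row per cluster.           */
/* Assumes WANT exists with the variables ID, DATE and CLUSTER. */
proc sql;
   select ID,
          CLUSTER,
          min(DATE) as START_DATE format=mmddyy10.,
          max(DATE) as END_DATE   format=mmddyy10.,
          max(DATE) - min(DATE) as SPAN_DAYS,   /* days from first to last date */
          count(*) as N_OBS
   from WANT
   group by ID, CLUSTER
   order by CLUSTER;
quit;
```

Every row of the summary should show SPAN_DAYS between 0 and 7. A larger value would point to dates that were out of order within an ID when the clusters were assigned.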

Hope this helps.