BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
cchubbard1963
Calcite | Level 5

I have a large data set with this structure:

 

IDDATE
1660207/20/2015
1660207/25/2015
1660207/28/2015
2030203/16/2016
2030203/18/2016
2030203/25/2016
2030202/18/2015
 
I would like to define a new "clustering" variable, such that observations with the same ID value and which occur within 0 to 7 days of each other have the same value of the CLUSTER variable.  If the difference is greater than 7 days, then the CLUSTER variable should increase by 1, a new cluster created, and the day count resets.  The result I want for the above would look like this:

IDDATECLUSTER
1660207/20/20151
1660207/25/20151
1660207/28/20152
2030203/16/20163
2030203/18/20163
2030203/25/20164
2030202/18/20175
 
Any help would be greatly appreciated.  Thanks.
1 ACCEPTED SOLUTION

Accepted Solutions
LeonidBatkhan
Lapis Lazuli | Level 10

I would do it similar to what mkeintz suggested with a couple small but important enhancements. You need your data set to be sorted by ID and DATE, and also have by ID DATE; in the data step:

 

data HAVE;
   input ID $1-5 @7 DATE mmddyy10.;
   format DATE mmddyy10.;
   lines;
16602	07/20/2015
16602	07/25/2015
16602	07/28/2015
20302	03/16/2016
20302	03/18/2016
20302	03/25/2016
20302	02/18/2015
;

proc sort data=HAVE;
   by ID DATE;
run;

data WANT (drop=FIRSTDATE);
   set HAVE;
   by ID DATE;
   retain FIRSTDATE;
   if first.ID or DATE-FIRSTDATE>7 then
   do;
      FIRSTDATE = DATE;
      CLUSTER+1;
   end;
run;

Hope this helps.

 

View solution in original post

2 REPLIES 2
mkeintz
PROC Star

This is a common request.  You want to increment the cluster number whenever

  1. You begin a new id
  2. You encounter a date more than 7 days after the starting date of the previous cluster.

To do this in a sas DATA step, you have to keep (i.e. "retain") the starting date of the current cluster, to be compared to the incoming date:

 

data want (drop=startdate);
  set have; 
  by id;
  retain startdate;
  if first.id=1 or date-7 > startdate then do;
    cluster+1;
    startdate=date;
  end;
run;

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
LeonidBatkhan
Lapis Lazuli | Level 10

I would do it similar to what mkeintz suggested with a couple small but important enhancements. You need your data set to be sorted by ID and DATE, and also have by ID DATE; in the data step:

 

data HAVE;
   input ID $1-5 @7 DATE mmddyy10.;
   format DATE mmddyy10.;
   lines;
16602	07/20/2015
16602	07/25/2015
16602	07/28/2015
20302	03/16/2016
20302	03/18/2016
20302	03/25/2016
20302	02/18/2015
;

proc sort data=HAVE;
   by ID DATE;
run;

data WANT (drop=FIRSTDATE);
   set HAVE;
   by ID DATE;
   retain FIRSTDATE;
   if first.ID or DATE-FIRSTDATE>7 then
   do;
      FIRSTDATE = DATE;
      CLUSTER+1;
   end;
run;

Hope this helps.

 

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 2 replies
  • 936 views
  • 0 likes
  • 3 in conversation