BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
cchubbard1963
Calcite | Level 5

I have a large data set with this structure:

 

IDDATE
1660207/20/2015
1660207/25/2015
1660207/28/2015
2030203/16/2016
2030203/18/2016
2030203/25/2016
2030202/18/2015
 
I would like to define a new "clustering" variable, such that observations with the same ID value and which occur within 0 to 7 days of each other have the same value of the CLUSTER variable.  If the difference is greater than 7 days, then the CLUSTER variable should increase by 1, a new cluster created, and the day count resets.  The result I want for the above would look like this:

IDDATECLUSTER
1660207/20/20151
1660207/25/20151
1660207/28/20152
2030203/16/20163
2030203/18/20163
2030203/25/20164
2030202/18/20175
 
Any help would be greatly appreciated.  Thanks.
1 ACCEPTED SOLUTION

Accepted Solutions
LeonidBatkhan
Lapis Lazuli | Level 10

I would do it similar to what mkeintz suggested with a couple small but important enhancements. You need your data set to be sorted by ID and DATE, and also have by ID DATE; in the data step:

 

data HAVE;
   input ID $1-5 @7 DATE mmddyy10.;
   format DATE mmddyy10.;
   lines;
16602	07/20/2015
16602	07/25/2015
16602	07/28/2015
20302	03/16/2016
20302	03/18/2016
20302	03/25/2016
20302	02/18/2015
;

proc sort data=HAVE;
   by ID DATE;
run;

data WANT (drop=FIRSTDATE);
   set HAVE;
   by ID DATE;
   retain FIRSTDATE;
   if first.ID or DATE-FIRSTDATE>7 then
   do;
      FIRSTDATE = DATE;
      CLUSTER+1;
   end;
run;

Hope this helps.

 

View solution in original post

2 REPLIES 2
mkeintz
PROC Star

This is a common request.  You want to increment the cluster number whenever

  1. You begin a new id
  2. You encounter a date more than 7 days after the starting date of the previous cluster.

To do this in a sas DATA step, you have to keep (i.e. "retain") the starting date of the current cluster, to be compared to the incoming date:

 

data want (drop=startdate);
  set have; 
  by id;
  retain startdate;
  if first.id=1 or date-7 > startdate then do;
    cluster+1;
    startdate=date;
  end;
run;

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
LeonidBatkhan
Lapis Lazuli | Level 10

I would do it similar to what mkeintz suggested with a couple small but important enhancements. You need your data set to be sorted by ID and DATE, and also have by ID DATE; in the data step:

 

data HAVE;
   input ID $1-5 @7 DATE mmddyy10.;
   format DATE mmddyy10.;
   lines;
16602	07/20/2015
16602	07/25/2015
16602	07/28/2015
20302	03/16/2016
20302	03/18/2016
20302	03/25/2016
20302	02/18/2015
;

proc sort data=HAVE;
   by ID DATE;
run;

data WANT (drop=FIRSTDATE);
   set HAVE;
   by ID DATE;
   retain FIRSTDATE;
   if first.ID or DATE-FIRSTDATE>7 then
   do;
      FIRSTDATE = DATE;
      CLUSTER+1;
   end;
run;

Hope this helps.

 

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 2 replies
  • 441 views
  • 0 likes
  • 3 in conversation