I am trying to look at different lines of data and count how many times a line falls within 30 days of previous lines for the same id and code. I have been trying to do this with a data step but haven't been able to figure it out and am open to any solutions.
Here is my starting data.
data have;
input id code date :date9.;
format date date9.;
datalines;
4350 1 '01JUN2021'd
4350 1 '13JUN2021'd
4456 3 '27JUL2021'd
4456 4 '27NOV2021'd
1234 1 '01JAN2021'd
1234 1 '10JAN2021'd
1234 1 '21JAN2021'd
;
run;
Here is what I'm trying to generate.
data want;
input id code date :date9. count;
format date date9.;
datalines;
4350 1 '01JUN2021'd 0
4350 1 '13JUN2021'd 1
4456 3 '27JUL2021'd 0
4456 4 '27NOV2021'd 0
1234 1 '01JAN2021'd 0
1234 1 '10JAN2021'd 1
1234 1 '21JAN2021'd 2
;
run;
So if I interpret you right, we need to keep a starting date for the group, and start a new group when we go past 30 days of that:
data want;
set have;
by id notsorted code;
retain start_date;
l_date = lag(date);
if first.code then start_date = date;
if not first.code and date - l_date le 30
then do;
if date - start_date le 30
then count + 1;
else do;
count = 1;
start_date = l_date;
end;
end;
else count = 0;
drop l_date start_date;
run;
In a data step, use BY processing and LAG to compare with the previous value.
data want;
set have;
by id notsorted code;
if not first.code and date - lag(date) le 30
then count + 1;
else count = 0;
run;
The SUM statement (count + 1) implies an automatic RETAIN.
This is very close, but it seems to "overcount" in some instances. See below modified example to demonstrate.
data have;
input id code date :date9.;
format date date9.;
datalines;
4350 1 '01JUN2021'd
4350 1 '13JUN2021'd
4456 3 '27JUL2021'd
4456 4 '27NOV2021'd
1234 1 '01JAN2021'd
1234 1 '10JAN2021'd
1234 1 '21JAN2021'd
1234 1 '15FEB2021'd
;
run;
Your code would produce this:
data code_output;
input id code date :date9. count;
format date date9.;
datalines;
4350 1 '01JUN2021'd 0
4350 1 '13JUN2021'd 1
4456 3 '27JUL2021'd 0
4456 4 '27NOV2021'd 0
1234 1 '01JAN2021'd 0
1234 1 '10JAN2021'd 1
1234 1 '21JAN2021'd 2
1234 1 '15FEB2021'd 3
;
run;
What I'm trying to get is this:
data want;
input id code date :date9. count;
format date date9.;
datalines;
4350 1 '01JUN2021'd 0
4350 1 '13JUN2021'd 1
4456 3 '27JUL2021'd 0
4456 4 '27NOV2021'd 0
1234 1 '01JAN2021'd 0
1234 1 '10JAN2021'd 1
1234 1 '21JAN2021'd 2
1234 1 '15FEB2021'd 1
;
run;
This is because the last line is only within 30 days of one of the other lines with matching codes and ids.
So if I interpret you right, we need to keep a starting date for the group, and start a new group when we go past 30 days of that:
data want;
set have;
by id notsorted code;
retain start_date;
l_date = lag(date);
if first.code then start_date = date;
if not first.code and date - l_date le 30
then do;
if date - start_date le 30
then count + 1;
else do;
count = 1;
start_date = l_date;
end;
end;
else count = 0;
drop l_date start_date;
run;
You can save yourself a bit of typing for the data lines in your code:
data have; input id code date :date9.; format date date9.; datalines; 4350 1 01JUN2021 4350 1 13JUN2021 4456 3 27JUL2021 4456 4 27NOV2021 1234 1 01JAN2021 1234 1 10JAN2021 1234 1 21JAN2021 ; run;
the quotes and d are needed to define a literal date in code and not needed, and potentially can cause errors when reading the data.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 16. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.