I am trying to look at different lines of data and count how many times a line falls within 30 days of previous lines for the same id and code. I have been trying to do this with a data step but haven't been able to figure it out and am open to any solutions.
Here is my starting data.
data have;
input id code date :date9.;
format date date9.;
datalines;
4350 1 '01JUN2021'd
4350 1 '13JUN2021'd
4456 3 '27JUL2021'd
4456 4 '27NOV2021'd
1234 1 '01JAN2021'd
1234 1 '10JAN2021'd
1234 1 '21JAN2021'd
;
run;
Here is what I'm trying to generate.
data want;
input id code date :date9. count;
format date date9.;
datalines;
4350 1 '01JUN2021'd 0
4350 1 '13JUN2021'd 1
4456 3 '27JUL2021'd 0
4456 4 '27NOV2021'd 0
1234 1 '01JAN2021'd 0
1234 1 '10JAN2021'd 1
1234 1 '21JAN2021'd 2
;
run;
So if I interpret you right, we need to keep a starting date for the group, and start a new group when we go past 30 days of that:
data want;
set have;
by id notsorted code;
retain start_date;
l_date = lag(date);
if first.code then start_date = date;
if not first.code and date - l_date le 30
then do;
if date - start_date le 30
then count + 1;
else do;
count = 1;
start_date = l_date;
end;
end;
else count = 0;
drop l_date start_date;
run;
In a data step, use BY processing and LAG to compare with the previous value.
data want;
set have;
by id notsorted code;
if not first.code and date - lag(date) le 30
then count + 1;
else count = 0;
run;
The SUM statement (count + 1) implies an automatic RETAIN.
This is very close, but it seems to "overcount" in some instances. See below modified example to demonstrate.
data have;
input id code date :date9.;
format date date9.;
datalines;
4350 1 '01JUN2021'd
4350 1 '13JUN2021'd
4456 3 '27JUL2021'd
4456 4 '27NOV2021'd
1234 1 '01JAN2021'd
1234 1 '10JAN2021'd
1234 1 '21JAN2021'd
1234 1 '15FEB2021'd
;
run;
Your code would produce this:
data code_output;
input id code date :date9. count;
format date date9.;
datalines;
4350 1 '01JUN2021'd 0
4350 1 '13JUN2021'd 1
4456 3 '27JUL2021'd 0
4456 4 '27NOV2021'd 0
1234 1 '01JAN2021'd 0
1234 1 '10JAN2021'd 1
1234 1 '21JAN2021'd 2
1234 1 '15FEB2021'd 3
;
run;
What I'm trying to get is this:
data want;
input id code date :date9. count;
format date date9.;
datalines;
4350 1 '01JUN2021'd 0
4350 1 '13JUN2021'd 1
4456 3 '27JUL2021'd 0
4456 4 '27NOV2021'd 0
1234 1 '01JAN2021'd 0
1234 1 '10JAN2021'd 1
1234 1 '21JAN2021'd 2
1234 1 '15FEB2021'd 1
;
run;
This is because the last line is only within 30 days of one of the other lines with matching codes and ids.
So if I interpret you right, we need to keep a starting date for the group, and start a new group when we go past 30 days of that:
data want;
set have;
by id notsorted code;
retain start_date;
l_date = lag(date);
if first.code then start_date = date;
if not first.code and date - l_date le 30
then do;
if date - start_date le 30
then count + 1;
else do;
count = 1;
start_date = l_date;
end;
end;
else count = 0;
drop l_date start_date;
run;
You can save yourself a bit of typing for the data lines in your code:
data have; input id code date :date9.; format date date9.; datalines; 4350 1 01JUN2021 4350 1 13JUN2021 4456 3 27JUL2021 4456 4 27NOV2021 1234 1 01JAN2021 1234 1 10JAN2021 1234 1 21JAN2021 ; run;
the quotes and d are needed to define a literal date in code and not needed, and potentially can cause errors when reading the data.
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.