I need to compare multiple dates + inclusion/exclusion and hoping i can have a shorter code without hardcoding
Please run the code below to see the expected result
data dataset1; input id $1-3 start_dt 4-12 m1_dt 13-21 dt1 22-30 dt1_code $31-34 dt2 35-43 dt2_code $44-47 dt3 48-56 dt3_code $57-60 dt4 61-69 dt4_code $70-73; datalines; 911 20200401 20200410 20200408 AAA 20200426 AAA 20200504 EEE 20200603 EEE 912 20200401 20200411 20200409 AAA 20200426 AAA 20200506 AAA 20200508 BBB 913 20200401 20200409 20200408 BBB 20200426 CCC 20200504 EEE 20200506 EEE 914 20200401 20200423 20200421 BBB 20200422 CCC 915 20200401 20200410 20200408 DDD 20200426 DDD 916 20200401 20200410 20200408 DDD 917 20200401 20200410 20200408 AAA 918 20200401 20200410 20200408 CCC ; run; proc sql; create table dataset2 as select distinct id, input(put(start_dt,8.),yymmdd8.) as start_dt format mmddyy10., input(put(m1_dt,8.),yymmdd8.) as m1_dt format mmddyy10., input(put(dt1,8.),yymmdd8.) as dt1 format mmddyy10., dt1_code, input(put(dt2,8.),yymmdd8.) as dt2 format mmddyy10., dt2_code, input(put(dt3,8.),yymmdd8.) as dt3 format mmddyy10., dt3_code, input(put(dt4,8.),yymmdd8.) as dt4 format mmddyy10., dt4_code from dataset1; quit; %let exlude=('DDD','EEE','FFF'); /*expected result*/ proc sql; create table expected_result as select distinct a.* ,case when ((m1_dt is not null and (m1_dt-start_dt)<=30) and (((dt1 is not null and (dt1-start_dt)<=30) and (dt2 is not null and (dt2-start_dt)<=30) and (dt3 is not null and (dt3-start_dt)<=30) and (dt4 is not null and (dt4-start_dt)<=30)) or ((dt1 is not null and (dt1-start_dt)<=30) and (dt2 is not null and (dt2-start_dt)<=30) and (dt3 is not null and (dt3-start_dt)<=30)) or ((dt1 is not null and (dt1-start_dt)<=30) and (dt2 is not null and (dt2-start_dt)<=30))) and (dt1_code not in &exlude. and dt2_code not in &exlude. and dt3_code not in &exlude. and dt4_code not in &exlude.)) then 'yes with 1m and at least 2 dt within 30 days' when ((m1_dt is not null and (m1_dt-start_dt)<=30) and (((dt1 is not null and (dt1-start_dt)<=30) and (dt2 is not null and (dt2-start_dt)<=30) and (dt3 is not null and (dt3-start_dt)<=30) and (dt4 is not null and (dt4-start_dt)<=30)) or ((dt1 is not null and (dt1-start_dt)<=30) and (dt2 is not null and (dt2-start_dt)<=30) and (dt3 is not null and (dt3-start_dt)<=30)) or ((dt1 is not null and (dt1-start_dt)<=30) and (dt2 is not null and (dt2-start_dt)<=30))) and (dt1_code in &exlude. or dt2_code in &exlude. or dt3_code in &exlude. or dt4_code in &exlude.)) then 'yes with exclusion code + 1m and at least 2 dt within 30 days' when (dt1_code in &exlude. or dt2_code in &exlude. or dt3_code in &exlude. or dt4_code in &exlude.) then 'no with exclusion code' when (dt1_code not in &exlude. and dt2_code not in &exlude. and dt3_code not in &exlude. and dt4_code not in &exlude.) then 'no without exclusion code' else 'no' end as identifier from dataset2 a;quit;
First question would be why are your "date" variables not already date values?
What does m1_dt, dt1 and dt2 (etc) names actually mean. It may give us a clue to better file structure.
You really don't provide much in the way of what seems "right" for the data but this may give you a clue to one way to get a count of days less within 30:
data example; input a x y z; count= sum( 0 le x-a le 30,0 le y-a le 30,0 le z-a le 30); datalines; 25 29 45 60 25 . 45 60 25 . . 60 25 69 65 60 25 29 . 60 ;
This sort of works because each comparison will return a value of 1 or 0. Then sums those.
Comment on "within": Typically I would expect an ABSOLUTE value for something like (x-y) to get the actual interval. You are only testing that startdate is less than 30 days prior to dt. So I is there something in your data you have not stated that says dt1 and such are only larger than startdate when present???
Since SAS date values are integer numbers of days you can test if a group of variables "are within 30 days" with the Range function:
if range(dt1,dt2,dt3,dt4) le 30 then <whatever>. This will ignore missing values .
The range function is the result of the largest value subtracting the smallest.
Your exclusion and identifier comments aren't very clear.
I also don't see anything in this that really suggests SQL is a better solution than a data step.
I have to admit, I'm not going to dig through this entire mass of code to see where it can be simplified.
I will throw out my gut feeling that if you do this in a DATA step rather than SQL, then you have the feature called ARRAY and then it doesn't matter how many dtxxx variables you have, you can handle them all in an ARRAY, you program it one time and loop through. No hardcoding. You also can use the WHICHC function to search across all the dtxxx variables.
Another thought is always Maxim 19, "Long beats Wide", which means instead of having dt1 through dt4 on one row, you have separate rows with an indetifier 1 through xxx, and a variable dt. This can greatly simplify your programming and again lets you avoid hard-coding stuff. There are many many examples of Long beats Wide here in the SAS Communities.
Here's a tutorial on using Arrays in SAS
https://stats.idre.ucla.edu/sas/seminars/sas-arrays/
Definitely need arrays and a data step.
If you had a long format then SQL would actually work better but you don't so in this case, a data step would be more efficient.
First question would be why are your "date" variables not already date values?
What does m1_dt, dt1 and dt2 (etc) names actually mean. It may give us a clue to better file structure.
You really don't provide much in the way of what seems "right" for the data but this may give you a clue to one way to get a count of days less within 30:
data example; input a x y z; count= sum( 0 le x-a le 30,0 le y-a le 30,0 le z-a le 30); datalines; 25 29 45 60 25 . 45 60 25 . . 60 25 69 65 60 25 29 . 60 ;
This sort of works because each comparison will return a value of 1 or 0. Then sums those.
Comment on "within": Typically I would expect an ABSOLUTE value for something like (x-y) to get the actual interval. You are only testing that startdate is less than 30 days prior to dt. So I is there something in your data you have not stated that says dt1 and such are only larger than startdate when present???
Since SAS date values are integer numbers of days you can test if a group of variables "are within 30 days" with the Range function:
if range(dt1,dt2,dt3,dt4) le 30 then <whatever>. This will ignore missing values .
The range function is the result of the largest value subtracting the smallest.
Your exclusion and identifier comments aren't very clear.
I also don't see anything in this that really suggests SQL is a better solution than a data step.
So as long as the ID has one m1_dt and at least two dt# within 30 days after the start_dt.
I'm using the exclusion list ('DDD','EEE','FFF') to separate or flag the ids "see the identifier column below"
Expected result:
Below code returns the same result than your desired SQL.
I've converted your case statements to data step logic - and while doing this realized that due to the OR conditions you actually only need to check the day differences for variables dt1 and dt2. ...not sure if that's what you've intended to do - but it's not changing the logic you've provided.
data have;
infile datalines dsd dlm=' ' truncover;
input
@1 id $3.
@5 start_dt yymmdd8.
@14 m1_dt yymmdd8.
@23 dt1 yymmdd8.
@32 dt1_code $3.
@36 dt2 yymmdd8.
@45 dt2_code $3.
@49 dt3 yymmdd8.
@58 dt3_code $3.
@62 dt4 yymmdd8.
@71 dt4_code $3.
;
format start_dt m1_dt dt1-dt4 date9.;
datalines;
911 20200401 20200410 20200408 AAA 20200426 AAA 20200504 EEE 20200603 EEE
912 20200401 20200411 20200409 AAA 20200426 AAA 20200506 AAA 20200508 BBB
913 20200401 20200409 20200408 BBB 20200426 CCC 20200504 EEE 20200506 EEE
914 20200401 20200423 20200421 BBB 20200422 CCC
915 20200401 20200410 20200408 DDD 20200426 DDD
916 20200401 20200410 20200408 DDD
917 20200401 20200410 20200408 AAA
918 20200401 20200410 20200408 CCC
;
proc format;
value case
1='yes with 1m and at least 2 dt within 30 days'
2='yes with exclusion code + 1m and at least 2 dt within 30 days'
3='no with exclusion code'
4='no without exclusion code'
5='no'
;
run;
%let exclude=('DDD','EEE','FFF');
data want;
set have;
_dt_code_flg= (
dt1_code in &exclude or
dt2_code in &exclude or
dt3_code in &exclude or
dt4_code in &exclude
);
if
.<m1_dt-start_dt<=30 and
(
/* .<dt1-start_dt<=30 and*/
/* .<dt2-start_dt<=30 and*/
/* .<dt3-start_dt<=30 and*/
/* .<dt4-start_dt<=30 */
/* or*/
/* .<dt1-start_dt<=30 and*/
/* .<dt2-start_dt<=30 and*/
/* .<dt3-start_dt<=30 */
/* or*/
.<dt1-start_dt<=30 and
.<dt2-start_dt<=30
)
then
do;
if _dt_code_flg=0 then case=1;
else
if _dt_code_flg=1 then case=2;
end;
else
if _dt_code_flg=1 then case=3;
else
if _dt_code_flg=0 then case=4;
else
case=5;
identifier=put(case,case. -l);
drop case _dt_code_flg;
run;
Thank you. This is so helpful
1. Why not use the informat when reading the cards data rather than create dataset2?
2. This
(((dt1 is not null and (dt1-start_dt)<=30) and (dt2 is not null and (dt2-start_dt)<=30) and (dt3 is not null and (dt3-start_dt)<=30) and (dt4 is not null and (dt4-start_dt)<=30)) or
((dt1 is not null and (dt1-start_dt)<=30) and (dt2 is not null and (dt2-start_dt)<=30) and (dt3 is not null and (dt3-start_dt)<=30)) or
((dt1 is not null and (dt1-start_dt)<=30) and (dt2 is not null and (dt2-start_dt)<=30))
)
is the same as
((dt1 is not null and (dt1-start_dt)<=30) and (dt2 is not null and (dt2-start_dt)<=30))
3. Not identical, but you might want to write
. < dt1-start_dt <=30
instead of
dt1 is not null and (dt1-start_dt)<=30)
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.