BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
annaleticia
Fluorite | Level 6

I need to compare multiple dates + inclusion/exclusion and hoping i can have a shorter code without hardcoding

  • Must have one m1_dt and at least two dates "dt#" within 30 days after the start_dt
  • Must compare all date variables (dt1 to dt4) and identify if the dates were within 30 days
  • Exclusion Code = 'DDD','EEE','FFF
  • Identifiers 
    1. yes with exlusion code + 1m and at least 2 dt within 30 days
    2. yes with 1m and at least 2 dt within 30 days
    3. no with exclusion code
    4. no without exclusion code

Please run the code below to see the expected result

data dataset1;
input id $1-3 start_dt 4-12 m1_dt 13-21 dt1 22-30 dt1_code $31-34 dt2 35-43 dt2_code $44-47 dt3 48-56 dt3_code $57-60 dt4 61-69 dt4_code $70-73;
datalines;
911 20200401 20200410 20200408 AAA 20200426	AAA 20200504 EEE 20200603 EEE
912	20200401 20200411 20200409 AAA 20200426	AAA	20200506 AAA 20200508 BBB
913	20200401 20200409 20200408 BBB 20200426	CCC	20200504 EEE 20200506 EEE
914	20200401 20200423 20200421 BBB 20200422	CCC
915	20200401 20200410 20200408 DDD 20200426	DDD
916	20200401 20200410 20200408 DDD
917	20200401 20200410 20200408 AAA
918	20200401 20200410 20200408 CCC
;
run;

proc sql;
create table dataset2 as select distinct
id,		
input(put(start_dt,8.),yymmdd8.) as start_dt format mmddyy10.,	
input(put(m1_dt,8.),yymmdd8.) as m1_dt format mmddyy10.,
input(put(dt1,8.),yymmdd8.) as dt1 format mmddyy10.,	
dt1_code,
input(put(dt2,8.),yymmdd8.) as dt2 format mmddyy10.,		
dt2_code,	
input(put(dt3,8.),yymmdd8.) as dt3 format mmddyy10.,	
dt3_code,	
input(put(dt4,8.),yymmdd8.) as dt4 format mmddyy10.,
dt4_code
from dataset1; quit;

%let exlude=('DDD','EEE','FFF');

/*expected result*/
proc sql;
create table expected_result as select distinct
a.*
,case when ((m1_dt is not null and (m1_dt-start_dt)<=30) 
and (((dt1 is not null and (dt1-start_dt)<=30) and (dt2 is not null and (dt2-start_dt)<=30) and 
	(dt3 is not null and (dt3-start_dt)<=30) and (dt4 is not null and (dt4-start_dt)<=30)) or 

	((dt1 is not null and (dt1-start_dt)<=30) and (dt2 is not null and (dt2-start_dt)<=30) and 
	(dt3 is not null and (dt3-start_dt)<=30)) or 

	((dt1 is not null and (dt1-start_dt)<=30) and (dt2 is not null and (dt2-start_dt)<=30)))
and (dt1_code not in &exlude. and dt2_code not in &exlude. and dt3_code not in &exlude. and dt4_code not in &exlude.)) 
then 'yes with 1m and at least 2 dt within 30 days' 

when ((m1_dt is not null and (m1_dt-start_dt)<=30) 
and (((dt1 is not null and (dt1-start_dt)<=30) and (dt2 is not null and (dt2-start_dt)<=30) and 
	(dt3 is not null and (dt3-start_dt)<=30) and (dt4 is not null and (dt4-start_dt)<=30)) or 

	((dt1 is not null and (dt1-start_dt)<=30) and (dt2 is not null and (dt2-start_dt)<=30) and 
	(dt3 is not null and (dt3-start_dt)<=30)) or 

	((dt1 is not null and (dt1-start_dt)<=30) and (dt2 is not null and (dt2-start_dt)<=30)))
and (dt1_code in &exlude. or dt2_code in &exlude. or dt3_code in &exlude. or dt4_code in &exlude.)) 
then 'yes with exclusion code + 1m and at least 2 dt within 30 days'

when (dt1_code in &exlude. or dt2_code in &exlude. or dt3_code in &exlude. or dt4_code in &exlude.) 
then 'no with exclusion code'

when (dt1_code not in &exlude. and dt2_code not in &exlude. and dt3_code not in &exlude. and dt4_code not in &exlude.) 
then 'no without exclusion code'
else 'no' end as identifier
from dataset2 a;quit;

 

1 ACCEPTED SOLUTION

Accepted Solutions
ballardw
Super User

First question would be why are your "date" variables not already date values?

 

What does m1_dt, dt1 and dt2 (etc) names actually mean. It may give us a clue to better file structure.

 

You really don't provide much in the way of what seems "right" for the data but this may give you a clue to one way to get a count of days less within 30:

data example;
   input a x y z;
   count= sum( 0 le x-a le 30,0 le y-a le 30,0 le z-a le 30);
datalines;
25 29 45 60
25 .  45 60
25 .  .  60
25 69 65 60
25 29 .  60
;

This sort of works because each comparison will return a value of 1 or 0. Then sums those.

Comment on "within": Typically I would expect an ABSOLUTE value for something like (x-y) to get the actual interval. You are only testing that startdate is less than 30 days prior to dt. So I is there something in your data you have not stated that says dt1 and such are only larger than startdate when present???

 

Since SAS date values are integer numbers of days you can test if a group of variables "are within 30 days" with the Range function:

 

if range(dt1,dt2,dt3,dt4) le 30 then <whatever>. This will ignore missing values .

The range function is the result of the largest value subtracting the smallest.

 

Your exclusion and identifier comments aren't very clear.

 

I also don't see anything in this that really suggests SQL is a better solution than a data step.

View solution in original post

7 REPLIES 7
PaigeMiller
Diamond | Level 26

I have to admit, I'm not going to dig through this entire mass of code to see where it can be simplified.

 

I will throw out my gut feeling that if you do this in a DATA step rather than SQL, then you have the feature called ARRAY and then it doesn't matter how many dtxxx variables you have, you can handle them all in an ARRAY, you program it one time and loop through. No hardcoding. You also can use the WHICHC function to search across all the dtxxx variables.

 

Another thought is always Maxim 19, "Long beats Wide", which means instead of having dt1 through dt4 on one row, you have separate rows with an indetifier 1 through xxx, and a variable dt. This can greatly simplify your programming and again lets you avoid hard-coding stuff. There are many many examples of Long beats Wide here in the SAS Communities.

--
Paige Miller
Reeza
Super User

Here's a tutorial on using Arrays in SAS
https://stats.idre.ucla.edu/sas/seminars/sas-arrays/

 

Definitely need arrays and a data step. 

 

If you had a long format then SQL would actually work better but you don't so in this case, a data step would be more efficient.

ballardw
Super User

First question would be why are your "date" variables not already date values?

 

What does m1_dt, dt1 and dt2 (etc) names actually mean. It may give us a clue to better file structure.

 

You really don't provide much in the way of what seems "right" for the data but this may give you a clue to one way to get a count of days less within 30:

data example;
   input a x y z;
   count= sum( 0 le x-a le 30,0 le y-a le 30,0 le z-a le 30);
datalines;
25 29 45 60
25 .  45 60
25 .  .  60
25 69 65 60
25 29 .  60
;

This sort of works because each comparison will return a value of 1 or 0. Then sums those.

Comment on "within": Typically I would expect an ABSOLUTE value for something like (x-y) to get the actual interval. You are only testing that startdate is less than 30 days prior to dt. So I is there something in your data you have not stated that says dt1 and such are only larger than startdate when present???

 

Since SAS date values are integer numbers of days you can test if a group of variables "are within 30 days" with the Range function:

 

if range(dt1,dt2,dt3,dt4) le 30 then <whatever>. This will ignore missing values .

The range function is the result of the largest value subtracting the smallest.

 

Your exclusion and identifier comments aren't very clear.

 

I also don't see anything in this that really suggests SQL is a better solution than a data step.

annaleticia
Fluorite | Level 6

So as long as the ID has one m1_dt  and at least two dt# within 30 days after the start_dt.

I'm using the exclusion list  ('DDD','EEE','FFF') to separate or flag the ids "see the identifier column below"

  1. yes with exclusion code + 1m and at least 2 dt within 30 days
  2. yes with 1m and at least 2 dt within 30 days
  3. no with exclusion code
  4. no without exclusion code

Expected result:

annaleticia_1-1623960404424.png

 

 

 

Patrick
Opal | Level 21

Below code returns the same result than your desired SQL.

I've converted your case statements to data step logic - and while doing this realized that due to the OR conditions you actually only need to check the day differences for variables dt1 and dt2. ...not sure if that's what you've intended to do - but it's not changing the logic you've provided.

data have;
  infile datalines dsd dlm=' ' truncover;
  input 
    @1 id $3. 
    @5 start_dt yymmdd8. 
    @14 m1_dt yymmdd8.
    @23 dt1 yymmdd8.
    @32 dt1_code $3. 
    @36 dt2 yymmdd8.
    @45 dt2_code $3. 
    @49 dt3 yymmdd8.
    @58 dt3_code $3.
    @62 dt4 yymmdd8.
    @71 dt4_code $3.
    ;
  format start_dt m1_dt dt1-dt4 date9.;
  datalines;
911 20200401 20200410 20200408 AAA 20200426 AAA 20200504 EEE 20200603 EEE
912 20200401 20200411 20200409 AAA 20200426 AAA 20200506 AAA 20200508 BBB
913 20200401 20200409 20200408 BBB 20200426 CCC 20200504 EEE 20200506 EEE
914 20200401 20200423 20200421 BBB 20200422 CCC
915 20200401 20200410 20200408 DDD 20200426 DDD
916 20200401 20200410 20200408 DDD
917 20200401 20200410 20200408 AAA
918 20200401 20200410 20200408 CCC
;

proc format;
  value case
    1='yes with 1m and at least 2 dt within 30 days'
    2='yes with exclusion code + 1m and at least 2 dt within 30 days'
    3='no with exclusion code'
    4='no without exclusion code'
    5='no'
    ;
run;

%let exclude=('DDD','EEE','FFF');
data want;
  set have;
  _dt_code_flg= (
                  dt1_code in &exclude or
                  dt2_code in &exclude or
                  dt3_code in &exclude or
                  dt4_code in &exclude 
                );

  if 
    .<m1_dt-start_dt<=30  and
    (
/*      .<dt1-start_dt<=30    and*/
/*      .<dt2-start_dt<=30    and*/
/*      .<dt3-start_dt<=30    and*/
/*      .<dt4-start_dt<=30   */
/*      or*/
/*      .<dt1-start_dt<=30    and*/
/*      .<dt2-start_dt<=30    and*/
/*      .<dt3-start_dt<=30   */
/*      or*/
      .<dt1-start_dt<=30    and
      .<dt2-start_dt<=30    
    )
    then
    do;
      if _dt_code_flg=0 then case=1;
      else
      if _dt_code_flg=1 then case=2;
    end;

  else
  if _dt_code_flg=1 then case=3;

  else
  if _dt_code_flg=0 then case=4;

  else
    case=5;

  identifier=put(case,case. -l);
  drop case _dt_code_flg;
run;
annaleticia
Fluorite | Level 6

Thank you. This is so helpful

ChrisNZ
Tourmaline | Level 20

1. Why not use the informat when reading the cards data rather than create dataset2?

 

2. This

(((dt1 is not null and (dt1-start_dt)<=30) and (dt2 is not null and (dt2-start_dt)<=30) and (dt3 is not null and (dt3-start_dt)<=30) and (dt4 is not null and (dt4-start_dt)<=30)) or
((dt1 is not null and (dt1-start_dt)<=30) and (dt2 is not null and (dt2-start_dt)<=30) and (dt3 is not null and (dt3-start_dt)<=30)) or
((dt1 is not null and (dt1-start_dt)<=30) and (dt2 is not null and (dt2-start_dt)<=30))
)

is the same as

((dt1 is not null and (dt1-start_dt)<=30) and (dt2 is not null and (dt2-start_dt)<=30))

3. Not identical, but you might want to write

. < dt1-start_dt <=30

instead of

dt1 is not null and (dt1-start_dt)<=30)

 

 

 

 

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 7 replies
  • 807 views
  • 4 likes
  • 6 in conversation