Solved: How to shorten my code without hardcoding? Date Manipulation

annaleticia · Posted 06-17-2021 01:56 PM

I need to compare multiple dates + inclusion/exclusion and hoping i can have a shorter code without hardcoding

Must have one m1_dt and at least two dates "dt#" within 30 days after the start_dt
Must compare all date variables (dt1 to dt4) and identify if the dates were within 30 days
Exclusion Code = 'DDD','EEE','FFF
Identifiers
1. yes with exlusion code + 1m and at least 2 dt within 30 days
2. yes with 1m and at least 2 dt within 30 days
3. no with exclusion code
4. no without exclusion code

Please run the code below to see the expected result

data dataset1;
input id $1-3 start_dt 4-12 m1_dt 13-21 dt1 22-30 dt1_code $31-34 dt2 35-43 dt2_code $44-47 dt3 48-56 dt3_code $57-60 dt4 61-69 dt4_code $70-73;
datalines;
911 20200401 20200410 20200408 AAA 20200426	AAA 20200504 EEE 20200603 EEE
912	20200401 20200411 20200409 AAA 20200426	AAA	20200506 AAA 20200508 BBB
913	20200401 20200409 20200408 BBB 20200426	CCC	20200504 EEE 20200506 EEE
914	20200401 20200423 20200421 BBB 20200422	CCC
915	20200401 20200410 20200408 DDD 20200426	DDD
916	20200401 20200410 20200408 DDD
917	20200401 20200410 20200408 AAA
918	20200401 20200410 20200408 CCC
;
run;

proc sql;
create table dataset2 as select distinct
id,		
input(put(start_dt,8.),yymmdd8.) as start_dt format mmddyy10.,	
input(put(m1_dt,8.),yymmdd8.) as m1_dt format mmddyy10.,
input(put(dt1,8.),yymmdd8.) as dt1 format mmddyy10.,	
dt1_code,
input(put(dt2,8.),yymmdd8.) as dt2 format mmddyy10.,		
dt2_code,	
input(put(dt3,8.),yymmdd8.) as dt3 format mmddyy10.,	
dt3_code,	
input(put(dt4,8.),yymmdd8.) as dt4 format mmddyy10.,
dt4_code
from dataset1; quit;

%let exlude=('DDD','EEE','FFF');

/*expected result*/
proc sql;
create table expected_result as select distinct
a.*
,case when ((m1_dt is not null and (m1_dt-start_dt)<=30) 
and (((dt1 is not null and (dt1-start_dt)<=30) and (dt2 is not null and (dt2-start_dt)<=30) and 
	(dt3 is not null and (dt3-start_dt)<=30) and (dt4 is not null and (dt4-start_dt)<=30)) or 

	((dt1 is not null and (dt1-start_dt)<=30) and (dt2 is not null and (dt2-start_dt)<=30) and 
	(dt3 is not null and (dt3-start_dt)<=30)) or 

	((dt1 is not null and (dt1-start_dt)<=30) and (dt2 is not null and (dt2-start_dt)<=30)))
and (dt1_code not in &exlude. and dt2_code not in &exlude. and dt3_code not in &exlude. and dt4_code not in &exlude.)) 
then 'yes with 1m and at least 2 dt within 30 days' 

when ((m1_dt is not null and (m1_dt-start_dt)<=30) 
and (((dt1 is not null and (dt1-start_dt)<=30) and (dt2 is not null and (dt2-start_dt)<=30) and 
	(dt3 is not null and (dt3-start_dt)<=30) and (dt4 is not null and (dt4-start_dt)<=30)) or 

	((dt1 is not null and (dt1-start_dt)<=30) and (dt2 is not null and (dt2-start_dt)<=30) and 
	(dt3 is not null and (dt3-start_dt)<=30)) or 

	((dt1 is not null and (dt1-start_dt)<=30) and (dt2 is not null and (dt2-start_dt)<=30)))
and (dt1_code in &exlude. or dt2_code in &exlude. or dt3_code in &exlude. or dt4_code in &exlude.)) 
then 'yes with exclusion code + 1m and at least 2 dt within 30 days'

when (dt1_code in &exlude. or dt2_code in &exlude. or dt3_code in &exlude. or dt4_code in &exlude.) 
then 'no with exclusion code'

when (dt1_code not in &exlude. and dt2_code not in &exlude. and dt3_code not in &exlude. and dt4_code not in &exlude.) 
then 'no without exclusion code'
else 'no' end as identifier
from dataset2 a;quit;

ballardw · Posted 06-17-2021 03:32 PM

First question would be why are your "date" variables not already date values?

What does m1_dt, dt1 and dt2 (etc) names actually mean. It may give us a clue to better file structure.

You really don't provide much in the way of what seems "right" for the data but this may give you a clue to one way to get a count of days less within 30:

data example;
   input a x y z;
   count= sum( 0 le x-a le 30,0 le y-a le 30,0 le z-a le 30);
datalines;
25 29 45 60
25 .  45 60
25 .  .  60
25 69 65 60
25 29 .  60
;

This sort of works because each comparison will return a value of 1 or 0. Then sums those.

Comment on "within": Typically I would expect an ABSOLUTE value for something like (x-y) to get the actual interval. You are only testing that startdate is less than 30 days prior to dt. So I is there something in your data you have not stated that says dt1 and such are only larger than startdate when present???

Since SAS date values are integer numbers of days you can test if a group of variables "are within 30 days" with the Range function:

if range(dt1,dt2,dt3,dt4) le 30 then <whatever>. This will ignore missing values .

The range function is the result of the largest value subtracting the smallest.

Your exclusion and identifier comments aren't very clear.

I also don't see anything in this that really suggests SQL is a better solution than a data step.

View solution in original post

PaigeMiller · Posted 06-17-2021 02:49 PM

I have to admit, I'm not going to dig through this entire mass of code to see where it can be simplified.

I will throw out my gut feeling that if you do this in a DATA step rather than SQL, then you have the feature called ARRAY and then it doesn't matter how many dtxxx variables you have, you can handle them all in an ARRAY, you program it one time and loop through. No hardcoding. You also can use the WHICHC function to search across all the dtxxx variables.

Another thought is always Maxim 19, "Long beats Wide", which means instead of having dt1 through dt4 on one row, you have separate rows with an indetifier 1 through xxx, and a variable dt. This can greatly simplify your programming and again lets you avoid hard-coding stuff. There are many many examples of Long beats Wide here in the SAS Communities.

--
Paige Miller

Reeza · Posted 06-17-2021 03:01 PM

Here's a tutorial on using Arrays in SAS
https://stats.idre.ucla.edu/sas/seminars/sas-arrays/

Definitely need arrays and a data step.

If you had a long format then SQL would actually work better but you don't so in this case, a data step would be more efficient.

ballardw · Posted 06-17-2021 03:32 PM

First question would be why are your "date" variables not already date values?

What does m1_dt, dt1 and dt2 (etc) names actually mean. It may give us a clue to better file structure.

You really don't provide much in the way of what seems "right" for the data but this may give you a clue to one way to get a count of days less within 30:

data example;
   input a x y z;
   count= sum( 0 le x-a le 30,0 le y-a le 30,0 le z-a le 30);
datalines;
25 29 45 60
25 .  45 60
25 .  .  60
25 69 65 60
25 29 .  60
;

This sort of works because each comparison will return a value of 1 or 0. Then sums those.

Comment on "within": Typically I would expect an ABSOLUTE value for something like (x-y) to get the actual interval. You are only testing that startdate is less than 30 days prior to dt. So I is there something in your data you have not stated that says dt1 and such are only larger than startdate when present???

Since SAS date values are integer numbers of days you can test if a group of variables "are within 30 days" with the Range function:

if range(dt1,dt2,dt3,dt4) le 30 then <whatever>. This will ignore missing values .

The range function is the result of the largest value subtracting the smallest.

Your exclusion and identifier comments aren't very clear.

I also don't see anything in this that really suggests SQL is a better solution than a data step.

annaleticia · Posted 06-17-2021 04:07 PM

So as long as the ID has one m1_dt and at least two dt# within 30 days after the start_dt.

I'm using the exclusion list ('DDD','EEE','FFF') to separate or flag the ids "see the identifier column below"

yes with exclusion code + 1m and at least 2 dt within 30 days
yes with 1m and at least 2 dt within 30 days
no with exclusion code
no without exclusion code

Expected result:

Patrick · Posted 06-17-2021 05:44 PM

Below code returns the same result than your desired SQL.

I've converted your case statements to data step logic - and while doing this realized that due to the OR conditions you actually only need to check the day differences for variables dt1 and dt2. ...not sure if that's what you've intended to do - but it's not changing the logic you've provided.

data have;
  infile datalines dsd dlm=' ' truncover;
  input 
    @1 id $3. 
    @5 start_dt yymmdd8. 
    @14 m1_dt yymmdd8.
    @23 dt1 yymmdd8.
    @32 dt1_code $3. 
    @36 dt2 yymmdd8.
    @45 dt2_code $3. 
    @49 dt3 yymmdd8.
    @58 dt3_code $3.
    @62 dt4 yymmdd8.
    @71 dt4_code $3.
    ;
  format start_dt m1_dt dt1-dt4 date9.;
  datalines;
911 20200401 20200410 20200408 AAA 20200426 AAA 20200504 EEE 20200603 EEE
912 20200401 20200411 20200409 AAA 20200426 AAA 20200506 AAA 20200508 BBB
913 20200401 20200409 20200408 BBB 20200426 CCC 20200504 EEE 20200506 EEE
914 20200401 20200423 20200421 BBB 20200422 CCC
915 20200401 20200410 20200408 DDD 20200426 DDD
916 20200401 20200410 20200408 DDD
917 20200401 20200410 20200408 AAA
918 20200401 20200410 20200408 CCC
;

proc format;
  value case
    1='yes with 1m and at least 2 dt within 30 days'
    2='yes with exclusion code + 1m and at least 2 dt within 30 days'
    3='no with exclusion code'
    4='no without exclusion code'
    5='no'
    ;
run;

%let exclude=('DDD','EEE','FFF');
data want;
  set have;
  _dt_code_flg= (
                  dt1_code in &exclude or
                  dt2_code in &exclude or
                  dt3_code in &exclude or
                  dt4_code in &exclude 
                );

  if 
    .<m1_dt-start_dt<=30  and
    (
/*      .<dt1-start_dt<=30    and*/
/*      .<dt2-start_dt<=30    and*/
/*      .<dt3-start_dt<=30    and*/
/*      .<dt4-start_dt<=30   */
/*      or*/
/*      .<dt1-start_dt<=30    and*/
/*      .<dt2-start_dt<=30    and*/
/*      .<dt3-start_dt<=30   */
/*      or*/
      .<dt1-start_dt<=30    and
      .<dt2-start_dt<=30    
    )
    then
    do;
      if _dt_code_flg=0 then case=1;
      else
      if _dt_code_flg=1 then case=2;
    end;

  else
  if _dt_code_flg=1 then case=3;

  else
  if _dt_code_flg=0 then case=4;

  else
    case=5;

  identifier=put(case,case. -l);
  drop case _dt_code_flg;
run;

annaleticia · Posted 06-17-2021 10:21 PM

Thank you. This is so helpful

ChrisNZ · Posted 06-17-2021 07:36 PM

1. Why not use the informat when reading the cards data rather than create dataset2?

2. This

(((dt1 is not null and (dt1-start_dt)<=30) and (dt2 is not null and (dt2-start_dt)<=30) and (dt3 is not null and (dt3-start_dt)<=30) and (dt4 is not null and (dt4-start_dt)<=30)) or
((dt1 is not null and (dt1-start_dt)<=30) and (dt2 is not null and (dt2-start_dt)<=30) and (dt3 is not null and (dt3-start_dt)<=30)) or
((dt1 is not null and (dt1-start_dt)<=30) and (dt2 is not null and (dt2-start_dt)<=30))
)

is the same as

((dt1 is not null and (dt1-start_dt)<=30) and (dt2 is not null and (dt2-start_dt)<=30))

3. Not identical, but you might want to write

. < dt1-start_dt <=30

instead of

dt1 is not null and (dt1-start_dt)<=30)

High-Performance SAS Coding - Third Edition

How to shorten my code without hardcoding? Date Manipulation

Re: How to shorten my code without hardcoding? Date Manipulation

Re: How to shorten my code without hardcoding? Date Manipulation

Re: How to shorten my code without hardcoding? Date Manipulation

Re: How to shorten my code without hardcoding? Date Manipulation

Re: How to shorten my code without hardcoding? Date Manipulation

Re: How to shorten my code without hardcoding? Date Manipulation

Re: How to shorten my code without hardcoding? Date Manipulation

Re: How to shorten my code without hardcoding? Date Manipulation

Ready to join fellow brilliant minds for the SAS Hackathon?

Classroom Training Available!