Solved: I'm stuck on this puzzling requirement

buechler66 · Posted 03-05-2019 01:43 PM

This is a tough one. I don't even know if this is possible. My requirement is to identify any instance where an ID switches from away from his PI, but later switches back to that same PI within 180 days. There can be any number of PI switches in between the original visit and the return visit.

So in the sample data below ID 165074 switches from PI 1699130682 to 1255743092 to PI 1043257041. But he never switches back to any of the earlier visited PIs.

However, ID 176368 switches from PI 1720270135 on THRU_DT 18DEC2017 to PI 1487654356, and later switches back to PI 1720270135 on FROM_DT 11MAY2018. And since the difference between these two dates is within 180 days I'd want to set an indicator of Recycle_ID=1 for this ID. The difference in days is calculated as (FROM_DT - THRU_DT of the repeat visits).

Does anyone have any ideas on how to accomplish this? I can't figure out a way to do this. Any help would be really appreciated.

In the example data below I've noted the record that would get the Recycle_Ind set to 1.

* Data sorted by ID and FROM_DT

ID PI FROM_DT THRU_DT
165074 1699130682 14SEP2017 12NOV2017
165074 1699130682 13NOV2017 11JAN2018
165074 1255743092 27FEB2018 27APR2018
165074 1255743092 28APR2018 26JUN2018
165074 1043257041 27JUN2018 25AUG2018
176368 1720270135 20OCT2017 18DEC2017
176368 1487654356 12MAR2018 10MAY2018
176368 1720270135 11MAY2018 09JUL2018 <= RECYCLE_IND=1 (11MAY18-18DEC17) IS <= 180
176368 1987654359 10JUL2018 07SEP2018

ErikLund_Jensen · Posted 03-06-2019 12:07 PM

Hi @buechler66

I am sorry it did'nt work, that was because I assumed that intervals were consecutive, which caused gaps between same values of PI to be marked also. The following code marks a recycle only if there has been (at least) one other PI value in between.

data have;
informat FROM_DT date9. THRU_DT date9.;
format FROM_DT date9. THRU_DT date9.;
input ID PI FROM_DT THRU_DT;
datalines;
123456 1234567890 14SEP2017 12NOV2017
123456 1234567890 15NOV2017 11JAN2018
123456 2345678901 27FEB2018 27APR2018
123456 2345678901 28APR2018 26JUN2018
123456 2345678901 27JUN2018 25AUG2018
234567 3234567890 14SEP2017 12FEB2018
234567 4345678901 20FEB2018 25FEB2018
234567 3234567890 12MAR2018 08MAY2018
234567 3234567890 11MAY2018 12MAY2018
234567 4345678901 13MAY2018 09JUL2018
234567 3234567890 10JUL2018 07SEP2018
;
run;

* Just to make sure input is in concecutive order by ID FROM_DT;
proc sort data=have out=w1;
	by ID FROM_DT;
run;

* Add a sequence number; 
data w2; set w1;
	Seq = _N_;
run;

* sort to group ID's on PI values;
proc sort data=w2 out=w3; 
	by ID PI FROM_DT;
run;

* Pick ID with recurrent PI, not consecutive, but within 180 days;
* Use sequence number to suppress gaps between same PID's;
data w4 (drop=OldPI OldThru OldSeq Seq); set w3; by ID PI;
	OldPI = lag(PI);
	OldThru = lag(THRU_DT);
	OldSeq = lag(Seq);

	if not first.PI and Seq ne (OldSeq+1) then do;
			if OldThru < (FROM_DT-1) and (FROM_DT - OldThru) < 180 then Recycle_ID = 1;
	end;
run;

* Sort to restore input order;
proc sort data=w4 out=want;
	by ID FROM_DT;
run;

View solution in original post

ErikLund_Jensen · Posted 03-05-2019 02:40 PM

Hi @buechler66

This is a problem where then LAG function is useful. It just marks one observation, and you will probably need some further processing to get your output.

data have;
	informat FROM_DT date9. THRU_DT date9.;
	format FROM_DT date9. THRU_DT date9.;
	input ID PI FROM_DT THRU_DT;
	datalines;
165074 1699130682 14SEP2017 12NOV2017
165074 1699130682 13NOV2017 11JAN2018
165074 1255743092 27FEB2018 27APR2018
165074 1255743092 28APR2018 26JUN2018
165074 1043257041 27JUN2018 25AUG2018
176368 1720270135 20OCT2017 18DEC2017
176368 1487654356 12MAR2018 10MAY2018
176368 1720270135 11MAY2018 09JUL2018
176368 1987654359 10JUL2018 07SEP2018
;
run;


* sort to group ID's on PI values;
proc sort data=have out=want; 
	by ID PI FROM_DT;
run;

* pick ID with recurrent PI, not concecutive, but within 180 days;
data w3 (drop=OldPI OldThru); set w1; by ID PI;
	OldPI = lag(PI);
	OldThru = lag(THRU_DT);
	if not first.PI then do;
		if OldThru < (FROM_DT-1) and (FROM_DT - OldThru) < 180 then Recycle_ID = 1;
	end;
run;

buechler66 · Posted 03-06-2019 06:30 AM

I will try this out today and reply back. Thank you.

buechler66 · Posted 03-06-2019 10:46 AM

Unfortunately this logic didn't work. I've included new example data and you can see that it's flagging too many rows with Recycle_Ind=1 . It has to do with this line of code: if OldThru < (FROM_DT-1) and (FROM_DT - OldThru) < 180 then Recycle_ID = 1;

This assumes that there will only be one day's difference between the current FROM_DT and the previous THRU_DT, when in fact there could be a larger gap. I extended the gap by more than 1 day for the records that are marked as incorrect below.

If you have any thoughts on how to correct this I'd love to get this working.

For the input data below the following records are identified:

ID PI FROM_DT THRU_DT

123456 1234567890 14SEP2017 12NOV2017
123456 1234567890 15NOV2017 11JAN2018 INCORRECTLY RECYCLE=1
123456 2345678901 27FEB2018 27APR2018
123456 2345678901 28APR2018 26JUN2018
123456 2345678901 27JUN2018 25AUG2018
234567 3234567890 14SEP2017 12FEB2018
234567 4345678901 20FEB2018 25FEB2018
234567 3234567890 12MAR2018 08MAY2018 RECYCLE=1
234567 3234567890 11MAY2018 12MAY2018 INCORRECTLY RECYCLE=1
234567 4345678901 13MAY2018 09JUL2018 RECYCLE=1
234567 3234567890 10JUL2018 07SEP2018 RECYCLE=1

data have;
informat FROM_DT date9. THRU_DT date9.;
format FROM_DT date9. THRU_DT date9.;
input ID PI FROM_DT THRU_DT;
datalines;
123456 1234567890 14SEP2017 12NOV2017
123456 1234567890 15NOV2017 11JAN2018
123456 2345678901 27FEB2018 27APR2018
123456 2345678901 28APR2018 26JUN2018
123456 2345678901 27JUN2018 25AUG2018
234567 3234567890 14SEP2017 12FEB2018
234567 4345678901 20FEB2018 25FEB2018
234567 3234567890 12MAR2018 08MAY2018
234567 3234567890 11MAY2018 12MAY2018
234567 4345678901 13MAY2018 09JUL2018
234567 3234567890 10JUL2018 07SEP2018
;
run;

* sort to group ID's on PI values;
proc sort data=have out=want;
by ID PI FROM_DT;
run;

* pick ID with recurrent PI, not concecutive, but within 180 days;
data w3 (drop=OldPI OldThru); set wANT; by ID PI;
OldPI = lag(PI);
OldThru = lag(THRU_DT);
if not first.PI then do;
if OldThru < (FROM_DT-1) and (FROM_DT - OldThru) < 180 then Recycle_ID = 1;
end;
run;

ErikLund_Jensen · Posted 03-06-2019 12:07 PM

Hi @buechler66

I am sorry it did'nt work, that was because I assumed that intervals were consecutive, which caused gaps between same values of PI to be marked also. The following code marks a recycle only if there has been (at least) one other PI value in between.

data have;
informat FROM_DT date9. THRU_DT date9.;
format FROM_DT date9. THRU_DT date9.;
input ID PI FROM_DT THRU_DT;
datalines;
123456 1234567890 14SEP2017 12NOV2017
123456 1234567890 15NOV2017 11JAN2018
123456 2345678901 27FEB2018 27APR2018
123456 2345678901 28APR2018 26JUN2018
123456 2345678901 27JUN2018 25AUG2018
234567 3234567890 14SEP2017 12FEB2018
234567 4345678901 20FEB2018 25FEB2018
234567 3234567890 12MAR2018 08MAY2018
234567 3234567890 11MAY2018 12MAY2018
234567 4345678901 13MAY2018 09JUL2018
234567 3234567890 10JUL2018 07SEP2018
;
run;

* Just to make sure input is in concecutive order by ID FROM_DT;
proc sort data=have out=w1;
	by ID FROM_DT;
run;

* Add a sequence number; 
data w2; set w1;
	Seq = _N_;
run;

* sort to group ID's on PI values;
proc sort data=w2 out=w3; 
	by ID PI FROM_DT;
run;

* Pick ID with recurrent PI, not consecutive, but within 180 days;
* Use sequence number to suppress gaps between same PID's;
data w4 (drop=OldPI OldThru OldSeq Seq); set w3; by ID PI;
	OldPI = lag(PI);
	OldThru = lag(THRU_DT);
	OldSeq = lag(Seq);

	if not first.PI and Seq ne (OldSeq+1) then do;
			if OldThru < (FROM_DT-1) and (FROM_DT - OldThru) < 180 then Recycle_ID = 1;
	end;
run;

* Sort to restore input order;
proc sort data=w4 out=want;
	by ID FROM_DT;
run;

buechler66 · Posted 03-06-2019 12:39 PM

Wow. This does appear to work. I'm going to do some testing and I'll follow up.

May I ask, why does the code still contain the condition "OLD_THRU < (CLM_FROM_DT-1)"? I'm curious why this condition is still necessary?

ErikLund_Jensen · Posted 03-06-2019 01:45 PM

Hi @buechler66

I just forgot to remove it, because it does no harm. It had a purpose in my first attempt, where I thought the intervals were consecutive, so a gap meant that there was an interval with another PI in between. But the check on sequence number makes it obsolete.

buechler66 · Posted 03-06-2019 04:54 PM

Thanks so much for taking the time to help me with this requirement. I'm amazed how you are able to figure this stuff out so quickly. I learned a lot by dissecting your code example. Thank you, thank you, thank you!

PGStats · Posted 03-05-2019 06:10 PM

Not obvious indeed. But here is a shot:

data have;
input ID PI (FROM_DT THRU_DT) (:date9.);
format FROM_DT THRU_DT yymmdd10. pi 10.;
datalines;
165074 1699130682 14SEP2017 12NOV2017
165074 1699130682 13NOV2017 11JAN2018
165074 1255743092 27FEB2018 27APR2018
165074 1255743092 28APR2018 26JUN2018
165074 1043257041 27JUN2018 25AUG2018
176368 1720270135 20OCT2017 18DEC2017
176368 1487654356 12MAR2018 10MAY2018
176368 1720270135 11MAY2018 09JUL2018
176368 1987654359 10JUL2018 07SEP2018
;

proc sql;
select 
    *,
    exists (    
        select * 
        from 
            have as b inner join
            have as c 
                on c.id = b.id and c.pi ne b.pi and c.from_dt < b.from_dt
        where   b.id = a.id and 
                b.pi = a.pi and 
                intck("DAY", a.thru_dt, b.from_dt) between 1 and 180 and
                c.thru_dt > a.thru_dt   
        ) as recycle
from have as a;
quit;

                     ID          PI     FROM_DT     THRU_DT   recycle
               ------------------------------------------------------
                 165074  1699130682  2017-09-14  2017-11-12         0
                 165074  1699130682  2017-11-13  2018-01-11         0
                 165074  1255743092  2018-02-27  2018-04-27         0
                 165074  1255743092  2018-04-28  2018-06-26         0
                 165074  1043257041  2018-06-27  2018-08-25         0
                 176368  1720270135  2017-10-20  2017-12-18         1
                 176368  1487654356  2018-03-12  2018-05-10         0
                 176368  1720270135  2018-05-11  2018-07-09         0
                 176368  1987654359  2018-07-10  2018-09-07         0

Would require testing with a complete range of possible cases.

PG

buechler66 · Posted 03-06-2019 06:31 AM

This also looks like it will work. I'll try this out today too. Thank you.

buechler66 · Posted 03-06-2019 10:58 AM

Hi. Unfortunately this isn't working either when I tested it with this second set of test records. Here's what I'm expecting to be flagged based on the requirement. If you run the code below it is flagging differently. Any suggestions based on this?

123456 1234567890 14SEP2017 12NOV2017
123456 1234567890 15NOV2017 11JAN2018
123456 2345678901 27FEB2018 27APR2018
123456 2345678901 28APR2018 26JUN2018
123456 2345678901 27JUN2018 25AUG2018
234567 3234567890 14SEP2017 12FEB2018
234567 4345678901 20FEB2018 25FEB2018
234567 3234567890 12MAR2018 08MAY2018 Recycle_Ind=1
234567 3234567890 11MAY2018 12MAY2018
234567 4345678901 13MAY2018 09JUL2018 Recycle_Ind=1
234567 3234567890 10JUL2018 07SEP2018 Recycle_Ind=1

data have;
input ID PI (FROM_DT THRU_DT) (:date9.);
format FROM_DT THRU_DT yymmdd10. pi 10.;
datalines;
123456 1234567890 14SEP2017 12NOV2017
123456 1234567890 15NOV2017 11JAN2018
123456 2345678901 27FEB2018 27APR2018
123456 2345678901 28APR2018 26JUN2018
123456 2345678901 27JUN2018 25AUG2018
234567 3234567890 14SEP2017 12FEB2018
234567 4345678901 20FEB2018 25FEB2018
234567 3234567890 12MAR2018 08MAY2018
234567 3234567890 11MAY2018 12MAY2018
234567 4345678901 13MAY2018 09JUL2018
234567 3234567890 10JUL2018 07SEP2018
;

proc sql;
select
*,
exists (
select *
from
have as b inner join
have as c
on c.id = b.id and c.pi ne b.pi and c.from_dt < b.from_dt
where b.id = a.id and
b.pi = a.pi and
intck("DAY", a.thru_dt, b.from_dt) between 1 and 180 and
c.thru_dt > a.thru_dt
) as recycle
from have as a;
quit;

PGStats · Posted 03-06-2019 03:06 PM

Please explain

123456 1234567890 14SEP2017 12NOV2017
123456 1234567890 15NOV2017 11JAN2018
123456 2345678901 27FEB2018 27APR2018
123456 2345678901 28APR2018 26JUN2018
123456 2345678901 27JUN2018 25AUG2018
234567 3234567890 14SEP2017 12FEB2018
234567 4345678901 20FEB2018 25FEB2018
234567 3234567890 12MAR2018 08MAY2018 Recycle_Ind=1
234567 3234567890 11MAY2018 12MAY2018 WHY? 11MAY2018 - 12FEB2018 < 180 days
234567 4345678901 13MAY2018 09JUL2018 Recycle_Ind=1
234567 3234567890 10JUL2018 07SEP2018 Recycle_Ind=1

PG

buechler66 · Posted 03-06-2019 04:50 PM

Hi. It's because in that case the PI was visited on two consecutive visits in a row. The requirement is Recycle_Ind=1 when the ID visits different PI in between visits. Since the ID 234567 visited PI 3234567890 two consecutive times the second record should not get marked as a recycled. Sorry if I wasn't clear on that. I do really appreciate your time in providing a solution. You have helped me so often in the past.

mkeintz · Posted 03-05-2019 07:34 PM

If the data is already sorted by ID/FROM_DT then you only need to save the most recent THRU_DT + 180 for each PI, which can be tested against the incoming FROM_DT for the same PI

data have;
	informat FROM_DT date9. THRU_DT date9.;
	format FROM_DT date9. THRU_DT date9.;
	input ID PI FROM_DT THRU_DT;
	datalines;
165074 1699130682 14SEP2017 12NOV2017
165074 1699130682 13NOV2017 11JAN2018
165074 1255743092 27FEB2018 27APR2018
165074 1255743092 28APR2018 26JUN2018
165074 1043257041 27JUN2018 25AUG2018
176368 1720270135 20OCT2017 18DEC2017
176368 1487654356 12MAR2018 10MAY2018
176368 1720270135 11MAY2018 09JUL2018
176368 1987654359 10JUL2018 07SEP2018
run;

data want (drop=_:);
  set have;
  by id pi notsorted;
  if _n_=1 then do;
    _thru_plus_180=.;
    declare hash h ();
      h.definekey('pi');
      h.definedata('_thru_plus_180');
      h.definedone();
  end;
  if h.find()=0 and first.pi=1 and from_dt<=_thru_plus_180 then repeat='Y';
  else repeat=' ';
  if last.pi then h.replace(key:pi,data:thru_dt+180);
run;

The hash object is keyed on the PI values, and the program assumes that no PI is attached to more than one ID. Otherwise you have to modify the above to avoid contaminating tests for one ID with dates from a prior ID, as here:
change the h.definekey method to:
h.definekey('ID','PI');
and change the h.replace method to
h.replace(key:id,key:pi,data:thru_dt+180);
The "SET HAVE; BY ID PI NOTSORTED;" statements tell sas to expect the data in groups, not neccessarily sorted. This is simply a tool to generate the dummy variables first.pi and last.pi. So you can test date comparisons every time a PI group starts, and also update the hash object every time you are at the end of a PI group.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

buechler66 · Posted 03-06-2019 06:34 AM

This looks interesting, but I have no idea what it's doing. This is the first time I've ever seen this 'hash' concept. I'm afraid this is over my head. 😞

I'm stuck on this puzzling requirement

Re: I'm stuck on this puzzling requirement

Re: I'm stuck on this puzzling requirement

Re: I'm stuck on this puzzling requirement

Re: I'm stuck on this puzzling requirement

Re: I'm stuck on this puzzling requirement

Re: I'm stuck on this puzzling requirement

Re: I'm stuck on this puzzling requirement

Re: I'm stuck on this puzzling requirement

Re: I'm stuck on this puzzling requirement

Re: I'm stuck on this puzzling requirement

Re: I'm stuck on this puzzling requirement

Re: I'm stuck on this puzzling requirement

Re: I'm stuck on this puzzling requirement

Re: I'm stuck on this puzzling requirement

Re: I'm stuck on this puzzling requirement

Classroom Training Available!