Re: Removing rows that happened within two days of the previous occurr...

catkat96 · Posted 03-01-2021 11:27 AM

It's a bit tricky to explain this, but I'll try my best.

I have a dataset that contains customer ID's, days in which those customers were called and a dummy column when 1 is when we talked to a person and 0 is when we didn't.

Sometimes, we will talk to a person a certain day and also the next day, for example. I would like to keep only 1 observation out of those we actually talked to and are in a range of 2 days between the first one.

Here's an example of what I mean.

data Have;
input ID callDate :ddmmyy. Contacted ;
format callDate ddmmyy10.;
datalines4;
001 30/11/2020 0
001 01/12/2020 1
001 30/12/2020 1
001 31/12/2020 1
001 01/01/2021 1
002 01/02/2021 1
002 02/02/2021 1
002 15/02/2021 1
;;;
run;

data Want;
input ID callDate :ddmmyy. Contacted ;
format callDate ddmmyy10.;
datalines4;
001 01/12/2020 1
001 01/01/2021 1
002 02/02/2021 1
002 15/02/2021 1
;;;
run;

Want would be the result I expect. As you can see, for customer 001 I removed the rows belonging to 30/12 and 31/12 because there was a contact on 01/02 and, if there is a contact made with a client for 3 or less consecutive days, I want to only keep the last one.

Does that make sense? To be fair, I don't even know where to start to get this, any ideas? I would like to use proc sql for this since it's what I've been learning so far.

Thank you very much in advance.

mkeintz · Posted 03-01-2021 12:28 PM

You state:

As you can see, for customer 001 I removed the rows belonging to 30/01 and 31/01

but I don't see either a 30/01 or 31/01 record.

Also why are you not keeping 31/12/2020?

I believe another way of saying what you want is to keep only records that are not followed on the next day or the day after next by a successful contact. So what do you want to do if you have six consecutive days with successful contacts on days 2, 4, and 6? Do you keep day 6 only?

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

catkat96 · Posted 03-01-2021 01:00 PM

Hi, thanks for your answer. I can't see this topic anymore for some reason (whenever I click on it, all the text and everything is gone, I literally don't see anything) so I'll play by memory and I'm replying through my email, so bare with me!

Are there no 30/01 and 31/01 rows on the "Have"? I'm not keeping 31/12 because it was not a successful contact.

"I believe another way of saying what you want is to keep only records that are not followed on the next day or the day after next by a successful contact. " That is absolutely correct.

So for 6 consecutive days like these:
day 1 - 0
day 2 - 1
day 3 - 0
day 4 - 1
day 5 - 0
day 6 - 1

yes I would only keep day 6. It is extremely unlikely this happens on my full dataset but the logic is the same.

catkat96 · Posted 03-01-2021 01:29 PM

you were right, it was a typo. It's fixed now and it makes more sense.

mkeintz · Posted 03-01-2021 02:12 PM

Since your data are sorted by ID, I think you'd be better off with a single data step. It would read through each ID twice. The first pass would identify taboo dates (dates that are 1 or 2 days prior to a successful contact). The second pass would keep all records not included in the taboo dates.

Keeping a _temporary_ array (an array whose values are not output, and whose values are not reset to missing with each incoming observation) indexed by CALLDATE is an easy way to track taboo dates:

data Have;
input ID callDate :ddmmyy. Contacted ;
format callDate ddmmyy10.;
datalines4;
001 30/11/2020 0
001 01/12/2020 1  *
001 30/12/2020 1
001 31/12/2020 1
001 01/01/2021 1  *
002 01/02/2021 1
002 02/02/2021 1
002 15/02/2021 1
;;;;
run;

data want;
  set have (where=(contacted=1) in=firstpass)
      have (in=secondpass);
  by id;

  array taboodate {%sysevalf("01jan2020"d):%sysevalf("31dec2021"d)} _temporary_;
  if first.id then call missing(of taboodate{*});

  if firstpass then taboodate{calldate-2}=1;
  if firstpass then taboodate{calldate-1}=1;
  
  if secondpass=1 and taboodate{calldate}^=1;
run;

Note the array statement tells sas to form the lower and upper boundaries based on the internal values of 01jan2020 and 31dec2021 (using the %sysevalf macro function to translate a date literal ("01jan2020"d) to the internal value). Just make sure to use dates that span your calldate range.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Ksharp · Posted 03-02-2021 08:02 AM

If I understood what you mean .

data Have;
input ID callDate :ddmmyy. Contacted ;
format callDate ddmmyy10.;
datalines;
001 30/11/2020 0
001 01/12/2020 1
001 30/12/2020 1
001 31/12/2020 1
001 01/01/2021 1
002 01/02/2021 1
002 02/02/2021 1
002 15/02/2021 1
;
run;
data temp; 
 set have;
 by id Contacted notsorted;
 dif=dif(callDate);
 if first.Contacted or dif>1 then group+1;
run;
data want; 
 set temp;
 by group;
 if last.group and  Contacted=1;
run;

catkat96 · Posted 03-09-2021 04:57 AM

Hi Ksharp,

This solution is very good. I do had to modify it a bit but I'm still encountering a mistake, maybe you can help me:

data temp;
 set have1;
 by idnotsorted;
 dif2=dif(callDate);
 if first.contacted or first.id or dif2>1 or dif2<0  then group2+1;
 if first.contacted or first.id and dif2=2  then group2=group2-1;
run;
data want;
 set temp;
/* drop diff group contacted;*/
 by group2;
 if last.group2 and contacted=1 then contacted2=1;
 else contacted2=0;
run;

This breaks in a case where:

data Have;
input ID callDate :ddmmyy. Contacted ;
format callDate ddmmyy10.;
datalines;
001 28/01/2021 1
001 30/01/2021 1
001 01/02/2021 1
002 24/11/2020 1
;
run;

The "group2" variable results in:

data Have;
input ID callDate :ddmmyy. Contacted diff group2 cured2;
format callDate ddmmyy10.;
datalines;
001 28/01/2021 1 1 147 1
001 30/01/2021 1 2 148 1 
001 01/02/2021 1 2 149 1
002 24/11/2020 1 -69 150 1
;
run;

when it should only assign a 1 to 001 - 01/02/2021 and 002. Do you have any idea how I could fix this?

Thank you so much

Ksharp · Posted 03-09-2021 05:45 AM

Your code was not right .

you don't have 'contacted' in BY statement.

And make sure you have sorted by ID callDate.

data Have1;
input ID callDate :ddmmyy. Contacted ;
format callDate ddmmyy10.;
datalines;
001 28/01/2021 1
001 30/01/2021 1
001 01/02/2021 1
002 24/11/2020 1
;
run;

data temp;
 set have1;
 by id contacted notsorted;
 dif2=dif(callDate);
 if first.contacted  or dif2>1  then group2+1;
run;
data want;
 set temp;
/* drop diff group contacted;*/
 by group2;
 if last.group2 and contacted=1 then contacted2=1;
 else contacted2=0;
run;

catkat96 · Posted 03-09-2021 06:10 AM

That didn't work as well, if you can see the "contacted2" column will only have 1s, which is not the expected outcome.

The thing that worked for me was:

data temp;
 set have1;
 by id notsorted;
 dif2=dif(callDate);
 if first.contacted or first.id or dif2>1 or dif2<0  then group2+1;
 if dif2=2  then group2=group2-1;
run;
data want;
 set temp;
/* drop diff group contacted;*/
 by group2;
 if last.group2 and contacted=1 then contacted2=1;
 else contacted2=0;
run;

Ksharp · Posted 03-09-2021 06:25 AM

"if first.contacted" is not right.
You should have
by id contacted notsorted;

otherwise, log would receive NOTE: contacted is uninitialize .

And what is your logic behind your code. and better post the output you want to see.

catkat96 · Posted 03-09-2021 06:59 AM

I keep having a similar issue on another data where I need to perform the same thing.

The data is this, where PAY2 column is the expected outcome.

data Have2;
input ID callDate :ddmmyy. pay pay2;
format callDate ddmmyy10.;
datalines;
011 29/12/2020 1 1
011 04/01/2021 0 0
011 25/01/2021 1 0
011 27/01/2021 1 1
011 29/01/2021 0 0
;
run;

With this code:

data temp;
 set have2;
 by agreement_number pay notsorted;
 dif5=dif(dialleddate);
 if first.pay or first.id or dif5>1 or dif5<0  then group5+1;
run;

data want;
 set temp;
 by group5 notsorted;
 if last.group5 and Pay=1 then Pay2=1;
 else Pay2=0;
run;

The outcome I am getting is:

data want;
input ID callDate :ddmmyy. pay dif5 group5 pay2;
format callDate ddmmyy10.;
datalines;
11	29/12/2020	1	.	1	1
11	04/01/2021	0	6	2	0
11	25/01/2021	1	21	3	1
11	27/01/2021	1	2	4	1
11	29/01/2021	0	2	5	0
;
run;

As you can see, nothing is changed. I tried a few things (including what I did before) but couldn't figure it out.

mkeintz · Posted 03-09-2021 01:46 PM

The code you show

data temp;
 set have2;
 by agreement_number pay notsorted;
 dif5=dif(dialleddate);
 if first.pay or first.id or dif5>1 or dif5<0  then group5+1;
run;

does a test for first.id. But there is no such variable. So if you expected first.id it to be part of a valid test, it didn't happen.

Did you look at your log? The above should have produced a NOTE..

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Ksharp · Posted 03-10-2021 05:52 AM

Can you explain how to get that PAY2 ?

data Have2;
input ID callDate :ddmmyy. pay ;
format callDate ddmmyy10.;
datalines;
011 29/12/2020 1 1
011 04/01/2021 0 0
011 25/01/2021 1 0
011 27/01/2021 1 1
011 29/01/2021 0 0
;
run;

data want;
set have2;
by id pay notsorted;
if last.pay and pay=1 then pay2=1;
else pay2=0;
run;

Removing rows that happened within two days of the previous occurrence

Re: Removing rows that happened within two days of the previous occurrence

Re: Removing rows that happened within two days of the previous occurrence

Re: Removing rows that happened within two days of the previous occurrence

Re: Removing rows that happened within two days of the previous occurrence

Re: Removing rows that happened within two days of the previous occurrence

Re: Removing rows that happened within two days of the previous occurrence

Re: Removing rows that happened within two days of the previous occurrence

Re: Removing rows that happened within two days of the previous occurrence

Re: Removing rows that happened within two days of the previous occurrence

Re: Removing rows that happened within two days of the previous occurrence

Re: Removing rows that happened within two days of the previous occurrence

Re: Removing rows that happened within two days of the previous occurrence

Classroom Training Available!