It's a bit tricky to explain this, but I'll try my best.
I have a dataset that contains customer ID's, days in which those customers were called and a dummy column when 1 is when we talked to a person and 0 is when we didn't.
Sometimes, we will talk to a person a certain day and also the next day, for example. I would like to keep only 1 observation out of those we actually talked to and are in a range of 2 days between the first one.
Here's an example of what I mean.
data Have;
input ID callDate :ddmmyy. Contacted ;
format callDate ddmmyy10.;
datalines4;
001 30/11/2020 0
001 01/12/2020 1
001 30/12/2020 1
001 31/12/2020 1
001 01/01/2021 1
002 01/02/2021 1
002 02/02/2021 1
002 15/02/2021 1
;;;
run;
data Want;
input ID callDate :ddmmyy. Contacted ;
format callDate ddmmyy10.;
datalines4;
001 01/12/2020 1
001 01/01/2021 1
002 02/02/2021 1
002 15/02/2021 1
;;;
run;
Want would be the result I expect. As you can see, for customer 001 I removed the rows belonging to 30/12 and 31/12 because there was a contact on 01/02 and, if there is a contact made with a client for 3 or less consecutive days, I want to only keep the last one.
Does that make sense? To be fair, I don't even know where to start to get this, any ideas? I would like to use proc sql for this since it's what I've been learning so far.
Thank you very much in advance.
You state:
As you can see, for customer 001 I removed the rows belonging to 30/01 and 31/01
but I don't see either a 30/01 or 31/01 record.
Also why are you not keeping 31/12/2020?
I believe another way of saying what you want is to keep only records that are not followed on the next day or the day after next by a successful contact. So what do you want to do if you have six consecutive days with successful contacts on days 2, 4, and 6? Do you keep day 6 only?
Since your data are sorted by ID, I think you'd be better off with a single data step. It would read through each ID twice. The first pass would identify taboo dates (dates that are 1 or 2 days prior to a successful contact). The second pass would keep all records not included in the taboo dates.
Keeping a _temporary_ array (an array whose values are not output, and whose values are not reset to missing with each incoming observation) indexed by CALLDATE is an easy way to track taboo dates:
data Have;
input ID callDate :ddmmyy. Contacted ;
format callDate ddmmyy10.;
datalines4;
001 30/11/2020 0
001 01/12/2020 1 *
001 30/12/2020 1
001 31/12/2020 1
001 01/01/2021 1 *
002 01/02/2021 1
002 02/02/2021 1
002 15/02/2021 1
;;;;
run;
data want;
set have (where=(contacted=1) in=firstpass)
have (in=secondpass);
by id;
array taboodate {%sysevalf("01jan2020"d):%sysevalf("31dec2021"d)} _temporary_;
if first.id then call missing(of taboodate{*});
if firstpass then taboodate{calldate-2}=1;
if firstpass then taboodate{calldate-1}=1;
if secondpass=1 and taboodate{calldate}^=1;
run;
Note the array statement tells sas to form the lower and upper boundaries based on the internal values of 01jan2020 and 31dec2021 (using the %sysevalf macro function to translate a date literal ("01jan2020"d) to the internal value). Just make sure to use dates that span your calldate range.
If I understood what you mean .
data Have;
input ID callDate :ddmmyy. Contacted ;
format callDate ddmmyy10.;
datalines;
001 30/11/2020 0
001 01/12/2020 1
001 30/12/2020 1
001 31/12/2020 1
001 01/01/2021 1
002 01/02/2021 1
002 02/02/2021 1
002 15/02/2021 1
;
run;
data temp;
set have;
by id Contacted notsorted;
dif=dif(callDate);
if first.Contacted or dif>1 then group+1;
run;
data want;
set temp;
by group;
if last.group and Contacted=1;
run;
Hi Ksharp,
This solution is very good. I do had to modify it a bit but I'm still encountering a mistake, maybe you can help me:
data temp;
set have1;
by idnotsorted;
dif2=dif(callDate);
if first.contacted or first.id or dif2>1 or dif2<0 then group2+1;
if first.contacted or first.id and dif2=2 then group2=group2-1;
run;
data want;
set temp;
/* drop diff group contacted;*/
by group2;
if last.group2 and contacted=1 then contacted2=1;
else contacted2=0;
run;
This breaks in a case where:
data Have;
input ID callDate :ddmmyy. Contacted ;
format callDate ddmmyy10.;
datalines;
001 28/01/2021 1
001 30/01/2021 1
001 01/02/2021 1
002 24/11/2020 1
;
run;
The "group2" variable results in:
data Have;
input ID callDate :ddmmyy. Contacted diff group2 cured2;
format callDate ddmmyy10.;
datalines;
001 28/01/2021 1 1 147 1
001 30/01/2021 1 2 148 1
001 01/02/2021 1 2 149 1
002 24/11/2020 1 -69 150 1
;
run;
when it should only assign a 1 to 001 - 01/02/2021 and 002. Do you have any idea how I could fix this?
Thank you so much
Your code was not right .
you don't have 'contacted' in BY statement.
And make sure you have sorted by ID callDate.
data Have1; input ID callDate :ddmmyy. Contacted ; format callDate ddmmyy10.; datalines; 001 28/01/2021 1 001 30/01/2021 1 001 01/02/2021 1 002 24/11/2020 1 ; run; data temp; set have1; by id contacted notsorted; dif2=dif(callDate); if first.contacted or dif2>1 then group2+1; run; data want; set temp; /* drop diff group contacted;*/ by group2; if last.group2 and contacted=1 then contacted2=1; else contacted2=0; run;
That didn't work as well, if you can see the "contacted2" column will only have 1s, which is not the expected outcome.
The thing that worked for me was:
data temp;
set have1;
by id notsorted;
dif2=dif(callDate);
if first.contacted or first.id or dif2>1 or dif2<0 then group2+1;
if dif2=2 then group2=group2-1;
run;
data want;
set temp;
/* drop diff group contacted;*/
by group2;
if last.group2 and contacted=1 then contacted2=1;
else contacted2=0;
run;
I keep having a similar issue on another data where I need to perform the same thing.
The data is this, where PAY2 column is the expected outcome.
data Have2; input ID callDate :ddmmyy. pay pay2; format callDate ddmmyy10.; datalines; 011 29/12/2020 1 1 011 04/01/2021 0 0 011 25/01/2021 1 0 011 27/01/2021 1 1 011 29/01/2021 0 0 ; run;
With this code:
data temp;
set have2;
by agreement_number pay notsorted;
dif5=dif(dialleddate);
if first.pay or first.id or dif5>1 or dif5<0 then group5+1;
run;
data want;
set temp;
by group5 notsorted;
if last.group5 and Pay=1 then Pay2=1;
else Pay2=0;
run;
The outcome I am getting is:
data want;
input ID callDate :ddmmyy. pay dif5 group5 pay2;
format callDate ddmmyy10.;
datalines;
11 29/12/2020 1 . 1 1
11 04/01/2021 0 6 2 0
11 25/01/2021 1 21 3 1
11 27/01/2021 1 2 4 1
11 29/01/2021 0 2 5 0
;
run;
As you can see, nothing is changed. I tried a few things (including what I did before) but couldn't figure it out.
The code you show
data temp;
set have2;
by agreement_number pay notsorted;
dif5=dif(dialleddate);
if first.pay or first.id or dif5>1 or dif5<0 then group5+1;
run;
does a test for first.id. But there is no such variable. So if you expected first.id it to be part of a valid test, it didn't happen.
Did you look at your log? The above should have produced a NOTE..
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.