Solved: Re: Subset Data / Date Compare and Pick the Oldest

VarunD · Posted 01-24-2019 09:54 AM

Hello,

I want to subset the Input Table by comparing the Service Date of each Customer_ID.

If difference in Service Date of a Customer_ID is within 7 days, then select only the observation with oldest date.

If difference in Service Date is more than 7 days then select all the observations. Please see the Input and Output Tables below and help me with the Code.

Input Table
Customer_ID	Transaction_ID	Service Date
AB1	12345	10/12/18
AB1	12346	10/14/18
AB1	12347	10/15/18
AB2	22233	6/10/18
AB2	22234	9/12/18
AB3	35678	1/5/19

Output Table
Customer_ID	Transaction_ID	Service Date
AB1	12345	10/12/18
AB2	22233	6/10/18
AB2	22234	9/12/18
AB3	35678	1/5/19

Thanks,
Varun

mkeintz · Posted 01-24-2019 10:42 AM

You want to drop a transaction that is within 7 days of the prior transaction (presumably you want this date interval honored only WITHIN id's, not BETWEEN id's - that's @novinosrin's question):

data have;
  input Customer_ID :$3. Transaction_ID Service_Date  :mmddyy8.;
  format service_date date9.;
datalines;;
AB1 12345 10/12/18 
AB1 12346 10/14/18 
AB1 12347 10/15/18 
AB2 22233 6/10/18 
AB2 22234 9/12/18 
AB3 35678 1/5/19 
run;
data want (drop=last_output_date);
  set have ;
  by customer_id;
  if first.customer_id
    or service_date-7 > last_output_date;
  output;
  last_output_date=service_date;
  retain last_output_date;
run;

This program creates a variable LAST_OUTPUT_DATE, preserving the most recent date that passes your criterion. It is RETAINed across observations (not automatically reset to missing at top of data step), making it possible for you to compare the observation in hand to some previously generated value. This is necessary because you can't just compare one date to the immediately prior date - instead you may have to compare it to a date further back than the prior date. That's the situation for id AB1.

The IF statement above has no THEN clause. Therefore it is a subsetting if. Only observations meeting the if condition are processed further.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

novinosrin · Posted 01-24-2019 10:04 AM

I would like you to explain

What if dif<=7 in some transactions and dif>7 in other transactions for the same customer_id?

Customer_ID	Transaction_ID	ServiceDate	dif
AB1	12345	10/12/2018	.
AB1	12346	10/14/2018	2
AB1	12347	10/15/2018	1
AB2	22233	06/10/2018	.
AB2	22234	09/12/2018	94
AB3	35678	01/05/2019	.

VarunD · Posted 01-24-2019 10:51 AM

That's is a great question. I did not think of that before.

Thanks for asking.

If diff is greater than 7 than I would like the Service Date to be compared with the last selected observation and perform the same logic.

Here, I have modified the Input and Output example. Hope this will make it clear and it's the same thing you were asking :

Input Table
Customer_ID	Transaction_ID	Service Date	Diff
AB1	12345	10/12/18	-
AB1	12346	10/14/18	2
AB1	12347	10/15/18	1
AB2	22233	6/10/18	-
AB2	22234	9/12/18	94
AB2	22356	9/15/18	3
AB3	35678	6/6/18	-
AB3	35679	6/13/18	7
AB3	35680	6/15/18	9
AB3	35681	6/18/18	3

Output Table
Customer_ID	Transaction_ID	Service Date
AB1	12345	10/12/18
AB2	22233	6/10/18
AB2	22234	9/12/18
AB3	35678	6/6/18
AB3	35680	6/15/18

Thanks for your quick response,
Varun

SuryaKiran · Posted 01-24-2019 10:36 AM

Hello,

Please provide your sample data in the form of a datastep, this will help more number of people responding with spending their little time.

Check this:

data have;
infile datalines dlm='09'x dsd;
input Customer_ID $	Transaction_ID	Service_Date mmddyy8.;
format Service_Date mmddyy10.;
datalines;
AB1	12345	10/12/18
AB1	12346	10/14/18
AB1	12347	10/15/18
AB2	22233	6/10/18
AB2	22234	9/12/18
AB3	35678	1/5/19
;
run;

proc sort data=have;
by Customer_ID 	Transaction_ID	descending Service_Date;
run;

data have_dif;
set have;
by Customer_ID 	Transaction_ID	descending Service_Date;
dif_dt=dif(Service_Date);
if first.Customer_ID then dif_dt=0;
run;

proc sql;
select Customer_ID ,Transaction_ID,Service_Date
	from have_dif
	where customer_id in (select customer_id from have_dif where dif_dt>7)
union 
select Customer_ID ,Transaction_ID,Service_Date
	from have_dif
	where customer_id not in (select customer_id from have_dif where dif_dt>7)
group by Customer_ID
	having Service_Date=min(Service_Date)
order by customer_id
;
quit;

Thanks,
Suryakiran

VarunD · Posted 01-24-2019 11:58 AM

Thanks for your response! I will ensure that I provide data in datastep for next time.

mkeintz · Posted 01-24-2019 10:42 AM

You want to drop a transaction that is within 7 days of the prior transaction (presumably you want this date interval honored only WITHIN id's, not BETWEEN id's - that's @novinosrin's question):

data have;
  input Customer_ID :$3. Transaction_ID Service_Date  :mmddyy8.;
  format service_date date9.;
datalines;;
AB1 12345 10/12/18 
AB1 12346 10/14/18 
AB1 12347 10/15/18 
AB2 22233 6/10/18 
AB2 22234 9/12/18 
AB3 35678 1/5/19 
run;
data want (drop=last_output_date);
  set have ;
  by customer_id;
  if first.customer_id
    or service_date-7 > last_output_date;
  output;
  last_output_date=service_date;
  retain last_output_date;
run;

This program creates a variable LAST_OUTPUT_DATE, preserving the most recent date that passes your criterion. It is RETAINed across observations (not automatically reset to missing at top of data step), making it possible for you to compare the observation in hand to some previously generated value. This is necessary because you can't just compare one date to the immediately prior date - instead you may have to compare it to a date further back than the prior date. That's the situation for id AB1.

The IF statement above has no THEN clause. Therefore it is a subsetting if. Only observations meeting the if condition are processed further.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

VarunD · Posted 01-24-2019 11:56 AM

This seems to have worked. I will have to verify the output that might take some time. But this looks good. Thanks a lot !

Astounding · Posted 01-24-2019 10:44 AM

"help me with the code" can mean a lot of things, anything from:

I got an error that I don't know how to fix, to
I have no idea what to do, so write it for me

If I were writing the code, I would begin with a simple step. Assuming the data set is in sorted order by Customer_ID, and assuming that Service_Date is an actual SAS date and not a character string:

proc summary data=have;

by customer_ID;

var Service_Date;

output out=ranges (keep=customer_ID First_Date Last_Date) min=first_date max=last_date;

run;

This gives you a SAS data set with one observation per Customer_ID, with the first and last date of service for that customer. If you had such a SAS data set, would you have any idea of what you need to do next? Are you able to outline a plan (not necessarily write the program)?

VarunD · Posted 01-24-2019 11:18 AM

Thanks for you reply. I meant - I have no idea.
Service_Date is not actual SAS Date.
I am hoping to get more than one observation per Customer_ID.

novinosrin · Posted 01-24-2019 11:23 AM

Hi @VarunD I think @mkeintz comprehended your need pretty well. Play with his solution and see if that works. If I were to take a stab, might partially plagiarize on his idea as that datastep gives a beautiful start. If that solution requires tweaking and Mark happens to be busy, I will chime in.

mkeintz · Posted 01-24-2019 11:25 AM

@VarunD wrote:
...
Service_Date is not actual SAS Date.
...

But Service_Date becomes a SAS datea via the INPUT statement below, which tells sas to interpret that field with the MMDDYY8. inbound format. This means that sas converts the input field to a sas datea value (a value recording the number of days after or before 1/1/1960). You can make these numbers easy to visually interpret by assign a FORMAT to the variable, so that you can realize what the date is for that date value.

  input Customer_ID :$3. Transaction_ID Service_Date  :mmddyy8.;
  format service_date date9.;

Once this date value is constructed, then a subtraction between two values is the number of days by which they are separated.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away