I have a claims dataset that looks like this:
person_id claim_date claim_id payment
1 01/01/2019 100 50
1 01/03/2019 101 40
1 02/09/2019 102 0
1 12/20/2019 103 0
2 10/20/2019 201 50
2 10/21/2019 202 40
2 11/30/2019 203 30
3 04/02/2019 301 20
3 04/05/2019 302 20
3 05/20/2019 304 0
3 12/30/2019 305 0
3 12/31/2019 306 50
Note that there is only one claim allowed per date in this dataset, the dataset is already sorted by claim_date, and the dataset only contains claims from 2019.
The goal is to identify the first claim_date, if any, for which both of the following are true: 1) payment equals 0; and 2) all subsequent observations for the person_id also had payment equal to 0. So for person_id = 1, the date would be 02/09/2019; for person_id = 2, no such date exists because all of their claims had non-zero payments; and for person_id = 3, no such date exists because of the positive payment on 12/31/2019.
It's straightforward to identify the first claim-date with payment = 0 for each person_id (the first criteria), but I'm struggling with how to operationalize the second criteria. I'd appreciate any help the community can provide.
HI @chuakp Please post the expected output for the sample too, ty! I mean in other words what the output should look like?
How about:
data want;
set have;
by person_id claim_date;
if first.person_id or not missing(payment) then first_0 = .;
retain first_0;
if first_0=. and payment = 0 then first_0 = claim_date;
keep person_id first_0;
if last.person_id and not missing(first_0);
run;
It's untested, but looks about right.
Thank you. I believe I have to add this code
if payment NE 0 then first_0 = .;
This way, the program finds the first instance of a payment = 0 for a person_id. The program looks at the next observation and determines if payment = 0. If so, then first_0 stays the same. If not, that means that first_0 cannot be the first date in 2019 in which all subsequent claims have zero payments, so we set first_0 back to missing. Without this piece of code, I believe the program would just output the last observation in which payment equaled zero; if there is no such observation, first_0 will be missing.
data want;
set have;
by person_id claim_date;
if first.person_id or not missing(payment) then first_0 = .;
retain first_0;
if first_0=. and payment = 0 then first_0 = claim_date;
if payment NE 0 then first_0 = .;
keep person_id first_0;
if last.person_id and not missing(first_0);
run;
You're right. I was thinking that, but wrote it incorrectly here:
if first.person_id or not missing(payment) then first_0 = .;
Should have been:
if first.person_id or payment ne 0 then first_0 = .;
1. Assuming you want a filtered output to subset only the record that satisfies the stated condition .
You could:
data have;
input person_id claim_date :mmddyy10. claim_id payment;
format claim_date mmddyy10. ;
cards;
1 01/01/2019 100 50
1 01/03/2019 101 40
1 02/09/2019 102 0
1 12/20/2019 103 0
2 10/20/2019 201 50
2 10/21/2019 202 40
2 11/30/2019 203 30
3 04/02/2019 301 20
3 04/05/2019 302 20
3 05/20/2019 304 0
3 12/30/2019 305 0
3 12/31/2019 306 50
;
data want;
do _n_=1 by 1until(last.person_id);
set have;
by person_id;
if t1 then do;
t2=sumabs(payment,t2);
continue;
end;
if payment=0 then t1=claim_date;
end;
do _n_=1 to _n_;
set have;
by person_id;
if not t2 and claim_date=t1 then output;
end;
drop t:;
run;
2. Assuming you want a FLAG that identified that particular record as 1 and other as 0's
data want;
do _n_=1 by 1until(last.person_id);
set have;
by person_id;
if t1 then do;
t2=sumabs(payment,t2);
continue;
end;
if payment=0 then t1=claim_date;
end;
do _n_=1 to _n_;
set have;
by person_id;
Flag= not t2 and claim_date=t1;
output;
end;
drop t:;
run;
If both the assumptions of your expected output isn't what it is, you posting the expected output will greatly help. Ty!
Thanks - I tried this, but it didn't quite work. To answer your question, I am looking for an output dataset with one row per person_id, a column for person_id, and a column for the first date on which all claims had zero payment and on which all subsequent claims had zero payment.
@chuakp wrote:
Thanks - I tried this, but it didn't quite work. To answer your question, I am looking for an output dataset with one row per person_id, a column for person_id, and a column for the first date on which all claims had zero payment and on which all subsequent claims had zero payment.
data have;
input person_id claim_date :mmddyy10. claim_id payment;
format claim_date mmddyy10.;
cards;
1 01/01/2019 100 50
1 01/03/2019 101 40
1 02/09/2019 102 0
1 12/20/2019 103 0
2 10/20/2019 201 50
2 10/21/2019 202 40
2 11/30/2019 203 30
3 04/02/2019 301 20
3 04/05/2019 302 20
3 05/20/2019 304 0
3 12/30/2019 305 0
3 12/31/2019 306 50
;
proc sql;
/* create table want as*/
select l.person_id, min(l.claim_date) as claim_date format=mmddyy10.
from have l
where
payment=0
and not exists
( select *
from have i
where
i.person_id=l.person_id
and i.payment>0
and i.claim_date > l.claim_date
)
group by 1
;
quit;
You have to go through each person_id twice - once to review only the non-zero payments, keeping the last date of this subgroup (last_non_zero). The second time reads ALL the records, setting DUMMY=1 only for the case in which current payment=0 and the single record lag of date matches last_non_zero:
data want (drop=_:);
set have (where=(payment^=0) in=non_zero_payment)
have (in=second_pass);
by person_id;
if non_zero_payment=1 then _last_non_zero=claim_date;
else if first.person_id then _last_non_zero=.;
retain _last_non_zero;
if second_pass;
dummy=ifn(payment=0 and lag(claim_date)=_last_non_zero,1,0);
run;
Edited addition: This program assume data are sorted by person_id/claim_date, but it purposely only used "BY PERSON_ID" in the by-statement.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.