Re: Find first observation for which all subsequent observations have ...

chuakp · Posted 02-03-2020 04:02 PM

I have a claims dataset that looks like this:

person_id claim_date claim_id payment

1 01/01/2019 100 50

1 01/03/2019 101 40

1 02/09/2019 102 0

1 12/20/2019 103 0

2 10/20/2019 201 50

2 10/21/2019 202 40

2 11/30/2019 203 30

3 04/02/2019 301 20

3 04/05/2019 302 20

3 05/20/2019 304 0

3 12/30/2019 305 0

3 12/31/2019 306 50

Note that there is only one claim allowed per date in this dataset, the dataset is already sorted by claim_date, and the dataset only contains claims from 2019.

The goal is to identify the first claim_date, if any, for which both of the following are true: 1) payment equals 0; and 2) all subsequent observations for the person_id also had payment equal to 0. So for person_id = 1, the date would be 02/09/2019; for person_id = 2, no such date exists because all of their claims had non-zero payments; and for person_id = 3, no such date exists because of the positive payment on 12/31/2019.

It's straightforward to identify the first claim-date with payment = 0 for each person_id (the first criteria), but I'm struggling with how to operationalize the second criteria. I'd appreciate any help the community can provide.

novinosrin · Posted 02-03-2020 04:40 PM

HI @chuakp Please post the expected output for the sample too, ty! I mean in other words what the output should look like?

Astounding · Posted 02-03-2020 06:04 PM

How about:

data want;
   set have;
   by person_id claim_date;
   if first.person_id or not missing(payment) then first_0 = .;
   retain first_0;
   if first_0=. and payment = 0 then first_0 = claim_date;
   keep person_id first_0;
   if last.person_id and not missing(first_0);
run;

It's untested, but looks about right.

chuakp · Posted 02-03-2020 08:29 PM

Thank you. I believe I have to add this code

if payment NE 0 then first_0 = .;

This way, the program finds the first instance of a payment = 0 for a person_id. The program looks at the next observation and determines if payment = 0. If so, then first_0 stays the same. If not, that means that first_0 cannot be the first date in 2019 in which all subsequent claims have zero payments, so we set first_0 back to missing. Without this piece of code, I believe the program would just output the last observation in which payment equaled zero; if there is no such observation, first_0 will be missing.

data want;
   set have;
   by person_id claim_date;
   if first.person_id or not missing(payment) then first_0 = .;
   retain first_0;
   if first_0=. and payment = 0 then first_0 = claim_date;
   if payment NE 0 then first_0 = .;
   keep person_id first_0;
   if last.person_id and not missing(first_0);
run;

Astounding · Posted 02-03-2020 09:31 PM

You're right. I was thinking that, but wrote it incorrectly here:

   if first.person_id or not missing(payment) then first_0 = .;

Should have been:

   if first.person_id or payment ne 0 then first_0 = .;

novinosrin · Posted 02-03-2020 06:05 PM

1. Assuming you want a filtered output to subset only the record that satisfies the stated condition .

You could:


data have;
input person_id   claim_date :mmddyy10.    claim_id    payment;
format  claim_date mmddyy10. ;
cards;
1                 01/01/2019   100            50
1                 01/03/2019   101            40
1                 02/09/2019   102            0
1                 12/20/2019   103            0
2                 10/20/2019   201            50
2                 10/21/2019   202            40
2                 11/30/2019   203            30
3                 04/02/2019   301            20
3                 04/05/2019   302            20
3                 05/20/2019   304            0
3                 12/30/2019   305            0
3                 12/31/2019   306            50
;
data want;
 do _n_=1 by 1until(last.person_id);
  set have;
  by person_id;
  if t1 then do; 
   t2=sumabs(payment,t2);
   continue;
  end;
  if payment=0 then t1=claim_date;
 end;
 do _n_=1 to _n_;
  set have;
  by person_id;
  if not t2 and claim_date=t1 then output;
 end;
 drop t:;
run;

2. Assuming you want a FLAG that identified that particular record as 1 and other as 0's

data want;
 do _n_=1 by 1until(last.person_id);
  set have;
  by person_id;
  if t1 then do; 
   t2=sumabs(payment,t2);
   continue;
  end;
  if payment=0 then t1=claim_date;
 end;
 do _n_=1 to _n_;
  set have;
  by person_id;
  Flag= not t2 and claim_date=t1;
  output;
 end;
 drop t:;
run;

If both the assumptions of your expected output isn't what it is, you posting the expected output will greatly help. Ty!

chuakp · Posted 02-03-2020 08:31 PM

Thanks - I tried this, but it didn't quite work. To answer your question, I am looking for an output dataset with one row per person_id, a column for person_id, and a column for the first date on which all claims had zero payment and on which all subsequent claims had zero payment.

Patrick · Posted 02-03-2020 09:23 PM

@chuakp wrote:

Thanks - I tried this, but it didn't quite work. To answer your question, I am looking for an output dataset with one row per person_id, a column for person_id, and a column for the first date on which all claims had zero payment and on which all subsequent claims had zero payment.

data have;
  input person_id claim_date :mmddyy10. claim_id payment;
  format claim_date mmddyy10.;
  cards;
1 01/01/2019 100 50
1 01/03/2019 101 40
1 02/09/2019 102 0
1 12/20/2019 103 0
2 10/20/2019 201 50
2 10/21/2019 202 40
2 11/30/2019 203 30
3 04/02/2019 301 20
3 04/05/2019 302 20
3 05/20/2019 304 0
3 12/30/2019 305 0
3 12/31/2019 306 50
;

proc sql;
/*  create table want as*/
    select l.person_id, min(l.claim_date) as claim_date format=mmddyy10.
    from have l
    where 
      payment=0
      and not exists
        ( select * 
          from have i
          where 
            i.person_id=l.person_id 
            and i.payment>0
            and i.claim_date > l.claim_date
        )
    group by 1
  ;
quit;

mkeintz · Posted 02-03-2020 11:32 PM

You have to go through each person_id twice - once to review only the non-zero payments, keeping the last date of this subgroup (last_non_zero). The second time reads ALL the records, setting DUMMY=1 only for the case in which current payment=0 and the single record lag of date matches last_non_zero:

data want (drop=_:);
  set have (where=(payment^=0) in=non_zero_payment)
      have (in=second_pass);
  by person_id;
  if non_zero_payment=1 then _last_non_zero=claim_date;
  else if first.person_id then _last_non_zero=.;
  retain _last_non_zero;
  if second_pass;
  dummy=ifn(payment=0 and lag(claim_date)=_last_non_zero,1,0);
run;

The SET statement reads each person_id twice. The first pass reads only non-zero payment and the second reads all records.
The _last_non_zero date variable is updated with each non-zero payment. Since _last_non_zero is a retained variable, then it is reset to missing whenever a person_id has ONLY zero payments.
The "if second_pass" is a subsetting if.
The DUMMY= expression tests for current payment=0 and lag of date matching the _last_non_zero date.

Edited addition: This program assume data are sorted by person_id/claim_date, but it purposely only used "BY PERSON_ID" in the by-statement.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Find first observation for which all subsequent observations have a particular variable value

Re: Find first observation for which all subsequent observations have a particular variable value

Re: Find first observation for which all subsequent observations have a particular variable value

Re: Find first observation for which all subsequent observations have a particular variable value

Re: Find first observation for which all subsequent observations have a particular variable value

Re: Find first observation for which all subsequent observations have a particular variable value

Re: Find first observation for which all subsequent observations have a particular variable value

Re: Find first observation for which all subsequent observations have a particular variable value

Re: Find first observation for which all subsequent observations have a particular variable value

Classroom Training Available!