How to find irrelevant rows and delete them?

IgorR · Posted 02-26-2024 03:40 AM

Hi,
I have Data about PHI claims. that contain next variables:
ID - Id of the client,
Event_Date - The date when claim started,
QTR - the quarter of the Event_date,
Payment_To - The date the claim was paud until,
Monthes_To_Pay - how much monthes the claim survived.

This data is pretty dirty, because there is not always inpyted the correct evemt_day, sometimes this date is wrong or fictive.
I need to clean this up.

Here is the example of data:

Data Have;
Input ID Event_Date QTR Payment_To Monthes_To_Pay ;
datalines;
251595 23/03/2011 01/03/2011 21/09/2011 6
277681 20/06/2004 01/06/2004 30/04/2005 10
277681 15/12/2004 01/12/2004 28/02/2005 2
283044 21/11/2004 01/12/2004 31/07/2006 20
285726 17/01/2008 01/03/2008 31/01/2009 12
311449 01/11/2004 01/12/2004 31/12/2005 13
311449 03/02/2005 01/03/2005 31/07/2005 5
253333 07/11/2005 01/12/2005 31/10/2010 59
253333 12/12/2007 01/12/2007 30/09/2011 45

;
run;

As we can see For IDs 277681 and 311449 there are two rows for each.
the event_date of the second row of each ID is within payment period and Payment_to as well, and so we conclude that event date in the second row is wrong.
I need to find out such rows and delete them because they are not relevant to my data.

For ID 253333 there is another story - we have two event dates, where second date within payment period of first event.
For such rows I need to keep only the first event(it happened first) and assighn the latest Payment_to as it is continious claim.

Want:
ID |Event_Date |QTR |Payment_To |Monthes_To_Pay
-------|---------------|---------------|---------------|--------------
251595 |23/03/2011 |01/03/2011 |21/09/2011 |6
277681 |20/06/2004 |01/06/2004 |30/04/2005 |10
283044 |21/11/2004 |01/12/2004 |31/07/2006 |20
285726 |17/01/2008 |01/03/2008 |31/01/2009 |12
311449 |01/11/2004 |01/12/2004 |31/12/2005 |13
253333 |07/11/2005 |01/12/2005 |30/09/2011 |70

The full table contain up to 100,000 rows.

Any help how can I handele these two scenarios?
Thanks in advance.

JosvanderVelden · Posted 02-26-2024 06:42 AM

I don't fully understand the logic for the 2 scenarios. Please provide additional details. In the 2 scenarios you describe there are always 2 ID(rows). Is that the limit or could there be 3 or more rows with the same ID?
Is the full table sorted in any way? If yes, how?
Would it be OK to sort the dataset have by ID and Event_Date before eliminating rows?

ballardw · Posted 02-26-2024 01:15 PM

You should provide example data in a data step that runs correctly.

Example of running your code:

138  Data Have;
139  Input ID Event_Date QTR Payment_To Monthes_To_Pay ;
140  datalines;

NOTE: Invalid data for Event_Date in line 141 8-17.
NOTE: Invalid data for QTR in line 141 19-28.
NOTE: Invalid data for Payment_To in line 141 30-39.
RULE:      ----+----1----+----2----+----3----+----4----+----5----+----6----
141        251595 23/03/2011 01/03/2011 21/09/2011 6
ID=251595 Event_Date=. QTR=. Payment_To=. Monthes_To_Pay=6 _ERROR_=1 _N_=1
NOTE: Invalid data for Event_Date in line 142 8-17.
NOTE: Invalid data for QTR in line 142 19-28.
NOTE: Invalid data for Payment_To in line 142 30-39.
142        277681 20/06/2004 01/06/2004 30/04/2005 10

Your input statement attempts to read dates a simple numbers erroneously.

If your values are actual dates then you should include an informat to read them properly and format to display them as needed.

Data Have;
Input ID Event_Date :ddmmyy10. QTR :ddmmyy10.  Payment_To :ddmmyy10. Monthes_To_Pay ;
format Event_date QTR Payment_to ddmmyy10. ;
datalines;
251595 23/03/2011 01/03/2011 21/09/2011 6
277681 20/06/2004 01/06/2004 30/04/2005 10
277681 15/12/2004 01/12/2004 28/02/2005 2
283044 21/11/2004 01/12/2004 31/07/2006 20
285726 17/01/2008 01/03/2008 31/01/2009 12
311449 01/11/2004 01/12/2004 31/12/2005 13
311449 03/02/2005 01/03/2005 31/07/2005 5
253333 07/11/2005 01/12/2005 31/10/2010 59
253333 12/12/2007 01/12/2007 30/09/2011 45
;
run;

Which does unfortunately bring up the question of your variable values. Are they actually dates or strings?

mkeintz · Posted 02-26-2024 01:17 PM

Let's make sure I fully understand your request. Here is what I think you mean

If the payment_to date of the current obs follows the event_date of the next obs, then if neccessary, update the payment_to_date (and monthestopay) in the current obs. "If neccessary mean if the next payment_to date comes after the current payment_to date. You'll see the code for this using the data from the MERGE statement.
If the current event_date precedes the previous payment_to date, then delete.

The code below:

Assumes all obs for a given ID to be in a single group.
Obs within each ID to be sorted by EVENT_DATE.
Does not accommodate more than 2 consecutive overlapping observations.

data have;
  input ID Event_Date QTR Payment_To Monthes_To_Pay ;
  informat event_date qtr payment_to ddmmyy10.;
  format  event_date qtr payment_to date9.  ddmmyy10. ;
datalines;
251595 23/03/2011 01/03/2011 21/09/2011 6
277681 20/06/2004 01/06/2004 30/04/2005 10
277681 15/12/2004 01/12/2004 28/02/2005 2
283044 21/11/2004 01/12/2004 31/07/2006 20
285726 17/01/2008 01/03/2008 31/01/2009 12
311449 01/11/2004 01/12/2004 31/12/2005 13
311449 03/02/2005 01/03/2005 31/07/2005 5
253333 07/11/2005 01/12/2005 31/10/2010 59
253333 12/12/2007 01/12/2007 30/09/2011 45
run;


data want;
  set have (keep=id);
  by id notsorted;
  merge /*Current obs */ have 
        /*Next obs    */ have  (firstobs=2 keep=event_date payment_to 
                             rename=(event_date=nxt_ev payment_to=nxt_pt));

  if last.id=0 and nxt_ev <= payment_to then do;  /*Check whether payment_to should be extended?*/
    payment_to=max(payment_to,nxt_pt);
    monthes_to_pay=intck('month',event_date,payment_to);
  end;

  if first.id=0 and event_date<=lag(payment_to) then delete;
run;

BTW, thanks for trying to provide sample data in the form of a DATA step. But please make sure the DATA step works, by testing your code before submitting. When I ran your sample code, it generated missing values for all the date variables, due to the absence informat specifications for the INPUT ting the date variables. I've put the informats in my code above.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

mkeintz · Posted 02-26-2024 10:11 PM

I have added a final row to your sample data set. As a result the id 253333 has three consecutive obs that overlap to make a single date range. This code will accommodate that situation:

data have;
  input ID Event_Date QTR Payment_To Monthes_To_Pay ;
  informat event_date qtr payment_to ddmmyy10.;
  format  event_date qtr payment_to ddmmyy10. ;
datalines;
251595 23/03/2011 01/03/2011 21/09/2011 6
277681 20/06/2004 01/06/2004 30/04/2005 10
277681 15/12/2004 01/12/2004 28/02/2005 2
283044 21/11/2004 01/12/2004 31/07/2006 20
285726 17/01/2008 01/03/2008 31/01/2009 12
311449 01/11/2004 01/12/2004 31/12/2005 13
311449 03/02/2005 01/03/2005 31/07/2005 5
253333 07/11/2005 01/12/2005 31/10/2010 59
253333 12/12/2007 01/12/2007 30/09/2011 45
253333 12/12/2008 01/12/2008 30/09/2012 45
run;

data want (drop=i n nxt_: extended_:);
  /*Read all overlaps and generate extended payment_to */
  do n=1 by 1 until (last.id or nxt_ev>extended_pt);
    set have (keep=id);
    by id notsorted;
    merge /*Current obs */ have 
          /*Next obs    */ have  (firstobs=2 keep=event_date payment_to 
                           rename=(event_date=nxt_ev payment_to=nxt_pt)) ;
    if extended_pt=. then extended_pt=payment_to;
    if last.id=0 and nxt_ev<=extended_pt then extended_pt=max(extended_pt,nxt_pt);
  end;
  /*Reread the same N obs, output only the first, but with the extended payment_to */
  do i=1 to n;
    set have;
    if i=1 then do;
      payment_to=extended_pt;
      monthes_to_pay=intck('month',event_date,payment_to);
      output;
    end;
  end;
run;

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

How to find irrelevant rows and delete them?

Re: How to find irrelevant rows and delete them?

Re: How to find irrelevant rows and delete them?

Re: How to find irrelevant rows and delete them?

Re: How to find irrelevant rows and delete them?

How to find irrelevant rows and delete them?

Re: How to find irrelevant rows and delete them?

Re: How to find irrelevant rows and delete them?

Re: How to find irrelevant rows and delete them?

Re: How to find irrelevant rows and delete them?

Click image to register for webinar

Classroom Training Available!