Solved: how to delete rows with duplicate data and create new rows

varun11494 · Posted 11-25-2016 02:28 AM

how do i remove the row having the same data and also create new rows by splitting
for example:insurance policy data
Policyno. start_date end_date difference in dates
1 10mar2016 09mar2017 1year

1 10mar2016 09mar2017 1year

2 01apr2016 31jan2018 1year 10months

3 05may2016 04may2017 1 year

3 05may2017 04may2018 1 year

what i want:

1)remove the field having the same data for eg of policyno. 1.I want only one field with unique data.
2)the difference should be not more than one year,if so then split the data into 2 rows or 3 rows depending on the difference for eg
   in policy no.2 difference is 1 year 10 months i want it to be split into 2 rows.
  Policyno.   start_date      end_date       difference in dates
     2           01apr2016   31mar2017         1year

2 01apr2017 31jan2018 10 months(which is less than 1 year)

thankyou in advance

mkeintz · Posted 11-25-2016 12:51 PM

This is a good problem for using an OUTPUT statement inside a do loop, and also as an example of a do loop defined in this structure:

do x=1 by 1 while (some other condition).

Notes:

For multiyear policies, the program below constantly adjusts START_DATE and END_DATE until START_DATE exceeds the original end_date (e_date in the program).
This program assumes no policy exceed 10 years ("N=1 to 10").
The intck function uses the "continuous" method argument, which means it counts date "boudaries" based on the first date argument, not on the normal calendar month boundary, nor on a user-specified "boundary-shift" parameter. Great for getting anniveraries.
The CATX function concatenates the 2nd, 3rd, etc arguments, and uses the first argument as a separator.

data want (drop=e_date n);
  set have;
  by policyno start_date end_date;
  if first.end_date;

  e_date=end_date;
  do N=1 to 10 while (start_date < e_date);
    end_date=min(e_date,intnx('year',start_date,1,'s')-1);
    length date_dif $20;
    date_dif='One Year';
    if end_date<intnx('year',start_date,1,'s')-1 then 
      date_dif=catx(' ',intck('month',start_date,end_date+1,'continuous'),'months');
    output;
    start_date=end_date+1;
  end;
run;

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

PeterClemmensen · Posted 11-25-2016 02:42 AM

What if there were, say 3 years and 4 months between the two dates? 🙂

Would you then want 1 row saying 3years and 1 row saying 4months or 3 rows saying 1year and 1 saying 4 months?

The DATA to DATA Step Macro
Blog: SASnrd

varun11494 · Posted 11-25-2016 03:04 AM

4 entries
With 3 entries of 1 year each and the last entry of the remaining 4 months

Basically the difference of dates should not be greater than 1 year and at
the same time should be unique. Like in this case if policy starts on
1apr2010 to 31july13 i.e 3 years and 4 months. I Want it to be shown like
Policyno. Start date end date diff
X 01apr2010 31mar2011 1year
X 01apr2011 31mar2012 1year
X 01apr2012 31mar2013 1year
X 01apr2013 31july2013 4m

##- Please type your reply above this line. Simple formatting, no
attachments. -##

PeterClemmensen · Posted 11-25-2016 02:57 AM

And how do you define the 1 year between the two dates? Because if you use the YRDIF function which I would usually do in this situation, the difference between the 10mar2016 and 09mar2017 is barely 1 year. See below 🙂

data have;
input Policyno start_date :date9. end_date :date9.;
format start_date end_date date9.;
datalines;
1 10mar2016 09mar2017
1 10mar2016 09mar2017
2 01apr2016 31jan2018
3 05may2016 04may2017
3 05may2017 04may2018
;

data test;
   set have;
   difference_in_years = yrdif(start_date, end_date);
run;

proc print data = test;
run;

The DATA to DATA Step Macro
Blog: SASnrd

LinusH · Posted 11-25-2016 03:15 AM

For 1) PROC SORT w NODUPKEY should do the trick.

For 2), use IF to test the values of your start/end variables, and conditionally do explicit OUTPUT, and assign new values to start/end.

Data never sleeps

varun11494 · Posted 11-25-2016 05:05 AM

THANKYOU FOR YOUR TIME.
I SOLVED THE FIRST PROBLEM BUT COULD YOU HELP ME WITH THE SECOND BY TAKING POLICYNO. 2 AS EXAMPLE.
HOW TO SPLIT IT INTO DUPLICATE ENTRIES WITH DIFFERENT START DATE WITH MAXIMUM GAP OF ONE YEAR

PeterClemmensen · Posted 11-25-2016 05:10 AM

@varun11494 how do you define a years difference? 🙂

The DATA to DATA Step Macro
Blog: SASnrd

varun11494 · Posted 11-25-2016 05:21 AM

thankyou for your time
364 or 365 days
suppose policy start date is 03mar2015 so its end date should be 02mar2016 and the remaining days should be in next entries following the same pattern.
so what i have started is i found out the difference in dates and approximated it to next integer value so if difference is 1.2 or 1.5 or 1.8 years it will show 2 years as difference.from there i want to create duplicate number of rows row matching the difference(approximated to next integer).what i want is just like in this case the difference is 2 so 2 enteries shall create for it.
if the difference would be 3 so it shall automatically create 3 duplicate entries with shift in dates.

Ksharp · Posted 11-25-2016 06:32 AM


data have;
input Policyno   (start_date      end_date) (: date9.) ;
format start_date end_date date9.;
cards;
     1           10mar2016   09mar2017         1year
     1           10mar2016   09mar2017          1year
     2           01apr2016   31jan2018             1year 10months
     3           05may2016  04may2017          1 year
     3           05may2017  04may2018   
;
run;
data temp;
 set have;
 by policyno start_date end_date;
 if first.end_date;
run;
data want;
 set temp;
 yr=yrdif(start_date,end_date);
 _date1=start_date;_date2=end_date;
 do i=1 to int(yr);
  start_date=_date1 ; end_date=intnx('year',_date1,1,'s')-1;
  output;
  _date1=end_date+1;

 end;
 if mod(yr,1) ne 0 then do;
  start_date=_date1;end_date=_date2;
  output;
 end;
 drop _: i yr;
run;

mkeintz · Posted 11-25-2016 12:51 PM

This is a good problem for using an OUTPUT statement inside a do loop, and also as an example of a do loop defined in this structure:

do x=1 by 1 while (some other condition).

Notes:

For multiyear policies, the program below constantly adjusts START_DATE and END_DATE until START_DATE exceeds the original end_date (e_date in the program).
This program assumes no policy exceed 10 years ("N=1 to 10").
The intck function uses the "continuous" method argument, which means it counts date "boudaries" based on the first date argument, not on the normal calendar month boundary, nor on a user-specified "boundary-shift" parameter. Great for getting anniveraries.
The CATX function concatenates the 2nd, 3rd, etc arguments, and uses the first argument as a separator.

data want (drop=e_date n);
  set have;
  by policyno start_date end_date;
  if first.end_date;

  e_date=end_date;
  do N=1 to 10 while (start_date < e_date);
    end_date=min(e_date,intnx('year',start_date,1,'s')-1);
    length date_dif $20;
    date_dif='One Year';
    if end_date<intnx('year',start_date,1,'s')-1 then 
      date_dif=catx(' ',intck('month',start_date,end_date+1,'continuous'),'months');
    output;
    start_date=end_date+1;
  end;
run;

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

varun11494 · Posted 11-28-2016 02:20 AM

Hi
your coding worked

Can you please explain me step by step how his worked.It would be helpful for me to understand
datalines..

s.no policy no start_date end_date

1 32 04apr2003 03apr2004

2 32 04apr2003 03apr2004

Thankyou for your time

mkeintz · Posted 11-28-2016 07:02 AM

The begining of the program has

set have;

by policyno start_date end_date;

if first.end_date;

The BY statement tells SAS to (1) expect the data to be pre-sorted by policyno start_date end_date, and (2) set automatic dummies for each of the by variables indicaiing whether the record in hand is the first record for a given by value (first.policyno, first.start_date, first.end_date) or the last (last.policyno etc.). And what happens to the first. dummies when policyno changes but the dates don't? The treatment of by values is hierarchical, so if a given by var changes setting its first. dummy, all lower order (to the right) first. dummies are also set - no matter what their value sequence is.

The third statement is a subsetting IF (look up sas "subsetting if") which tells sas to keep only records which are the first instance of a given policyno/start_date/end_date combination. As a result the DO group never processes duplicates.

If you want an explanation of how the DO group works, take a multiyear policy, run the do group. But inside the do group place a set of PUT statements to see what is happening to N, start_date, end_date, and date_dif:

e_date=end_date;

put 'A: ' e_date=yymmddn8.;

do N=1 to 10 while (start_date < e_date);

put / 'B: ' N= start_date= end_date=;
end_date=min(e_date,intnx('year',start_date,1,'s')-1);

   put 'C: ' end_date=;
    length date_dif $20;
    date_dif='One Year';
    if end_date<intnx('year',start_date,1,'s')-1 then
      date_dif=catx(' ',intck('month',start_date,end_date+1,'continuous'),'months');

      put 'D: ' date_dif=;
    output;
    start_date=end_date+1;

put 'E: ' start_date=;
end;

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

how to delete rows with duplicate data and create new rows

Re: how to delete rows with duplicate data and create new rows

Re: how to delete rows with duplicate data and create new rows

Re: how to delete rows with duplicate data and create new rows

Re: how to delete rows with duplicate data and create new rows

Re: how to delete rows with duplicate data and create new rows

Re: how to delete rows with duplicate data and create new rows

Re: how to delete rows with duplicate data and create new rows

Re: how to delete rows with duplicate data and create new rows

Re: how to delete rows with duplicate data and create new rows

Re: how to delete rows with duplicate data and create new rows

Re: how to delete rows with duplicate data and create new rows

Re: how to delete rows with duplicate data and create new rows

Registration is open

SAS Training: Just a Click Away