DATA Step, Macro, Functions and more

how to delete rows with duplicate data and create new rows

Accepted Solution Solved
Reply
Occasional Contributor
Posts: 5
Accepted Solution

how to delete rows with duplicate data and create new rows

how do i remove the row having the same data and also create new rows by splitting
for example:insurance policy data
Policyno.   start_date      end_date       difference in dates
     1           10mar2016   09mar2017         1year

     1           10mar2016   09mar2017          1year

     2           01apr2016   31jan2018             1year 10months

     3           05may2016  04may2017          1 year

     3           05may2017  04may2018          1 year

what i want:

1)remove the field having the same data for eg of policyno. 1.I want only one field with unique data.
2)the difference should be not more than one year,if so then split the data into 2 rows or 3 rows depending on the difference for eg
   in policy no.2 difference is 1 year 10 months i want it to be split into 2 rows.
  Policyno.   start_date      end_date       difference in dates  
     2           01apr2016   31mar2017         1year

     2           01apr2017  31jan2018           10 months(which is less than 1 year)

 

thankyou in advance


Accepted Solutions
Solution
‎11-28-2016 02:15 AM
Valued Guide
Posts: 797

Re: how to delete rows with duplicate data and create new rows

This is a good problem for using an OUTPUT statement inside a do loop, and also as an example of a do loop defined in this structure:

 

   do x=1 by 1 while (some other condition).

 

Notes:

  1. For multiyear policies, the program below constantly adjusts START_DATE and END_DATE until START_DATE exceeds the original end_date (e_date in the program).
  2. This program assumes no policy exceed 10 years ("N=1 to 10").
  3. The intck function uses the "continuous" method argument, which means it counts date "boudaries" based on the first date argument, not on the normal calendar month boundary, nor on a user-specified "boundary-shift" parameter.  Great for getting anniveraries.
  4. The CATX function concatenates the 2nd, 3rd, etc arguments, and uses the first argument as a separator.

 

data want (drop=e_date n);
  set have;
  by policyno start_date end_date;
  if first.end_date;

  e_date=end_date;
  do N=1 to 10 while (start_date < e_date);
    end_date=min(e_date,intnx('year',start_date,1,'s')-1);
    length date_dif $20;
    date_dif='One Year';
    if end_date<intnx('year',start_date,1,'s')-1 then 
      date_dif=catx(' ',intck('month',start_date,end_date+1,'continuous'),'months');
    output;
    start_date=end_date+1;
  end;
run;

 

View solution in original post


All Replies
PROC Star
Posts: 549

Re: how to delete rows with duplicate data and create new rows

What if there were, say 3 years and 4 months between the two dates? Smiley Happy

 

Would you then want 1 row saying 3years and 1 row saying 4months or 3 rows saying 1year and 1 saying 4 months?

Occasional Contributor
Posts: 5

Re: how to delete rows with duplicate data and create new rows

4 entries
With 3 entries of 1 year each and the last entry of the remaining 4 months

Basically the difference of dates should not be greater than 1 year and at
the same time should be unique. Like in this case if policy starts on
1apr2010 to 31july13 i.e 3 years and 4 months. I Want it to be shown like
Policyno. Start date end date diff
X 01apr2010 31mar2011 1year
X 01apr2011 31mar2012 1year
X 01apr2012 31mar2013 1year
X 01apr2013 31july2013 4m

##- Please type your reply above this line. Simple formatting, no
attachments. -##
PROC Star
Posts: 549

Re: how to delete rows with duplicate data and create new rows

And how do you define the 1 year between the two dates? Because if you use the YRDIF function which I would usually do in this situation, the difference between the 10mar2016 and 09mar2017 is barely 1 year. See below Smiley Happy

 

data have;
input Policyno start_date :date9. end_date :date9.;
format start_date end_date date9.;
datalines;
1 10mar2016 09mar2017
1 10mar2016 09mar2017
2 01apr2016 31jan2018
3 05may2016 04may2017
3 05may2017 04may2018
;

data test;
   set have;
   difference_in_years = yrdif(start_date, end_date);
run;

proc print data = test;
run;
Super User
Posts: 5,255

Re: how to delete rows with duplicate data and create new rows

For 1) PROC SORT w NODUPKEY should do the trick.

For 2), use IF to test the values of your start/end variables, and conditionally do explicit OUTPUT, and assign new values to start/end.

Data never sleeps
Occasional Contributor
Posts: 5

Re: how to delete rows with duplicate data and create new rows

THANKYOU FOR YOUR TIME.
I SOLVED THE FIRST PROBLEM BUT COULD YOU HELP ME WITH THE SECOND BY TAKING POLICYNO. 2 AS EXAMPLE.
HOW TO SPLIT IT INTO DUPLICATE ENTRIES WITH DIFFERENT START DATE WITH MAXIMUM GAP OF ONE YEAR

PROC Star
Posts: 549

Re: how to delete rows with duplicate data and create new rows

@varun11494 how do you define a years difference? Smiley Happy

Occasional Contributor
Posts: 5

Re: how to delete rows with duplicate data and create new rows

thankyou for your time
364 or 365 days
suppose policy start date is 03mar2015 so its end date should be 02mar2016 and the remaining days should be in next entries following the same pattern.
so what i have started is i found out the difference in dates and approximated it to next integer value so if difference is 1.2 or 1.5 or 1.8 years it will show 2 years as difference.from there i want to create duplicate number of rows row matching the difference(approximated to next integer).what i want is just like in this case the difference is 2 so 2 enteries shall create for it.
if the difference would be 3 so it shall automatically create 3 duplicate entries with shift in dates.

 

Super User
Posts: 9,671

Re: how to delete rows with duplicate data and create new rows


data have;
input Policyno   (start_date      end_date) (: date9.) ;
format start_date end_date date9.;
cards;
     1           10mar2016   09mar2017         1year
     1           10mar2016   09mar2017          1year
     2           01apr2016   31jan2018             1year 10months
     3           05may2016  04may2017          1 year
     3           05may2017  04may2018   
;
run;
data temp;
 set have;
 by policyno start_date end_date;
 if first.end_date;
run;
data want;
 set temp;
 yr=yrdif(start_date,end_date);
 _date1=start_date;_date2=end_date;
 do i=1 to int(yr);
  start_date=_date1 ; end_date=intnx('year',_date1,1,'s')-1;
  output;
  _date1=end_date+1;

 end;
 if mod(yr,1) ne 0 then do;
  start_date=_date1;end_date=_date2;
  output;
 end;
 drop _: i yr;
run;


Solution
‎11-28-2016 02:15 AM
Valued Guide
Posts: 797

Re: how to delete rows with duplicate data and create new rows

This is a good problem for using an OUTPUT statement inside a do loop, and also as an example of a do loop defined in this structure:

 

   do x=1 by 1 while (some other condition).

 

Notes:

  1. For multiyear policies, the program below constantly adjusts START_DATE and END_DATE until START_DATE exceeds the original end_date (e_date in the program).
  2. This program assumes no policy exceed 10 years ("N=1 to 10").
  3. The intck function uses the "continuous" method argument, which means it counts date "boudaries" based on the first date argument, not on the normal calendar month boundary, nor on a user-specified "boundary-shift" parameter.  Great for getting anniveraries.
  4. The CATX function concatenates the 2nd, 3rd, etc arguments, and uses the first argument as a separator.

 

data want (drop=e_date n);
  set have;
  by policyno start_date end_date;
  if first.end_date;

  e_date=end_date;
  do N=1 to 10 while (start_date < e_date);
    end_date=min(e_date,intnx('year',start_date,1,'s')-1);
    length date_dif $20;
    date_dif='One Year';
    if end_date<intnx('year',start_date,1,'s')-1 then 
      date_dif=catx(' ',intck('month',start_date,end_date+1,'continuous'),'months');
    output;
    start_date=end_date+1;
  end;
run;

 

Occasional Contributor
Posts: 5

Re: how to delete rows with duplicate data and create new rows

Hi
your coding worked

Can you please explain me step by step how his worked.It would be helpful for me to understand
datalines..

  s.no policy no      start_date        end_date

    1       32           04apr2003      03apr2004 

    2       32           04apr2003      03apr2004

 

Thankyou for your time

Valued Guide
Posts: 797

Re: how to delete rows with duplicate data and create new rows

The begining of the program has

 

   set have;

   by policyno start_date end_date;

   if first.end_date;

 

The BY statement tells SAS to (1) expect the data to be pre-sorted by policyno start_date end_date, and (2) set automatic dummies for each of the by variables indicaiing whether the record in hand is the first record for a given by value (first.policyno, first.start_date, first.end_date) or the last (last.policyno etc.).    And what happens to the first. dummies when policyno changes but the dates don't? The treatment of by values is hierarchical, so if a given by var changes setting its first. dummy, all lower order (to the right) first. dummies  are also set - no matter what their value sequence is.

 

The third statement is a subsetting IF (look up sas "subsetting if") which tells sas to keep only records which are the first instance of a given policyno/start_date/end_date combination.  As a result the DO group never processes duplicates.

 

If you want an explanation of how the DO group works, take a multiyear policy, run the do group. But inside the do group place a set of PUT statements to see what is happening to N, start_date, end_date, and date_dif:

 

 

e_date=end_date;

put 'A: ' e_date=yymmddn8.;

  do N=1 to 10 while (start_date < e_date);

    put / 'B: ' N= start_date= end_date=;
    end_date=min(e_date,intnx('year',start_date,1,'s')-1);

    put  'C: '  end_date=;
    length date_dif $20;
    date_dif='One Year';
    if end_date<intnx('year',start_date,1,'s')-1 then
      date_dif=catx(' ',intck('month',start_date,end_date+1,'continuous'),'months');

      put 'D: ' date_dif=;
    output;
    start_date=end_date+1;

    put 'E: ' start_date=;
  end;

☑ This topic is SOLVED.

Need further help from the community? Please ask a new question.

Discussion stats
  • 11 replies
  • 347 views
  • 2 likes
  • 5 in conversation