BookmarkSubscribeRSS Feed
EEEY
Obsidian | Level 7

Hi there,

 

I have a dataset with customers's insurance data and am trying to find out the coverage for the  year of 2011 based on the all the coverage transaction by end of year of 2011.

 

 

The data is like below. I alaready sorted by cust_id and the coverage_begin_dt descending.. Which means that the first row of each customer is the top of the stack of the transaction.

EMPLIDCOVERAGE_BEGIN_DTPLANElection
111111101Aug2016TTerminated
111111101Jan2011BMF01HElected
111111101Jan2010BMF01HElected
111111123Jun2003BM0004Elected
222222201Jun2009TTerminated
222222214Dec2008BACPElected
333333301Oct2012TTerminated
333333301May2011BMR030Elected
333333301Apr2011TTerminated
333333304Mar1997BM0013Elected
444444401Jan2014BMF02HElected
444444401Mar2011BMF03HElected
444444401Jan2010BMF01HElected
444444403Dec1986BM0004Elected

 

My goal is to get the following result:

EMPLIDCOVERAGE_BEGIN_DTCOVERAGE_END_DTPLANElection
111111101Jan201131Dec2011BMF01HE
2222222   T
333333301May201131Dec2011BMR030E
333333301Jan201131Mar2011BM0013E
444444401Jan201128Feb2011BMF01HE
444444401Mar201131Dec2011BMF03HE

 

I know I might need to use Retain statement but I have been stuck here for days. 😞

 

Can any one help me out with a simple solution?

 

Thanks a lot!

10 REPLIES 10
ballardw
Super User

I'm not sure there will be a "simple" solution but I don't see it as impossible.

However there are a few points that you will need to expand on the logic involved.

 

You probably want to sort by the begin date not descending as it is much easier to "look behind", previous records or from bottom to one above, than the other way around.

 

First it appears that all EMPID have to appear even if there is no 2011 data. Is that correct?

 

Second for EMPID 33333 it appears that there is an assumption the plan information from 1997 is assumed to be in effect until before the begining of April 2011 and so the information for Jan to Mar of 2011 is inserted. Is that correct?

Is there a reason that the order of the results for EMPID 44444 is different than for 33333? The insertion looks to be similar but the order is different. Does the actual order matter or can a Sort be done as needed?

 

 

 

EEEY
Obsidian | Level 7

Thanks Ballardw,

I guess this would not a simple solution as I am stuck on this for days.

 

Regarding your questions:

 

You probably want to sort by the begin date not descending as it is much easier to "look behind", previous records or from bottom to one above, than the other way around. -- You are right. As I need to get the earlier Coverage_begin_dt for each Plan. So should be sorting ascending instead of descending.

 

First it appears that all EMPID have to appear even if there is no 2011 data. Is that correct? - Yes.. Its correct. I removed the transactions after 2011 as I only need to provide the report for the year of 2011. If there is no transactions after the last row, means there was no coverage changes occurred by the 2011 year end.

 

Second for EMPID 33333 it appears that there is an assumption the plan information from 1997 is assumed to be in effect until before the begining of April 2011 and so the information for Jan to Mar of 2011 is inserted. Is that correct? - thats is the data look like. Again, since I only need to report for 2011, if there is no change between 1997-1/1/2011, I will just need to insert a row with coverage begin/end dt as 1/1/2011 and 12/31/2011. The plan for this newly inserted row should be the PLan from 1997.

 

I was thinking that due to the complexity of this assignment, may be using Retain to get the Coverage_begin_dt carried over from the previous row to current row first, then think about the 2011 cutt off secondly.

 

Is there a reason that the order of the results for EMPID 44444 is different than for 33333? The insertion looks to be similar but the order is different. Does the actual order matter or can a Sort be done as needed? - the final order of the result does not matter.

ballardw
Super User

Here's one approach. I won't make any claim about being the slickest. There may be an issue about the Election.

data have;
   informat EMPLID $5.  COVERAGE_BEGIN_DT date9.   PLAN $7. Election $10.;
   format COVERAGE_BEGIN_DT date9.;
   input EMPLID COVERAGE_BEGIN_DT PLAN Election ;
datalines;
1111111 01Aug2016 T Terminated 
1111111 01Jan2011 BMF01H Elected 
1111111 01Jan2010 BMF01H Elected 
1111111 23Jun2003 BM0004 Elected 
2222222 01Jun2009 T Terminated 
2222222 14Dec2008 BACP Elected 
3333333 01Oct2012 T Terminated 
3333333 01May2011 BMR030 Elected 
3333333 01Apr2011 T Terminated 
3333333 04Mar1997 BM0013 Elected 
4444444 01Jan2014 BMF02H Elected 
4444444 01Mar2011 BMF03H Elected 
4444444 01Jan2010 BMF01H Elected 
4444444 03Dec1986 BM0004 Elected 
;
run;

proc sort data= have;
   by EMPLID COVERAGE_BEGIN_DT;
run;

data want;
   set have;
   by emplid;
   LagDt = lag(COVERAGE_BEGIN_DT);
   LagPlan= lag(plan);
   retain outputflag;
   If first.emplid then do;
      /* lagged values inappropriate to use*/
      call missing(LagDt, LagPlan);
      outputflag=0;
   end;
   If year(COVERAGE_BEGIN_DT) = 2011 then do;
      /* since order of output does not matter first output
         to end of year from current*/
      if plan ne 'T' then do;
         /* could set to fixed date but this is how to set to end of year*/
         COVERAGE_END_DT = intnx('year',COVERAGE_BEGIN_DT,0,'E');
         Election='Elected';
         output;
         outputflag=1;
      end;
      /* then use the previous data if available*/
      If Month( COVERAGE_BEGIN_DT)>1 and not missing(LagDt) and LagPlan ne 'T' then do;
         /* Set end day before start, may want to set the end of the month prior
            using intnx('month',COVERAGE_BEGIN_DT,-1,'E')*/
         COVERAGE_END_DT= COVERAGE_BEGIN_DT-1;
         COVERAGE_BEGIN_DT=intnx('year',COVERAGE_BEGIN_DT,0,'B');
         Plan=LagPlan;
         Election='Elected';

         output;
         outputflag=1;

      end;
   end;/* of specified year*/
   if last.emplid and outputflag=0 then do;
      call missing(COVERAGE_END_DT,COVERAGE_BEGIN_DT,Plan);
      output;
   end;
   format COVERAGE_END_DT date9.;
   drop LagDt LagPlan outputflag;
run;
Ksharp
Super User
Assuming there are at most two obs of year 2011 for each EMPLID.



data have;
   informat EMPLID $5.  COVERAGE_BEGIN_DT date9.   PLAN $7. Election $10.;
   format COVERAGE_BEGIN_DT date9.;
   input EMPLID COVERAGE_BEGIN_DT PLAN Election ;
datalines;
1111111 01Aug2016 T Terminated 
1111111 01Jan2011 BMF01H Elected 
1111111 01Jan2010 BMF01H Elected 
1111111 23Jun2003 BM0004 Elected 
2222222 01Jun2009 T Terminated 
2222222 14Dec2008 BACP Elected 
3333333 01Oct2012 T Terminated 
3333333 01May2011 BMR030 Elected 
3333333 01Apr2011 T Terminated 
3333333 04Mar1997 BM0013 Elected 
4444444 01Jan2014 BMF02H Elected 
4444444 01Mar2011 BMF03H Elected 
4444444 01Jan2010 BMF01H Elected 
4444444 03Dec1986 BM0004 Elected 
;
run;
data temp;
 set have(where=(year(COVERAGE_BEGIN_DT)=2011));
 by EMPLID;
 retain found;
 lag=lag(COVERAGE_BEGIN_DT);
 if first.EMPLID then do;
  found=.;
  if COVERAGE_BEGIN_DT = '01jan2011'd then found=1;
  COVERAGE_END_DT=intnx('year',COVERAGE_BEGIN_DT,0,'e');
  output;
end;


if last.EMPLID then do;
if not found then do;
	if year(COVERAGE_BEGIN_DT)=2011 then temp=COVERAGE_BEGIN_DT;
	 else temp=lag;
	COVERAGE_BEGIN_DT='01jan2011'd;
    COVERAGE_END_DT=temp-1;output;
   end;
end;
 format COVERAGE_END_DT	date9.;
 drop found lag temp;
run;
proc sql;
create table x as
 select EMPLID , Election
  from have 
   where EMPLID not in (select distinct EMPLID from temp);
quit;
data xx;
 set x;
 by EMPLID;
 if first.EMPLID;
run;
data want;
 set temp xx;
 by EMPLID;
run;



EEEY
Obsidian | Level 7

Thanks for your input. It works as well as the solution from Ballrdw for my dataset. But there are few scenarios did not work:

 

1. there is few customers with only one observation.

2. in the following case,

EMPLIDCOVERAGE_ELECTCOVERAGE_BEGIN_DTPLAN
555555Elected01Jan1964BM0016
555555Elected01Dec2008BMR039
555555Terminated01Jan2015 
666666Elected01Jan2004BMR039
666666Terminated01Jan2015 

The result I'd like to see would be:

EMPLIDCOVERAGE_ELECTCOVERAGE_BEGIN_DTPLANCOVERAGE_END_DT
555555E01Jan2011BMR03912/31/2011
666666E01Jan2011BMR03912/31/2011

 

But the current program gives me empty row due to the Termination Row.

 

Thank you very much!

 

Ksharp
Super User
Why PLAN is BMR039, not BM0016 ?
And how you explain the blank of 22222 in original data ?

EEEY
Obsidian | Level 7

We have multiple plans available for employees.

Ksharp
Super User
No. I mean which obs I should retaiin.
And why 2222 in your original post have missing value, not like this new one .

ballardw
Super User

I thought of that later. The issue is mainly that you did not SPECIFY all of the cases in your original data or the description. Such as if the data year is past the specfied year and the previous record for a year prior to the specified year has ELECTED then you need to interpolate the values.

 

Insert this code:

  If year(COVERAGE_BEGIN_DT) > 2011 And Year(LagDT)< 2011 and LagPlan ne 'T' then do;
      /* interpolate between*/
      Plan=LagPlan;
      COVERAGE_END_DT   = '31DEC2011'd;
      COVERAGE_BEGIN_DT = '01JAN2011'd;
      Election='Elected';
      OutFlag=1;

   end;/* of interpolate year*/

BEFORE the If last.emplid block.

 

EEEY
Obsidian | Level 7

Thanks Ballardw so much for your idea! I have modified a bit the code you sent. The complete the code I used is as below. I added one scenario with emplid 2222111.

 

data have;
   informat EMPLID $5.  COVERAGE_BEGIN_DT date9.   PLAN $7. Election $10.;
   format COVERAGE_BEGIN_DT date9.;
   input EMPLID COVERAGE_BEGIN_DT PLAN Election ;
datalines;
1111111 01Aug2016 T Terminated
1111111 01Jan2011 BMF01H Elected
1111111 01Jan2010 BMF01H Elected
1111111 23Jun2003 BM0004 Elected
2222222 01Jun2009 T Terminated
2222222 14Dec2008 BACP Elected
3333333 01Oct2012 T Terminated
3333333 01May2011 BMR030 Elected
3333333 01Apr2011 T Terminated
3333333 04Mar1997 BM0013 Elected
4444444 01Jan2014 BMF02H Elected
4444444 01Mar2011 BMF03H Elected
4444444 01Jan2010 BMF01H Elected
4444444 03Dec1986 BM0004 Elected
5555555 01Jan1964 BMF02H Elected
5555555 01Dec2008 BMR039 Elected
5555555 01Jan2015 T Terminated
6666666 01Jan2004 BMR039 Elected
6666666 01Jan2015 T Terminated
6666666 01Jan2004 BMR039 Elected
6666666 01Jan2015 T Terminated
2222111 18Dec2006 BACP Elected
2222111 13Aug2005 BACP Elected
2222111 25Dec2011 BACP Elected
2222111 01Jul2014 BACP Elected
;run;

proc sort data= have;
   by EMPLID COVERAGE_BEGIN_DT;
run;

data want;
set have;
   by emplid;
   LagDt = lag(COVERAGE_BEGIN_DT);
   Lagplan=lag(plan);
   retain outputflag;
   If first.emplid then do;
      /* lagged values inappropriate to use*/
      call missing(LagDt, Lagplan);
      outputflag=0;
           end;

   If year(COVERAGE_BEGIN_DT) in (2011) then do;
      /* since order of output does not matter first output
         to end of year from current*/
      if plan ne 'T' then do;
         /* could set to fixed date but this is how to set to end of year*/
         COVERAGE_END_DT = intnx('year',COVERAGE_BEGIN_DT,0,'E');
         Election='Elected';
         output;
         outputflag=1;
      end;
      /* then use the previous data if available*/
    If Month(COVERAGE_BEGIN_DT)>1 and not missing(LagDt) and Lagplan ne 'T' then do;
         /* Set end day before start, may want to set the end of the month prior
            using intnx('month',COVERAGE_BEGIN_DT,-1,'E')*/
         COVERAGE_END_DT= COVERAGE_BEGIN_DT-1;
         COVERAGE_BEGIN_DT=intnx('year',COVERAGE_BEGIN_DT,0,'B');
         plan=Lagplan;
         Election='Elected';

         output;
         outputflag=1;

      end;
   end;/* of specified year*/
   If year(COVERAGE_BEGIN_DT) > 2011 And Year(LagDT)< 2011 and Lagplan ne 'T' then do;
      /* interpolate between*/
      plan=Lagplan;
      COVERAGE_END_DT   = '01Jan2011'd;
      COVERAGE_BEGIN_DT = '31Dec2011'd;
      Election='Elected';
      output;
      outputflag=1;

   end;/* of interpolate year*/
   if last.emplid and outputflag=0 then do;
      call missing(COVERAGE_END_DT,COVERAGE_BEGIN_DT,plan);
      output;
   end;
   format COVERAGE_END_DT date9.;
   drop LagDt Lagplan outputflag;
run;

proc sort data=want;by emplid coverage_begin_dt ;run;

 

The result I got:

EMPLIDCOVERAGE_BEGIN_DTPLANElectionCOVERAGE_END_DT
1111101Jan2011BMF01HElected31Dec2011
2222101Jan2011BACPElected24Dec2011
2222125Dec2011BACPElected31Dec2011
22222  Terminated 
3333301Jan2011BM0013Elected31Mar2011
3333301May2011BMR030Elected31Dec2011
4444401Jan2011BMF01HElected28Feb2011
4444401Mar2011BMF03HElected31Dec2011
55555  Elected 
5555531Dec2011BMR039Elected01Jan2011
66666  Terminated 
6666631Dec2011BMR039Elected01Jan2011

 

The problem is that I expected the result for 222211 would be 1/1/2011 -12/31/2011. The reason is that, the Plan did not change from observation to observation. I think I need to compare the plan with Lagplan as well.  I tried to add this comparison to the program, but it did not work. 😞 

 

 

 

Thanks a lot!

SAS Innovate 2025: Register Now

Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 10 replies
  • 2272 views
  • 0 likes
  • 3 in conversation