Hi there,
I have a dataset with customers's insurance data and am trying to find out the coverage for the year of 2011 based on the all the coverage transaction by end of year of 2011.
The data is like below. I alaready sorted by cust_id and the coverage_begin_dt descending.. Which means that the first row of each customer is the top of the stack of the transaction.
EMPLID | COVERAGE_BEGIN_DT | PLAN | Election |
1111111 | 01Aug2016 | T | Terminated |
1111111 | 01Jan2011 | BMF01H | Elected |
1111111 | 01Jan2010 | BMF01H | Elected |
1111111 | 23Jun2003 | BM0004 | Elected |
2222222 | 01Jun2009 | T | Terminated |
2222222 | 14Dec2008 | BACP | Elected |
3333333 | 01Oct2012 | T | Terminated |
3333333 | 01May2011 | BMR030 | Elected |
3333333 | 01Apr2011 | T | Terminated |
3333333 | 04Mar1997 | BM0013 | Elected |
4444444 | 01Jan2014 | BMF02H | Elected |
4444444 | 01Mar2011 | BMF03H | Elected |
4444444 | 01Jan2010 | BMF01H | Elected |
4444444 | 03Dec1986 | BM0004 | Elected |
My goal is to get the following result:
EMPLID | COVERAGE_BEGIN_DT | COVERAGE_END_DT | PLAN | Election |
1111111 | 01Jan2011 | 31Dec2011 | BMF01H | E |
2222222 | T | |||
3333333 | 01May2011 | 31Dec2011 | BMR030 | E |
3333333 | 01Jan2011 | 31Mar2011 | BM0013 | E |
4444444 | 01Jan2011 | 28Feb2011 | BMF01H | E |
4444444 | 01Mar2011 | 31Dec2011 | BMF03H | E |
I know I might need to use Retain statement but I have been stuck here for days. 😞
Can any one help me out with a simple solution?
Thanks a lot!
I'm not sure there will be a "simple" solution but I don't see it as impossible.
However there are a few points that you will need to expand on the logic involved.
You probably want to sort by the begin date not descending as it is much easier to "look behind", previous records or from bottom to one above, than the other way around.
First it appears that all EMPID have to appear even if there is no 2011 data. Is that correct?
Second for EMPID 33333 it appears that there is an assumption the plan information from 1997 is assumed to be in effect until before the begining of April 2011 and so the information for Jan to Mar of 2011 is inserted. Is that correct?
Is there a reason that the order of the results for EMPID 44444 is different than for 33333? The insertion looks to be similar but the order is different. Does the actual order matter or can a Sort be done as needed?
Thanks Ballardw,
I guess this would not a simple solution as I am stuck on this for days.
Regarding your questions:
You probably want to sort by the begin date not descending as it is much easier to "look behind", previous records or from bottom to one above, than the other way around. -- You are right. As I need to get the earlier Coverage_begin_dt for each Plan. So should be sorting ascending instead of descending.
First it appears that all EMPID have to appear even if there is no 2011 data. Is that correct? - Yes.. Its correct. I removed the transactions after 2011 as I only need to provide the report for the year of 2011. If there is no transactions after the last row, means there was no coverage changes occurred by the 2011 year end.
Second for EMPID 33333 it appears that there is an assumption the plan information from 1997 is assumed to be in effect until before the begining of April 2011 and so the information for Jan to Mar of 2011 is inserted. Is that correct? - thats is the data look like. Again, since I only need to report for 2011, if there is no change between 1997-1/1/2011, I will just need to insert a row with coverage begin/end dt as 1/1/2011 and 12/31/2011. The plan for this newly inserted row should be the PLan from 1997.
I was thinking that due to the complexity of this assignment, may be using Retain to get the Coverage_begin_dt carried over from the previous row to current row first, then think about the 2011 cutt off secondly.
Is there a reason that the order of the results for EMPID 44444 is different than for 33333? The insertion looks to be similar but the order is different. Does the actual order matter or can a Sort be done as needed? - the final order of the result does not matter.
Here's one approach. I won't make any claim about being the slickest. There may be an issue about the Election.
data have;
informat EMPLID $5. COVERAGE_BEGIN_DT date9. PLAN $7. Election $10.;
format COVERAGE_BEGIN_DT date9.;
input EMPLID COVERAGE_BEGIN_DT PLAN Election ;
datalines;
1111111 01Aug2016 T Terminated
1111111 01Jan2011 BMF01H Elected
1111111 01Jan2010 BMF01H Elected
1111111 23Jun2003 BM0004 Elected
2222222 01Jun2009 T Terminated
2222222 14Dec2008 BACP Elected
3333333 01Oct2012 T Terminated
3333333 01May2011 BMR030 Elected
3333333 01Apr2011 T Terminated
3333333 04Mar1997 BM0013 Elected
4444444 01Jan2014 BMF02H Elected
4444444 01Mar2011 BMF03H Elected
4444444 01Jan2010 BMF01H Elected
4444444 03Dec1986 BM0004 Elected
;
run;
proc sort data= have;
by EMPLID COVERAGE_BEGIN_DT;
run;
data want;
set have;
by emplid;
LagDt = lag(COVERAGE_BEGIN_DT);
LagPlan= lag(plan);
retain outputflag;
If first.emplid then do;
/* lagged values inappropriate to use*/
call missing(LagDt, LagPlan);
outputflag=0;
end;
If year(COVERAGE_BEGIN_DT) = 2011 then do;
/* since order of output does not matter first output
to end of year from current*/
if plan ne 'T' then do;
/* could set to fixed date but this is how to set to end of year*/
COVERAGE_END_DT = intnx('year',COVERAGE_BEGIN_DT,0,'E');
Election='Elected';
output;
outputflag=1;
end;
/* then use the previous data if available*/
If Month( COVERAGE_BEGIN_DT)>1 and not missing(LagDt) and LagPlan ne 'T' then do;
/* Set end day before start, may want to set the end of the month prior
using intnx('month',COVERAGE_BEGIN_DT,-1,'E')*/
COVERAGE_END_DT= COVERAGE_BEGIN_DT-1;
COVERAGE_BEGIN_DT=intnx('year',COVERAGE_BEGIN_DT,0,'B');
Plan=LagPlan;
Election='Elected';
output;
outputflag=1;
end;
end;/* of specified year*/
if last.emplid and outputflag=0 then do;
call missing(COVERAGE_END_DT,COVERAGE_BEGIN_DT,Plan);
output;
end;
format COVERAGE_END_DT date9.;
drop LagDt LagPlan outputflag;
run;
Assuming there are at most two obs of year 2011 for each EMPLID. data have; informat EMPLID $5. COVERAGE_BEGIN_DT date9. PLAN $7. Election $10.; format COVERAGE_BEGIN_DT date9.; input EMPLID COVERAGE_BEGIN_DT PLAN Election ; datalines; 1111111 01Aug2016 T Terminated 1111111 01Jan2011 BMF01H Elected 1111111 01Jan2010 BMF01H Elected 1111111 23Jun2003 BM0004 Elected 2222222 01Jun2009 T Terminated 2222222 14Dec2008 BACP Elected 3333333 01Oct2012 T Terminated 3333333 01May2011 BMR030 Elected 3333333 01Apr2011 T Terminated 3333333 04Mar1997 BM0013 Elected 4444444 01Jan2014 BMF02H Elected 4444444 01Mar2011 BMF03H Elected 4444444 01Jan2010 BMF01H Elected 4444444 03Dec1986 BM0004 Elected ; run; data temp; set have(where=(year(COVERAGE_BEGIN_DT)=2011)); by EMPLID; retain found; lag=lag(COVERAGE_BEGIN_DT); if first.EMPLID then do; found=.; if COVERAGE_BEGIN_DT = '01jan2011'd then found=1; COVERAGE_END_DT=intnx('year',COVERAGE_BEGIN_DT,0,'e'); output; end; if last.EMPLID then do; if not found then do; if year(COVERAGE_BEGIN_DT)=2011 then temp=COVERAGE_BEGIN_DT; else temp=lag; COVERAGE_BEGIN_DT='01jan2011'd; COVERAGE_END_DT=temp-1;output; end; end; format COVERAGE_END_DT date9.; drop found lag temp; run; proc sql; create table x as select EMPLID , Election from have where EMPLID not in (select distinct EMPLID from temp); quit; data xx; set x; by EMPLID; if first.EMPLID; run; data want; set temp xx; by EMPLID; run;
Thanks for your input. It works as well as the solution from Ballrdw for my dataset. But there are few scenarios did not work:
1. there is few customers with only one observation.
2. in the following case,
EMPLID | COVERAGE_ELECT | COVERAGE_BEGIN_DT | PLAN |
555555 | Elected | 01Jan1964 | BM0016 |
555555 | Elected | 01Dec2008 | BMR039 |
555555 | Terminated | 01Jan2015 | |
666666 | Elected | 01Jan2004 | BMR039 |
666666 | Terminated | 01Jan2015 |
The result I'd like to see would be:
EMPLID | COVERAGE_ELECT | COVERAGE_BEGIN_DT | PLAN | COVERAGE_END_DT |
555555 | E | 01Jan2011 | BMR039 | 12/31/2011 |
666666 | E | 01Jan2011 | BMR039 | 12/31/2011 |
But the current program gives me empty row due to the Termination Row.
Thank you very much!
Why PLAN is BMR039, not BM0016 ? And how you explain the blank of 22222 in original data ?
We have multiple plans available for employees.
No. I mean which obs I should retaiin. And why 2222 in your original post have missing value, not like this new one .
I thought of that later. The issue is mainly that you did not SPECIFY all of the cases in your original data or the description. Such as if the data year is past the specfied year and the previous record for a year prior to the specified year has ELECTED then you need to interpolate the values.
Insert this code:
If year(COVERAGE_BEGIN_DT) > 2011 And Year(LagDT)< 2011 and LagPlan ne 'T' then do;
/* interpolate between*/
Plan=LagPlan;
COVERAGE_END_DT = '31DEC2011'd;
COVERAGE_BEGIN_DT = '01JAN2011'd;
Election='Elected';
OutFlag=1;
end;/* of interpolate year*/
BEFORE the If last.emplid block.
Thanks Ballardw so much for your idea! I have modified a bit the code you sent. The complete the code I used is as below. I added one scenario with emplid 2222111.
data have;
informat EMPLID $5. COVERAGE_BEGIN_DT date9. PLAN $7. Election $10.;
format COVERAGE_BEGIN_DT date9.;
input EMPLID COVERAGE_BEGIN_DT PLAN Election ;
datalines;
1111111 01Aug2016 T Terminated
1111111 01Jan2011 BMF01H Elected
1111111 01Jan2010 BMF01H Elected
1111111 23Jun2003 BM0004 Elected
2222222 01Jun2009 T Terminated
2222222 14Dec2008 BACP Elected
3333333 01Oct2012 T Terminated
3333333 01May2011 BMR030 Elected
3333333 01Apr2011 T Terminated
3333333 04Mar1997 BM0013 Elected
4444444 01Jan2014 BMF02H Elected
4444444 01Mar2011 BMF03H Elected
4444444 01Jan2010 BMF01H Elected
4444444 03Dec1986 BM0004 Elected
5555555 01Jan1964 BMF02H Elected
5555555 01Dec2008 BMR039 Elected
5555555 01Jan2015 T Terminated
6666666 01Jan2004 BMR039 Elected
6666666 01Jan2015 T Terminated
6666666 01Jan2004 BMR039 Elected
6666666 01Jan2015 T Terminated
2222111 18Dec2006 BACP Elected
2222111 13Aug2005 BACP Elected
2222111 25Dec2011 BACP Elected
2222111 01Jul2014 BACP Elected
;run;
proc sort data= have;
by EMPLID COVERAGE_BEGIN_DT;
run;
data want;
set have;
by emplid;
LagDt = lag(COVERAGE_BEGIN_DT);
Lagplan=lag(plan);
retain outputflag;
If first.emplid then do;
/* lagged values inappropriate to use*/
call missing(LagDt, Lagplan);
outputflag=0;
end;
If year(COVERAGE_BEGIN_DT) in (2011) then do;
/* since order of output does not matter first output
to end of year from current*/
if plan ne 'T' then do;
/* could set to fixed date but this is how to set to end of year*/
COVERAGE_END_DT = intnx('year',COVERAGE_BEGIN_DT,0,'E');
Election='Elected';
output;
outputflag=1;
end;
/* then use the previous data if available*/
If Month(COVERAGE_BEGIN_DT)>1 and not missing(LagDt) and Lagplan ne 'T' then do;
/* Set end day before start, may want to set the end of the month prior
using intnx('month',COVERAGE_BEGIN_DT,-1,'E')*/
COVERAGE_END_DT= COVERAGE_BEGIN_DT-1;
COVERAGE_BEGIN_DT=intnx('year',COVERAGE_BEGIN_DT,0,'B');
plan=Lagplan;
Election='Elected';
output;
outputflag=1;
end;
end;/* of specified year*/
If year(COVERAGE_BEGIN_DT) > 2011 And Year(LagDT)< 2011 and Lagplan ne 'T' then do;
/* interpolate between*/
plan=Lagplan;
COVERAGE_END_DT = '01Jan2011'd;
COVERAGE_BEGIN_DT = '31Dec2011'd;
Election='Elected';
output;
outputflag=1;
end;/* of interpolate year*/
if last.emplid and outputflag=0 then do;
call missing(COVERAGE_END_DT,COVERAGE_BEGIN_DT,plan);
output;
end;
format COVERAGE_END_DT date9.;
drop LagDt Lagplan outputflag;
run;
proc sort data=want;by emplid coverage_begin_dt ;run;
The result I got:
EMPLID | COVERAGE_BEGIN_DT | PLAN | Election | COVERAGE_END_DT |
11111 | 01Jan2011 | BMF01H | Elected | 31Dec2011 |
22221 | 01Jan2011 | BACP | Elected | 24Dec2011 |
22221 | 25Dec2011 | BACP | Elected | 31Dec2011 |
22222 | Terminated | |||
33333 | 01Jan2011 | BM0013 | Elected | 31Mar2011 |
33333 | 01May2011 | BMR030 | Elected | 31Dec2011 |
44444 | 01Jan2011 | BMF01H | Elected | 28Feb2011 |
44444 | 01Mar2011 | BMF03H | Elected | 31Dec2011 |
55555 | 31Dec2011 | BMR039 | Elected | 01Jan2011 |
66666 | 31Dec2011 | BMR039 | Elected | 01Jan2011 |
The problem is that I expected the result for 222211 would be 1/1/2011 -12/31/2011. The reason is that, the Plan did not change from observation to observation. I think I need to compare the plan with Lagplan as well. I tried to add this comparison to the program, but it did not work. 😞
Thanks a lot!
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.