acemanhattan
Quartz | Level 8

Business Problem: In health insurance, at the end of each month we look at the claims we've paid that month for a line of business and must estimate the total claims we will eventually pay for that month. In other words, since not all claims get processed/paid immediately, we have to use the claims paid so far for a month to estimate what we will ultimately pay (usually over the next few years) for that month.

 

Desired SAS Outcome: I'd like to back-test our current method of estimation against a method where we would look back over all previous observations for the specific grouping (state, plan, line of business, incurred month), find what the, say, 10th percentile of "completeness" is (completeness = (paid in first month)/(total ultimately paid)), and simply base our initial estimate on that percentile.

 

Data characteristics: The important characteristics of the data are State, Plan, Line of Business, Incurred Month, and Completion Factor (CF=Paid in 1st month/total paid a year later).

 

Essentially, the goal is, for every observation, to return the kth percentile of the previous observations, where k is specified in advance. Unfortunately, beyond being able to manipulate my data to the point shown here, I don't know how to proceed.

 

The data I have is like this, where incmo is the month the claim was incurred and CF is the percentage of total claims paid in the 1st month:

 

state lob_detail1 incmo CF
State_1 LGU9 201601 0.136734383127355
State_1 LGU9 201602 0.210437011265246
State_1 LGU9 201603 0.19413087349308
State_1 LGU9 201604 0.184407005224493
State_2 LGU9 201601 0.215287799314405
State_2 LGU9 201602 0.182587287537283
State_2 LGU9 201603 0.280714325937497
State_2 LGU9 201604 0.254551473091634
State_3 LGU9 201601 0.208081961799548
State_3 LGU9 201602 0.270773341437457
State_3 LGU9 201603 0.248910270935405
State_3 LGU9 201604 0.225117457605182
State_3 LGUR 201601 0.146944831323517
State_3 LGUR 201602 0.249596790488284
State_3 LGUR 201603 0.295016498212651
State_3 LGUR 201604 0.285342162980124
State_4 LGU9_ANH 201601 0.176374999741337
State_4 LGU9_ANH 201602 0.165015113345909
State_4 LGU9_ANH 201603 0.185049039686394
State_4 LGU9_ANH 201604 0.206630218055315
State_4 LGU9 201601 0.215764666628871
State_4 LGU9 201602 0.20782685761294
State_4 LGU9 201603 0.185587675180547
State_4 LGU9 201604 0.239442682041965

 

The data I want, where I've arbitrarily calculated the 10th percentile based on the group's previous CF observations, is like this:

 

state lob_detail1 incmo CF 10TH_PRCNT
State_1 LGU9 201601 0.136734383127355
State_1 LGU9 201602 0.210437011265246 0.136734383127355
State_1 LGU9 201603 0.19413087349308 0.144104645941144
State_1 LGU9 201604 0.184407005224493 0.1482136812005
State_2 LGU9 201601 0.215287799314405
State_2 LGU9 201602 0.182587287537283 0.215287799314405
State_2 LGU9 201603 0.280714325937497 0.185857338714995
State_2 LGU9 201604 0.254551473091634 0.189127389892707
State_3 LGU9 201601 0.208081961799548
State_3 LGU9 201602 0.270773341437457 0.208081961799548
State_3 LGU9 201603 0.248910270935405 0.214351099763339
State_3 LGU9 201604 0.225117457605182 0.216247623626719
State_3 LGUR 201601 0.146944831323517 0.213192610541238
State_3 LGUR 201602 0.249596790488284 0.171399683513929
State_3 LGUR 201603 0.295016498212651 0.177513396561532
State_3 LGUR 201604 0.285342162980124 0.183627109609136
State_4 LGU9_ANH 201601 0.176374999741337
State_4 LGU9_ANH 201602 0.165015113345909 0.176374999741337
State_4 LGU9_ANH 201603 0.185049039686394 0.166151101985452
State_4 LGU9_ANH 201604 0.206630218055315 0.167287090624995
State_4 LGU9 201601 0.215764666628871
State_4 LGU9 201602 0.20782685761294 0.215764666628871
State_4 LGU9 201603 0.185587675180547 0.208620638514533
State_4 LGU9 201604 0.239442682041965 0.190035511667026

 

In advance, thank you very much for your help.

 

To add clarification:

 

Observations don't represent individual claims here. They correspond to aggregated claims data for a given line of business in a specific incurred month (incmo).

 

So, for the first observation, the total claims payment made in 201601 for claims incurred in 201601 on the given line of business, was 13.67% of total claims payments made from 201601-201612 for claims incurred in 201601.

 

For the second observation, the total claims payment made in 201602 for claims incurred in 201602 on the given line of business, was 21% of total claims payments made from 201602-201701 for claims incurred in 201602.

18 REPLIES
Reeza
Super User

You can calculate percentiles in PROC UNIVARIATE and/or get probability plots (CDF).

That allows you to calculate the difference.

 

I don't understand your data though. You usually have data with a claim date and paid/submission date. Those are what's needed to calculate this type of analysis, and seeing what you have here I'm not sure how to solve your stated problem. Please simplify your sample data to reflect only what's needed.

acemanhattan
Quartz | Level 8

@Reeza

 

Observations don't represent individual claims here. They correspond to aggregated claims data for a given line of business in a specific incurred month (incmo).

 

So, for the first observation, the total claims payment made in 201601 for claims incurred in 201601 on the given line of business, was 13.67% of total claims payments made from 201601-201612 for claims incurred in 201601 on that line of business.

 

For the second observation, the total claims payment made in 201602 for claims incurred in 201602 on the given line of business, was 21% of total claims payments made from 201602-201701 for claims incurred in 201602 on that line of business.

ballardw
Super User

Can you show the calculation you used to get 0.144104645941144 on this row? Since I only see 3 values up to that point, I think you are using a different definition of percentile than I am used to.

State_1 LGU9 201603 0.19413087349308 0.144104645941144
acemanhattan
Quartz | Level 8

@ballardw 

 

I actually just used Excel's PERCENTILE.INC() function to produce the data.


I believe it treats .136734 (from the previous observation) as the 0th percentile and then linearly interpolates between .136734 and .210437 to get the 10th percentile.


Honestly, I just used that calculation to have some sort of reasonable output; if there's a built-in SAS function that's more robust or sophisticated and calculates it a little differently, I'm totally okay with that.
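 

If I'm reading PERCENTILE.INC correctly (treat this as my assumption about Excel's convention), it computes a rank h = k*(n-1) + 1 over the n prior values, sorted ascending, and linearly interpolates between the two surrounding values. For the row you asked about, with n = 2 prior values {0.136734, 0.210437} and k = 0.10:

h = 0.10*(2 - 1) + 1 = 1.1
p10 = 0.136734 + 0.1*(0.210437 - 0.136734) ≈ 0.144105

which matches the 0.144104645941144 in my desired output. The next row works the same way with n = 3 sorted prior values {0.136734, 0.194131, 0.210437}: h = 1.2, so p10 = 0.136734 + 0.2*(0.194131 - 0.136734) ≈ 0.148214.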

Reeza
Super User

Can you post the data you have as a starting point?

acemanhattan
Quartz | Level 8

@Reeza

 

The data starts from millions of rows of claims data and gets aggregated to, more or less, what you see below (basically, bucketing by line of business and the month the claims were incurred). PAID_0 is what was paid in aggregate during the incurred month for that line of business; UIC_11 is the actual total paid after 12 months (Ultimate Incurred Claims) for that line and incurred month. Then what I'm calling the completion factor (CF) for the first month is simply PAID_0/UIC_11.

 

At any given observation, I want to look at the CFs for the incurred months that preceded it (for the relevant grouping) and be able to say what the kth percentile (I'll test different k's) of those CFs is. Then I will take PAID_0/kth_percentile to predict what UIC_11 would be. I will later compare our current method of estimating against this simple method of estimating that I'm describing here.
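 

As a quick worked illustration, using the 10TH_PRCNT value from my earlier table (so the specific numbers are just illustrative): for State_1/LGU9/201604, PAID_0 = 69705.21 and the 10th percentile of the prior CFs is 0.148214, so the predicted ultimate would be PAID_0/p10 = 69705.21/0.148214 ≈ 470,302, versus the actual UIC_11 of 377,996.55. Dividing by a low-percentile CF deliberately gives a high-end (conservative) estimate of the ultimate.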

 

state lob_detail1 incmo PAID_0 UIC_11
State_1 LGU9 201601 52522.2331334333 384118.697376312
State_1 LGU9 201602 79224.5659670165 376476.388305847
State_1 LGU9 201603 89073.8553223388 458834.052098951
State_1 LGU9 201604 69705.212143928 377996.552023988
State_2 LGU9 201601 244516.245877061 1135764.52848576
State_2 LGU9 201602 219085.415292354 1199894.13418291
State_2 LGU9 201603 372887.598950525 1328352.57946027
State_2 LGU9 201604 319201.644677661 1253976.81184408
State_3 LGU9 201601 169107.013493253 812694.248125937
State_3 LGU9 201602 223892.78035982 826864.192653673
State_3 LGU9 201603 261290.847076462 1049739.1131934
State_3 LGU9 201604 219798.977511244 976374.643928036
State_3 LGUR 201601 12979.0554722639 88326.0428785607
State_3 LGUR 201602 21877.3733133433 87650.8598950525
State_3 LGUR 201603 22726.9617691154 77036.240029985
State_3 LGUR 201604 23083.0509745127 80896.0398050975
State_4 LGU9_ANH 201601 17909.3590704648 101541.369790105
State_4 LGU9_ANH 201602 22045.9145427286 133599.366116942
State_4 LGU9_ANH 201603 23200.1476761619 125372.969865067
State_4 LGU9_ANH 201604 35204.0277361319 170372.117241379
State_4 LGU9 201601 164997.991004498 764712.747376312
State_4 LGU9 201602 183273.077961019 881854.636431784
State_4 LGU9 201603 206703.428785607 1113777.77961019
State_4 LGU9 201604 233976.941529235 977173.073463268

Reeza
Super User
If you have the individual claim data, why bother with this approach at all? It may be simpler, but it's unlikely to be as accurate, and processing several million records is trivial in SAS. I did this exact analysis with approximately 30 million records per year and 10 years of data; it didn't take more than a day for the full programming, code, and results. The actual data simplifies this problem drastically. You obviously want to verify these against any rules, e.g., in-state claims are valid to be paid within 6 months, out-of-state/country within 365 days.

Because you only have a completion factor and you don't know how much was paid over time, all you can really say is whether the rate is constant over time or changes seasonally.
acemanhattan
Quartz | Level 8

@Reeza for context, I set reserves for a very large Health Insurer using the same methodology that has been used for years here; here is the methodology we just used to set reserves for claims that were incurred in September:

 

(1) Ask ourselves where we set reserves last month for claims incurred in the month of August. 

(2) Ask ourselves what the budget slope was from August to September.

(3) Follow the budget slope from where we set in August to come up with the estimated reserve in September.

 

There's no reason that methodology should be accurate; it's wholly predicated on how good a job we did setting the budget. Everyone pretty much knows it, but they use it anyway.

 

In general, I'm interested in learning how to be as accurate as possible with respect to this exercise, and I'd love to hear your insight about how to do that. However, in this specific case all I want to do is take a small, simple step toward greater sophistication by putting a defensible lower/upper bound on how incomplete our claims are after 1 month of payment. If I come to my overlords and say something to the effect of "hey, it looks like for this line the 10th percentile of completion factors is x and the 90th percentile is y, yet you want me to set reserves such that completion is less than x (greater than y); maybe we should rethink this," I have a pretty good chance of getting buy-in. On the other hand, if I try to sell them on something much more sophisticated than that, at this point there's pretty much no chance they'll be interested.

 

So this is why I want to look back over payments in month 1 relative to total payments and be able to talk about the kth percentile as of a given point in time (as of a given point in time so that I can back-test and show whether or not this method of estimation is better than how we're currently estimating).

 

Reeza
Super User

How do STATES come into play here? And the other variables? From what you've said so far, month and the amounts make sense, but are the other variables relevant in some manner?

acemanhattan
Quartz | Level 8

@Reeza

 

Good question.

 

Our business is set up such that there are structural differences in the way we pay claims depending on what state the claim came from, so, at a state level, I'd expect to see differences and would want to evaluate states separately. Then we have differences driven by the way membership (both number of bodies and the "type" of member) differs across our lines of business (e.g., I'd expect a line of business with 100,000 members to have different payment patterns than one with 8,000 members).

 

Relative to the dataset I posted most recently (state, lob_detail1, incmo, etc), I'd want to evaluate each distinct combination of STATE and LOB_DETAIL1 across its range of incmos.  So, for example, with STATE = STATE_4 and LOB_DETAIL1=LGU9_ANH and INCMO=201603, I'd want to return the kth percentile of CF where I'm looking at the CF values for INCMO 201601-201602 (this example is made obscure by the fact that I've included so few months.  For most of these lines of business we'll be looking back over 10 years of claims history).

 

Reeza
Super User

I'm not sure you have enough data (at least presented) to do what you want. My approach would be to first calculate the percent paid and then check for seasonality. I suspect the data will have seasonality. You'll need to determine how to account for that. There's nothing here that relates to what your original question was, so I still don't see how you're going to answer that original question. 

The last step here calculates the percentiles by state/program.

 

*create sample data;
data have;
    format state $8. lob_detail1 $20. incmo yymmn6.;
    input state $ lob_detail1 $ incmo : yymmn6. PAID_0 UIC_11;
    cards;
State_1 LGU9 201601 52522.2331334333 384118.697376312
State_1 LGU9 201602 79224.5659670165 376476.388305847
State_1 LGU9 201603 89073.8553223388 458834.052098951
State_1 LGU9 201604 69705.212143928 377996.552023988
State_2 LGU9 201601 244516.245877061 1135764.52848576
State_2 LGU9 201602 219085.415292354 1199894.13418291
State_2 LGU9 201603 372887.598950525 1328352.57946027
State_2 LGU9 201604 319201.644677661 1253976.81184408
State_3 LGU9 201601 169107.013493253 812694.248125937
State_3 LGU9 201602 223892.78035982 826864.192653673
State_3 LGU9 201603 261290.847076462 1049739.1131934
State_3 LGU9 201604 219798.977511244 976374.643928036
State_3 LGUR 201601 12979.0554722639 88326.0428785607
State_3 LGUR 201602 21877.3733133433 87650.8598950525
State_3 LGUR 201603 22726.9617691154 77036.240029985
State_3 LGUR 201604 23083.0509745127 80896.0398050975
State_4 LGU9_ANH 201601 17909.3590704648 101541.369790105
State_4 LGU9_ANH 201602 22045.9145427286 133599.366116942
State_4 LGU9_ANH 201603 23200.1476761619 125372.969865067
State_4 LGU9_ANH 201604 35204.0277361319 170372.117241379
State_4 LGU9 201601 164997.991004498 764712.747376312
State_4 LGU9 201602 183273.077961019 881854.636431784
State_4 LGU9 201603 206703.428785607 1113777.77961019
State_4 LGU9 201604 233976.941529235 977173.073463268
;
run;

proc sort data=have;
    by state lob_detail1;
run;

*calculate percent paid;
data details;
    set have;
    by state lob_detail1;
    percent=paid_0 / uic_11;
run;

*Plot by month for each state and program;
proc sgplot data=details;
    by state lob_detail1;
    series x=incmo y=percent;
run;

*Plot distribution ignoring month;
proc univariate data=details;
    by state lob_detail1;
    var percent;
    cdfplot percent;
    output out=want p10=p10 p99=p99 p95=p95;
run;
acemanhattan
Quartz | Level 8

@Reeza

 

I appreciate your efforts on this, but I think we might sort of be on different pages.

 

First, to your comment that this doesn't answer my original question: I assume you mean that this doesn't solve the business problem. In other words, I won't be able to predict the ultimate incurred claims for a given line of business in a given month using this method. I don't disagree that the solution I'm asking for isn't the most accurate solution, but I think it is a solution to the business problem. For example, if I look back over ten years of historical data and see that the 10th percentile of completeness for a given line of business is .15, and I see there's been 1,000,000 in payments for the first month, then I am comfortable saying things like "no, I don't think we should set reserves at 20,000,000. The historical data tells me that 90% of outcomes are less than 6,666,666. Let's set that as our upper bound." Are we on the same page here?

 

Second, the solution I'm looking for shouldn't return the kth percentile across all months for a given state and line of business; it should, for each incmo, return the kth percentile (for that state and line of business) of only the months that preceded it.  The dataset created by the solution you posted seems to just return the 10th, 95th, and 99th percentile for any given state and line of business, across all incmos.  Note that when I posted an example of the data I want, it has a 10th percentile entry for each incmo (aside from the first incmo in a group, because there's no historical data at that point).

 

I think my title was poorly chosen: "Calculate Kth Percentile of Sorted Group Based on Previous Observations" should have been "For each observation in sorted group, calculate kth percentile of previous observations within that group".

Reeza
Super User

You need multiple records to return percentiles and you only have one record per month so that’s why I don’t see how this answers your question. Not sure if it’s an issue with the data or communication. 

 

For the record, I am an actuary. 

acemanhattan
Quartz | Level 8

@Reeza

 

Great to meet a fellow actuary!  I'm not fully credentialed yet, but am moving through exams and working as a healthcare actuary.  What field are you in?

 

I think it's a communication issue. Let's talk again about this data:

 

state lob_detail1 incmo PAID_0 UIC_11
State_1 LGU9 201601 52522.2331334333 384118.697376312
State_1 LGU9 201602 79224.5659670165 376476.388305847
State_1 LGU9 201603 89073.8553223388 458834.052098951
State_1 LGU9 201604 69705.212143928 377996.552023988
State_2 LGU9 201601 244516.245877061 1135764.52848576
State_2 LGU9 201602 219085.415292354 1199894.13418291
State_2 LGU9 201603 372887.598950525 1328352.57946027
State_2 LGU9 201604 319201.644677661 1253976.81184408
State_3 LGU9 201601 169107.013493253 812694.248125937
State_3 LGU9 201602 223892.78035982 826864.192653673
State_3 LGU9 201603 261290.847076462 1049739.1131934
State_3 LGU9 201604 219798.977511244 976374.643928036
State_3 LGUR 201601 12979.0554722639 88326.0428785607
State_3 LGUR 201602 21877.3733133433 87650.8598950525
State_3 LGUR 201603 22726.9617691154 77036.240029985
State_3 LGUR 201604 23083.0509745127 80896.0398050975
State_4 LGU9_ANH 201601 17909.3590704648 101541.369790105
State_4 LGU9_ANH 201602 22045.9145427286 133599.366116942
State_4 LGU9_ANH 201603 23200.1476761619 125372.969865067
State_4 LGU9_ANH 201604 35204.0277361319 170372.117241379
State_4 LGU9 201601 164997.991004498 764712.747376312
State_4 LGU9 201602 183273.077961019 881854.636431784
State_4 LGU9 201603 206703.428785607 1113777.77961019
State_4 LGU9 201604 233976.941529235 977173.073463268

 

The first observation is for a particular line of business in state 1, and PAID_0 tells us how much was paid in 1/2016 for services ALL members received in 1/2016 for that state and line of business. Since not all claims are paid in the month the service was received (I know you know this; just trying to clarify what the data represents), we would expect some sort of claims run-out period over which claims for 1/2016 would continue to be paid. UIC_11 is the total aggregate payments that were eventually paid for 1/2016 services (so, say, a year later, we look back and ask ourselves "hey, when it was all said and done, how much in total was paid for 1/2016 services?"; UIC_11 is the answer to this question). We can then say that 201601 was, though we didn't know it at the time, 52522/384118 = .1367 complete at the end of 1/2016.

 

The second observation is analogous, except it's relative to payments made for services received in 2/2016: it turns out that, though we didn't know it at the time, 201602 was .21 complete at the end of 2/2016.

 

Do you see how I now have 2 data points (1 for each month) for that given state and line of business? These give me the multiple records I'm interested in.  I want to look at the historical data and say "how complete was that state and line of business at the time, even though we didn't know it?" (these are the completion factors we get from paid_0/UIC_11).  Then I want to look at the sample of completion factors and answer questions about the sample that was available to us at the time (average, std deviation, percentiles, etc). 
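 

For reference, here is a minimal DATA-step sketch of that rolling calculation. It assumes the details dataset built in the code above (percent = PAID_0/UIC_11, one row per state/lob_detail1/incmo), no missing values of percent, and at most &maxobs months per group; the percentile follows the Excel PERCENTILE.INC interpolation discussed earlier, and names such as want2 and pctl_k are just placeholders:

*expanding-window percentile of prior-month CFs within each group;
%let k = 0.10;       *percentile to test;
%let maxobs = 200;   *assumed upper bound on months per group;

proc sort data=details;
    by state lob_detail1 incmo;
run;

data want2;
    set details;
    by state lob_detail1;
    array hist[&maxobs] _temporary_;   *CFs of months seen so far in the group;
    array srt[&maxobs] _temporary_;    *scratch copy for sorting;
    if first.lob_detail1 then n = 0;   *reset the history at each new group;

    pctl_k = .;                        *stays missing for the first month of a group;
    if n >= 1 then do;
        do i = 1 to n;                 *copy the n prior values;
            srt[i] = hist[i];
        end;
        do i = n + 1 to &maxobs;       *pad unused slots so they sort last;
            srt[i] = constant('BIG');
        end;
        call sortn(of srt[*]);
        h = &k * (n - 1) + 1;          *Excel PERCENTILE.INC rank, then linear interpolation;
        j = floor(h);
        if j >= n then pctl_k = srt[n];
        else pctl_k = srt[j] + (h - j) * (srt[j + 1] - srt[j]);
    end;

    n + 1;                             *append the current CF for later rows;
    hist[n] = percent;
    drop i h j n;
run;

On the sample data this should reproduce the hand-calculated values earlier in the thread (0.136734, then 0.144105, then 0.148214 for State_1/LGU9), and PAID_0/pctl_k then gives the back-tested estimate of UIC_11 for each month.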
