BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
danielchoi626
Calcite | Level 5

Long story short, I've been trying to calculate delinquent days for customers using the lag() function with no success and I'd like some help.

 

To elaborate, I am currently working with a table that contains customers' IDs, start dates for delinquencies and end dates for delinquencies among other variables. What I'm trying to do is figure out the total number of delinquent days each customer has whilst accounting for any gaps or overlaps in the dates. 

 

My idea is to shift rows down by customer id and compare dates in order to create a set of dates that take into account potential gaps and overlaps in delinquencies to calculate customers' delinquent days. This would ideally be achieved through the following steps:

 

1. Sort rows by customer id and dates in ascending order. This I achieved through PROC SORT:

PROC SORT DATA = SAMPLE_DATA;
BY CUSTOMER_ID DELINQUENCY_START_DATE DELINQUENCY_END_DATE;
RUN;

2. Shift dates down by a single row using the LAG() function:

DATA SAMPLE_DATA_1;
SET SAMPLE_DATA;
        LAG_START = PUT(LAG(DELINQUENCY_START_DATE), YYMMDD10.);
        LAG_END   = PUT(LAG(DELINQUENCY_END_DATE), YYMMDD10.);
RUN;

3. Compare dates and create finalized date columns using the following logic:

/* Couldn't figure out how to make this logic run so please consider everything else from this point onwards pseudocode */
IF DELINQUENCY_START_DATE < LAG_START THEN FINAL_START = DELINQUENCY_START_DATE; ELSE FINAL_START = DELINQUENCY_END_DATE;
IF LAG_END < DELINQUENCY_START_DATE THEN FINAL_END = LAG_END; ELSE FINAL_END = DELINQUENCY_START_DATE;

4. Change the last value of FINAL_END to MAX(DELINQUENCY_END_DATE)

(MAX(DELINQUENCY_END_DATE) should be the most recent delinquency record)

IF FINAL_END = MAX(FINAL_END) THEN FINAL_END = MAX(DELINQUENCY_END_DATE)

I've gotten as far as step 2, currently at a loss as to how to implement steps 3 and 4. Not sure how I should be shifting rows down by groups either. I've come up with a working Python version of the program along with a validation function to better demonstrate what I'm trying to create here. I've added attached a sample dataset to this post, any form of help would be very much appreciated.  

1 ACCEPTED SOLUTION

Accepted Solutions
Tom
Super User Tom
Super User

If you just want the total days then no need to worry about the gaps or overlaps.  Just fill a temporary array and sum to get the count.

First choose some reasonable boundaries for the time period of interest.  You could calculate them but why bother?

data want;
  set have ;
   by Customer_Id ;
   array d [%sysevalf('01JAN2000'd) : %sysevalf('31DEC2025'd)] _temporary_;
   if first.Customer_Id then call missing(of d[*]);
   do date=Delinquency_Start_Date to Delinquency_End_Date;
      d[date] =1;
   end;
   if last.Customer_Id then do;
      days = max(0,sum(of d[*]));
      output;
   end;
   keep Customer_Id days;
run;

Result

       Customer_
Obs       Id        days

 1         A        1007
 2         B        1046

View solution in original post

8 REPLIES 8
danielchoi626
Calcite | Level 5

To better demonstrate what I'm going for, here's the Python version of the program:

 
import pandas as pd
import numpy as np
import os
import string
import datetime as dt
 
def time_processor(df, start_col, end_col, key_col):
    df_1 = df[[key_col, start_col, end_col]]

    # Use sort_values to sort data and the groupby function to generate a .groupby() object
    groupob = df_1.sort_values(by = list(df_1.columns)).groupby(key_col)

    # Shift rows down by groups. This method will always generate one null row per group
    df_1[f"Lag {start_col}"] = groupob.shift()[start_col]
    df_1[f"Lag {end_col}"]   = groupob.shift()[end_col]

    # Null values are dropped to ensure they don't interfere with calculations
    df_2 = df_1.dropna().sort_values(by = [key_col, start_col, end_col])

    # Lambda function returns start_date if start_date is earlier than lag_start_date
    df_2[f"True {start_col}"] = df_2.apply(lambda x: x[start_col] if x[start_col] < x[f"Lag {start_col}"] else x[f"Lag {start_col}"], axis = 1)

    # Lambda function returns lag_end_date if lag_end_date is earlier than start_date
    df_2[f"True {end_col}"]   = df_2.apply(lambda x: x[f"Lag {end_col}"] if x[f"Lag {end_col}"] < x[start_col] else x[start_col], axis = 1)

    # The following for loop creates a table for each unique value in the customer_id column, changes the final value
    # of the final_end_date column into the latest date of the delinquency_end_date column and appends it to a list
    # for reassembly
    new_list = []

    for i in df_2[key_col].unique():
        df_3 = df_2[df_2[key_col] == i].reset_index(drop = True)
        df_3.loc[:, f"True {end_col}"][len(df_3) - 1] = df_3[end_col].max()
        new_list.append(df_3)

    return pd.concat(new_list)    

def validation_func(df, start_col, end_col, key_col):
    # This function is based on the idea that if the previous function had worked correctly,
    # dates would be changing in a monotonic fashion. 
    valid_dict = {
        "Start Count" : [],
        "End Count"   : [], 
        "Count"       : []
    }

    # The following nested loop compares an observation to the observation that comes after it to
    # confirm that dates are indeed changing monotonically. Start_num and end_num are
    # incremented by 1 for every observation that satisfies the above condition. 
    for i in df[key_col].unique():
        df_1 = df[df[key_col] == i]

        start_list = list(df_1[start_col])
        end_list   = list(df_1[end_col])

        start_num, end_num, count_num = 0, 0, 0

        while True:
            if start_list[count_num] <= start_list[count_num + 1]:
                start_num += 1

            if end_list[count_num] <= end_list[count_num + 1]:
                end_num += 1

            count_num += 1

            if count_num == len(df_1) - 1:
                start_num += 1
                end_num   += 1
                count_num += 1
                break            

        for (a, b) in zip(valid_dict, [start_num, end_num, count_num]):
            valid_dict[a].append(b)

    # If the sum of start_count and the sum of end_count equals the length of the dataset, 
    # that would mean that dates are changing monotonically throughout the entire dataset
    # and thus the results are correct. 
    if sum(valid_dict["Start Count"]) == len(df) & sum(valid_dict["End Count"]) == len(df):
        print("Start Count and End Count match dataset length")
    else:
        print("Start Count and End Count do not match dataset length")

    return valid_dict

if __name__ == "__main__":
    letters = list(string.ascii_uppercase)

    lambda_func = lambda x: f"0{x}" if x < 10 else str(x)

    df = pd.DataFrame({
        "Customer Id"            : [letters[np.random.randint(len(letters[0:6]))] for i in range(100)],
        "Delinquency Start Date" : [pd.to_datetime(f"{np.random.randint(2014, 2015)}-{lambda_func(np.random.randint(1, 12))}-{lambda_func(np.random.randint(1, 29))}") for i in range(100)],
        "Delinquency End Date"   : [pd.to_datetime(f"{np.random.randint(2016, 2017)}-{lambda_func(np.random.randint(1, 12))}-{lambda_func(np.random.randint(1, 29))}") for i in range(100)]
    })

    # df = pd.DataFrame({
    #     "Customer Id"            : ["A", "A", "A"], 
    #     "Delinquency Start Date" : ["2018-01-31", "2018-03-31", "2018-04-15"],
    #     "Delinquency End Date"   : ["2018-02-14", "2018-07-02", "2018-05-13"]
    # })

    new_df     = time_processor(df, "Delinquency Start Date", "Delinquency End Date", "Customer Id")
    valid_dict = validation_func(new_df, "True Delinquency Start Date", "True Delinquency End Date", "Customer Id") 

    new_df["Del. Days"] = ((pd.to_datetime(new_df["True Delinquency End Date"]) - pd.to_datetime(new_df["True Delinquency Start Date"])).astype(str).str.zfill(10).str[0:5]).astype(int)
Tom
Super User Tom
Super User

Please explain WHAT that code is doing.

 

I see this comment buried in the middle of the long mess

  # The following nested loop compares an observation to the observation that comes after it to
  # confirm that dates are indeed changing monotonically.

So is the goal to fill in any gaps between the start and stop dates of adjacent records for a customer?

Or what is the goal?

danielchoi626
Calcite | Level 5

Example.png

Using this pic as an example, the program is meant to calculate the total days delinquent whilst taking into account any potential gaps or overlaps in delinquent periods for a given customer. I compare dates in the program to make sure that gaps between delinquent periods such as the gap between Loan 1 and Loan 2 are accounted for. Given that Loan 3's delinquent period overlaps with Loan 2's delinquent period, this customer would have a total of 107 days delinquent (14 + 93). Calculating the difference between this customer's earliest delinquency record and the customer's latest delinquency record would give a total of 152 days delinquent which is not true as there is a gap between Loan 1 and Loan 2. 

Tom
Super User Tom
Super User

If you just want the total days then no need to worry about the gaps or overlaps.  Just fill a temporary array and sum to get the count.

First choose some reasonable boundaries for the time period of interest.  You could calculate them but why bother?

data want;
  set have ;
   by Customer_Id ;
   array d [%sysevalf('01JAN2000'd) : %sysevalf('31DEC2025'd)] _temporary_;
   if first.Customer_Id then call missing(of d[*]);
   do date=Delinquency_Start_Date to Delinquency_End_Date;
      d[date] =1;
   end;
   if last.Customer_Id then do;
      days = max(0,sum(of d[*]));
      output;
   end;
   keep Customer_Id days;
run;

Result

       Customer_
Obs       Id        days

 1         A        1007
 2         B        1046
Tom
Super User Tom
Super User

The PUT() function is for converting values to text.  If you use a numeric format like YYMMDD10. then the value being based must be a numeric number of says since start of 1960.

 

Why are you making CHARACTER variables with names like LAG_START and LAG_END?  You will not be able to do any arithmetic with those character variables.

 

If instead you have actual numeric date values you could get the difference in days by just subtracting.

 

Why are you comparing the existing numeric variable DELINQUENCY_START_DATE to the new character variable LAG_START ?

 

So perhaps you want to start with something like this:

data sample_data_1;
  set sample_data;
  by customer_id ;
  lag_start = lag(delinquency_start_date);
  lag_end   = lag(delinquency_end_date);
  if first.customer_id then call missing(of lag_start lag_end);
  format lag_start lag_end yymmdd10.;
Tom
Super User Tom
Super User

Please show your input data as TEXT , not attachments.  Best is to post code that creates the data so the variable names, types and any required formats are clear.

Then show the desired output data.

Explain in words what is the transformation you are trying to do.

If you tried code show your code and explain how it did not work as you intended.

 

If you are unsure how to write a simple data step you can use this %DS2POST macro to create it for you from your dataset 

So run this code:

* Pull macro definition from GITHUB and dump code to the SAS log ;
filename ds2post url
  'https://raw.githubusercontent.com/sasutils/macros/master/ds2post.sas'
;
%include ds2post ;
%ds2post(sample_data)

And then copy the lines of code it writes to the SAS log into your SAS communities post.

 

Although for you simple little file it is just easier to write the data lines yourself.

data _null_;
  set sample_data;
  put (_all_) (+0);
run;

Then copy them into the data step you want to share:

data have;
  input Customer_Id $ (Delinquency_Start_Date Delinquency_End_Date) (:yymmdd.);
  format Delinquency_Start_Date Delinquency_End_Date yymmdd10.;
cards;
A 2014-02-01 2016-04-24
A 2014-02-20 2016-06-05
A 2014-03-14 2016-09-28
A 2014-03-21 2016-06-23
A 2014-03-21 2016-11-03
A 2014-04-01 2016-10-02
A 2014-06-03 2016-06-24
A 2014-06-13 2016-07-17
A 2014-06-28 2016-03-22
A 2014-07-27 2016-06-01
A 2014-08-06 2016-04-26
A 2014-09-02 2016-09-07
A 2014-09-11 2016-06-18
A 2014-10-01 2016-02-20
A 2014-10-13 2016-09-11
B 2014-01-01 2016-11-11
B 2014-01-15 2016-06-14
B 2014-03-16 2016-04-17
B 2014-04-15 2016-02-04
B 2014-04-24 2016-03-10
B 2014-06-06 2016-10-13
B 2014-06-24 2016-07-21
B 2014-07-12 2016-03-14
B 2014-07-16 2016-04-22
B 2014-08-12 2016-10-03
B 2014-09-01 2016-03-19
B 2014-09-16 2016-07-20
B 2014-10-07 2016-08-07
B 2014-10-11 2016-02-08
B 2014-10-12 2016-04-06
B 2014-11-02 2016-07-24
;

So looking at the first two records

A 2014-02-01 2016-04-24
A 2014-02-20 2016-06-05

you have overlapping date ranges. S1 < S2 < E1 < E2

Is the goal to collapse that into one range from S1 to E2 ?

Or are you trying to do something else?

danielchoi626
Calcite | Level 5

Yes, that is what I am going for

Reeza
Super User

How big is your data set?

And do you have a lot of RAM on your computer? 

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 8 replies
  • 780 views
  • 2 likes
  • 3 in conversation