About danielchoi626

danielchoi626 · ‎06-10-2022

Yes, that is what I am going for

danielchoi626 · ‎06-10-2022

Using this pic as an example, the program is meant to calculate the total days delinquent whilst taking into account any potential gaps or overlaps in delinquent periods for a given customer. I compare dates in the program to make sure that gaps between delinquent periods such as the gap between Loan 1 and Loan 2 are accounted for. Given that Loan 3's delinquent period overlaps with Loan 2's delinquent period, this customer would have a total of 107 days delinquent (14 + 93). Calculating the difference between this customer's earliest delinquency record and the customer's latest delinquency record would give a total of 152 days delinquent which is not true as there is a gap between Loan 1 and Loan 2.
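The arithmetic above can be checked with a minimal sketch. The picture itself is not reproduced here, so the three loan periods below are an assumption: they are taken from the commented-out sample frame in the Python version of the program, which yields the same 14 + 93 and 152 figures. The helper name `total_delinquent_days` is hypothetical.

```python
from datetime import date


def total_delinquent_days(periods):
    """Sum the days covered by (start, end) date pairs, counting overlaps once."""
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(periods):
        if cur_end is None or start > cur_end:
            # Gap (or first period): close the previous merged period, open a new one
            if cur_end is not None:
                total += (cur_end - cur_start).days
            cur_start, cur_end = start, end
        else:
            # Overlap: extend the current merged period if this one ends later
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += (cur_end - cur_start).days
    return total


# Assumed sample periods (Loan 1, Loan 2, Loan 3)
loans = [
    (date(2018, 1, 31), date(2018, 2, 14)),
    (date(2018, 3, 31), date(2018, 7, 2)),
    (date(2018, 4, 15), date(2018, 5, 13)),
]

merged_days = total_delinquent_days(loans)
naive_days = (max(e for _, e in loans) - min(s for s, _ in loans)).days
print(merged_days, naive_days)  # 107 152
```

The naive figure treats the gap between Loan 1 and Loan 2 as delinquent, which is why it overstates the total.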

danielchoi626 · ‎06-10-2022

To better demonstrate what I'm going for, here's the Python version of the program:

    import string

    import numpy as np
    import pandas as pd


    def time_processor(df, start_col, end_col, key_col):
        # Sort so that shifting within each group compares consecutive periods;
        # .copy() avoids writing into a view of the caller's DataFrame
        df_1 = df[[key_col, start_col, end_col]].sort_values(by=[key_col, start_col, end_col]).copy()
        groupob = df_1.groupby(key_col)

        # Shift rows down by groups. This method will always generate one null row per group
        df_1[f"Lag {start_col}"] = groupob[start_col].shift()
        df_1[f"Lag {end_col}"] = groupob[end_col].shift()

        # Null values are dropped to ensure they don't interfere with calculations
        df_2 = df_1.dropna().sort_values(by=[key_col, start_col, end_col])

        # Lambda function returns start_date if start_date is earlier than lag_start_date
        df_2[f"True {start_col}"] = df_2.apply(
            lambda x: x[start_col] if x[start_col] < x[f"Lag {start_col}"] else x[f"Lag {start_col}"],
            axis=1,
        )

        # Lambda function returns lag_end_date if lag_end_date is earlier than start_date
        df_2[f"True {end_col}"] = df_2.apply(
            lambda x: x[f"Lag {end_col}"] if x[f"Lag {end_col}"] < x[start_col] else x[start_col],
            axis=1,
        )

        # The following for loop creates a table for each unique value in the customer_id column,
        # changes the final value of the final_end_date column into the latest date of the
        # delinquency_end_date column and appends it to a list for reassembly
        new_list = []
        for i in df_2[key_col].unique():
            df_3 = df_2[df_2[key_col] == i].reset_index(drop=True)
            # Single .loc call instead of chained indexing, so the assignment reliably sticks
            df_3.loc[len(df_3) - 1, f"True {end_col}"] = df_3[end_col].max()
            new_list.append(df_3)

        return pd.concat(new_list)


    def validation_func(df, start_col, end_col, key_col):
        # This function is based on the idea that if the previous function had worked correctly,
        # dates would be changing in a monotonic fashion.
        valid_dict = {"Start Count": [], "End Count": [], "Count": []}

        # The following loop compares an observation to the observation that comes after it to
        # confirm that dates are indeed changing monotonically. start_num and end_num are
        # incremented by 1 for every observation that satisfies the above condition.
        for i in df[key_col].unique():
            df_1 = df[df[key_col] == i]
            start_list = list(df_1[start_col])
            end_list = list(df_1[end_col])

            # The last row of a group has no successor, so it always counts as valid (the leading 1)
            start_num = 1 + sum(a <= b for a, b in zip(start_list, start_list[1:]))
            end_num = 1 + sum(a <= b for a, b in zip(end_list, end_list[1:]))
            count_num = len(df_1)

            for (a, b) in zip(valid_dict, [start_num, end_num, count_num]):
                valid_dict[a].append(b)

        # If the sum of start_count and the sum of end_count equals the length of the dataset,
        # that would mean that dates are changing monotonically throughout the entire dataset
        # and thus the results are correct.
        # Note: "and" rather than "&", which binds tighter than "==" and breaks the comparison
        if sum(valid_dict["Start Count"]) == len(df) and sum(valid_dict["End Count"]) == len(df):
            print("Start Count and End Count match dataset length")
        else:
            print("Start Count and End Count do not match dataset length")

        return valid_dict


    if __name__ == "__main__":
        letters = list(string.ascii_uppercase)
        lambda_func = lambda x: f"0{x}" if x < 10 else str(x)

        df = pd.DataFrame({
            "Customer Id": [letters[np.random.randint(len(letters[0:6]))] for i in range(100)],
            "Delinquency Start Date": [
                pd.to_datetime(f"{np.random.randint(2014, 2015)}-{lambda_func(np.random.randint(1, 12))}-{lambda_func(np.random.randint(1, 29))}")
                for i in range(100)
            ],
            "Delinquency End Date": [
                pd.to_datetime(f"{np.random.randint(2016, 2017)}-{lambda_func(np.random.randint(1, 12))}-{lambda_func(np.random.randint(1, 29))}")
                for i in range(100)
            ],
        })

        # df = pd.DataFrame({
        #     "Customer Id" : ["A", "A", "A"],
        #     "Delinquency Start Date" : ["2018-01-31", "2018-03-31", "2018-04-15"],
        #     "Delinquency End Date" : ["2018-02-14", "2018-07-02", "2018-05-13"]
        # })

        new_df = time_processor(df, "Delinquency Start Date", "Delinquency End Date", "Customer Id")
        valid_dict = validation_func(new_df, "True Delinquency Start Date", "True Delinquency End Date", "Customer Id")

        # .dt.days reads the day count directly instead of slicing the timedelta's string form
        new_df["Del. Days"] = (
            pd.to_datetime(new_df["True Delinquency End Date"])
            - pd.to_datetime(new_df["True Delinquency Start Date"])
        ).dt.days

danielchoi626 · ‎06-10-2022

Long story short, I've been trying to calculate delinquent days for customers using the lag() function with no success and I'd like some help.

To elaborate, I am currently working with a table that contains customers' IDs, start dates for delinquencies and end dates for delinquencies among other variables. What I'm trying to do is figure out the total number of delinquent days each customer has whilst accounting for any gaps or overlaps in the dates.

My idea is to shift rows down by customer id and compare dates in order to create a set of dates that take into account potential gaps and overlaps in delinquencies to calculate customers' delinquent days. This would ideally be achieved through the following steps:

1. Sort rows by customer id and dates in ascending order. This I achieved through PROC SORT:

    PROC SORT DATA = SAMPLE_DATA;
        BY CUSTOMER_ID DELINQUENCY_START_DATE DELINQUENCY_END_DATE;
    RUN;

2. Shift dates down by a single row using the LAG() function:

    DATA SAMPLE_DATA_1;
        SET SAMPLE_DATA;
        LAG_START = PUT(LAG(DELINQUENCY_START_DATE), YYMMDD10.);
        LAG_END = PUT(LAG(DELINQUENCY_END_DATE), YYMMDD10.);
    RUN;

3. Compare dates and create finalized date columns using the following logic:

    /* Couldn't figure out how to make this logic run so please consider
       everything else from this point onwards pseudocode */
    IF DELINQUENCY_START_DATE < LAG_START THEN FINAL_START = DELINQUENCY_START_DATE;
    ELSE FINAL_START = DELINQUENCY_END_DATE;

    IF LAG_END < DELINQUENCY_START_DATE THEN FINAL_END = LAG_END;
    ELSE FINAL_END = DELINQUENCY_START_DATE;

4. Change the last value of FINAL_END to MAX(DELINQUENCY_END_DATE) (MAX(DELINQUENCY_END_DATE) should be the most recent delinquency record):

    IF FINAL_END = MAX(FINAL_END) THEN FINAL_END = MAX(DELINQUENCY_END_DATE)

I've gotten as far as step 2 and am currently at a loss as to how to implement steps 3 and 4. I'm not sure how I should be shifting rows down by groups either.

I've come up with a working Python version of the program along with a validation function to better demonstrate what I'm trying to create here. I've attached a sample dataset to this post; any form of help would be very much appreciated.
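For the "shift rows down by groups" part, here is a hedged pandas sketch of one way the per-customer comparison could be arranged. It swaps the plain one-row lag for a running maximum of earlier end dates, so a long period that swallows several later ones is still handled. This is an illustration under assumptions, not the program described above: the function name `merged_delinquent_days`, the helper column `block`, and the sample rows are all hypothetical.

```python
import pandas as pd


def merged_delinquent_days(df, key_col, start_col, end_col):
    """Total delinquent days per key, with overlapping periods counted once."""
    out = df.sort_values([key_col, start_col, end_col]).copy()

    # Latest end date among the EARLIER rows of the same customer
    # (the grouped shift yields NaT on each customer's first row)
    running_max_end = out.groupby(key_col)[end_col].cummax()
    prev_max_end = running_max_end.groupby(out[key_col]).shift()

    # A row opens a new block when it starts after every earlier period has ended
    new_block = prev_max_end.isna() | (out[start_col] > prev_max_end)
    out["block"] = new_block.astype(int).groupby(out[key_col]).cumsum()

    # Collapse each block to one merged period, then sum its length per customer
    blocks = out.groupby([key_col, "block"]).agg(
        start=(start_col, "min"), end=(end_col, "max")
    )
    blocks["days"] = (blocks["end"] - blocks["start"]).dt.days
    return blocks.groupby(level=0)["days"].sum()


sample = pd.DataFrame({
    "Customer Id": ["A", "A", "A", "B"],
    "Delinquency Start Date": pd.to_datetime(
        ["2018-01-31", "2018-03-31", "2018-04-15", "2018-01-01"]
    ),
    "Delinquency End Date": pd.to_datetime(
        ["2018-02-14", "2018-07-02", "2018-05-13", "2018-01-11"]
    ),
})

result = merged_delinquent_days(
    sample, "Customer Id", "Delinquency Start Date", "Delinquency End Date"
)
print(result)
```

Because the shift is done inside each group, the first row of a customer never sees the previous customer's dates, which is the behaviour step 2 needs.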

danielchoi626 · ‎10-02-2021

I've been trying to split a master dataset into several smaller datasets based on a category variable. The problem is that this category variable contains over 50 different categories, and I would like to keep my code as simple as possible using a loop. To demonstrate what I mean, this is the method I would've used if I were using Python:

    df_A = []
    for i in df_1["Category"].unique():
        df_A.append(df_1[df_1["Category"] == i])

Is there a way to do something similar in SAS using loops? I've tried the following code to no success.

    %MACRO DATA_SEPARATOR(CATEGORY = );
        DATA DUMMY_&CATEGORY;
            SET DUMMY;
            IF CATEGORY = "&CATEGORY";
        RUN;
    %MEND;

    DATA DUMMY;
        INPUT INDEX CATEGORY $4. FIGURES;
        CARDS;
    0 A 2744
    1 A 2874
    2 A 823
    3 A 1411
    4 A 2168
    5 A 2816
    6 A 1212
    7 A 2294
    8 A 433
    9 B 1137
    10 B 2857
    11 B 2417
    12 B 2348
    13 B 762
    14 B 836
    15 B 684
    16 B 1869
    17 B 912
    18 B 2159
    19 B 1388
    20 B 1477
    21 C 836
    22 C 2846
    23 C 1173
    24 C 1138
    ;
    RUN;

    %DO I = "A", "B", "C";
        %DATA_SEPARATOR(CATEGORY = &I);
    %END;

On a related note, is there a way to create a unique table or list of values in the "Category" column for use in the loop? I'm well aware that PROC SQL or PROC SORT NODUPKEY can be used for this purpose; I'm just not sure how to incorporate said values into the loop from a table, which is why I manually typed out the unique values in the category variable.
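On the Python side, the unique values do not have to be listed by hand either. A minimal sketch, assuming a small stand-in for `df_1` (the rows below are a shortened, hypothetical subset of the sample data), lets `groupby` discover the categories and keys each subset by its category value:

```python
import pandas as pd

# Shortened, hypothetical stand-in for the master dataset
df_1 = pd.DataFrame({
    "Category": ["A", "A", "B", "B", "B", "C"],
    "Figures": [2744, 2874, 1137, 2857, 2417, 836],
})

# groupby yields one (category, sub-frame) pair per unique value,
# so the list of categories never has to be typed out
subsets = {
    category: group.reset_index(drop=True)
    for category, group in df_1.groupby("Category")
}

print(sorted(subsets))      # ['A', 'B', 'C']
print(len(subsets["B"]))    # 3
```

A dictionary keyed by category keeps each subset addressable by name, which is closer to what the `DUMMY_&CATEGORY` naming is trying to achieve than a plain list.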

danielchoi626 · ‎07-06-2021

I am currently working on a dataset that has duplicates in the primary key as shown in the example table I have given below.

    PROC SORT DATA = SAMPLE_TABLE NODUPKEY;
        BY ID_NO;
    RUN;

However, dropping duplicates using the above snippet resulted in values being lost. I would very much appreciate some advice on how to drop duplicates and preserve non-null values.
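To make the goal concrete, here is a small pandas sketch of "one row per key, keeping the first non-null value in each column". The example table from the post is not available, so the columns `NAME` and `SCORE` and all values are hypothetical; only `ID_NO` comes from the snippet above.

```python
import numpy as np
import pandas as pd

# Hypothetical table: each ID_NO appears twice with complementary gaps
sample = pd.DataFrame({
    "ID_NO": [1, 1, 2, 2, 3],
    "NAME": ["Ann", None, None, "Bob", "Cy"],
    "SCORE": [np.nan, 10.0, 20.0, np.nan, 30.0],
})

# GroupBy.first() takes, per column, the first non-null entry in each group,
# so values are not lost just because they sit on the "second" duplicate row
deduped = sample.groupby("ID_NO", as_index=False).first()
print(deduped)
```

Keeping only the first row per key, which is what a key-based de-duplication does, would have dropped the score for ID 1 and the name for ID 2 in this example.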

danielchoi626 · ‎06-25-2021

I am currently in the process of converting a Python script to SAS / PROC SQL, and whilst everything has been going smoothly, I've been stumped by an odd case where calculating Z-Scores with the following scripts gives me different answers.

Python:

    HO_HOME_PROVINCE_MEAN = (
        df_A10[["HOME_PROVINCE", "EUCLIDEAN_HOME_OFFICE_DIST_M"]]
        .groupby("HOME_PROVINCE")
        .mean()
        .rename(columns={"EUCLIDEAN_HOME_OFFICE_DIST_M": "HO_HOME_PROVINCE_MEAN"})
        .reset_index()
    )
    HO_HOME_PROVINCE_STD = (
        df_A10[["HOME_PROVINCE", "EUCLIDEAN_HOME_OFFICE_DIST_M"]]
        .groupby("HOME_PROVINCE")
        .std()
        .rename(columns={"EUCLIDEAN_HOME_OFFICE_DIST_M": "HO_HOME_PROVINCE_STD"})
        .reset_index()
    )
    df_A11 = pd.merge(df_A10, HO_HOME_PROVINCE_MEAN, on="HOME_PROVINCE", how="left")
    # Merge onto df_A11 (not df_A10) so the mean column from the previous merge is kept
    df_A11 = pd.merge(df_A11, HO_HOME_PROVINCE_STD, on="HOME_PROVINCE", how="left")
    # Column names match the PROVINCE columns created by the renames above
    df_A11["HO_HOME_PROVINCE_ZSCORE"] = (
        df_A11["EUCLIDEAN_HOME_OFFICE_DIST_M"] - df_A11["HO_HOME_PROVINCE_MEAN"]
    ) / df_A11["HO_HOME_PROVINCE_STD"]

SAS:

    PROC SQL;
        CREATE TABLE DEVDAT11 AS
        SELECT *
        FROM DEVDAT10 AS A
        LEFT JOIN HOME_PROVINCE_MEAN AS B
            ON A.HOME_PROVINCE = B.HOME_PROVINCE;

    PROC SQL;
        CREATE TABLE DEVDAT12 AS
        SELECT *
        FROM DEVDAT11 AS A
        LEFT JOIN HOME_PROVINCE_STD AS B
            ON A.HOME_PROVINCE = B.HOME_PROVINCE;

    DATA WORK.DEVDAT13;
        SET DEVDAT12;
        HO_HOME_PROVINCE_ZSCORE = (EUCLIDEAN_HOME_OFFICE_DIST_M - HO_HOME_PROVINCE_MEAN) / HO_HOME_PROVINCE_STD;
    RUN;

I've checked for miscalculations using .sum() in Python and PROC MEANS SUM in SAS, and so far it doesn't seem like I've made any mistakes on that front. I suspect that it might have something to do with the way Python handles null values or zeroes in rows as opposed to SAS, as I keep getting this message from the log:

    NOTE: Division by zero detected at line 2202 column 90.

The file that I've attached to this post contains a zero and a null value in column "HO_HOME_PROVINCE_STD". Any advice would be very much appreciated.
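To see how pandas treats the two problem cases mentioned (a zero and a null standard deviation), here is a small sketch with made-up distances. The column names follow the post; the province labels and values are hypothetical, and `transform` is used here in place of the separate merges purely to keep the example short.

```python
import numpy as np
import pandas as pd

# Hypothetical data: P2 has identical values (std = 0), P3 has one row (std = NaN)
df = pd.DataFrame({
    "HOME_PROVINCE": ["P1", "P1", "P1", "P2", "P2", "P3"],
    "EUCLIDEAN_HOME_OFFICE_DIST_M": [100.0, 200.0, 300.0, 50.0, 50.0, 70.0],
})

grouped = df.groupby("HOME_PROVINCE")["EUCLIDEAN_HOME_OFFICE_DIST_M"]
df["HO_HOME_PROVINCE_MEAN"] = grouped.transform("mean")
df["HO_HOME_PROVINCE_STD"] = grouped.transform("std")  # sample std (ddof=1)

# pandas does not raise on division by zero: 0 / 0 becomes NaN without any message
df["HO_HOME_PROVINCE_ZSCORE"] = (
    df["EUCLIDEAN_HOME_OFFICE_DIST_M"] - df["HO_HOME_PROVINCE_MEAN"]
) / df["HO_HOME_PROVINCE_STD"]

zscores = df["HO_HOME_PROVINCE_ZSCORE"]
print(df)
```

In this sketch the P2 and P3 rows come out as NaN silently, whereas the SAS log quoted above reports the division by zero, so those rows are a reasonable first place to compare the two outputs. Summary functions such as `.sum()` skip NaN by default in pandas, which may hide such rows when totals are compared.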

danielchoi626 · ‎01-18-2021

Whoops, that was a typo. But yes, I was looking for a way to drop duplicates for the entire dataset and replicate the effects of .drop_duplicates() from Python in SAS. Thank you for your assistance! Your answer was exactly what I was looking for.

danielchoi626 · ‎01-18-2021

I am trying to do Exploratory Data Analysis with SAS by following the steps laid out in the following article.

Article: https://towardsdatascience.com/exploratory-data-analysis-in-python-c9a77dfa39ce
Dataset: https://www.kaggle.com/CooperUnion/cardataset

Dropping duplicate rows with .drop_duplicates() in Python drops a total of 989 rows, while dropping null values using NODUP, NODUPKEY or NODUPREC leaves substantially fewer rows (around 300~400).

    PROC SORT DATA = PRACTICE.CARS NODUPKEY;
        BY ENGINE_HP ENGINE_CYLINDERS;
    RUN;

I'd very much appreciate some pointers on how to drop duplicates correctly.

EDIT: I meant dropping duplicate rows
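The size of the gap is consistent with comparing whole rows in one case and only two key columns in the other. A minimal pandas sketch of that difference, using a tiny made-up table (the `MAKE` column and all values are hypothetical; `ENGINE_HP` and `ENGINE_CYLINDERS` are the BY variables from the snippet above):

```python
import pandas as pd

# Hypothetical rows: two makes share the same engine specification
cars = pd.DataFrame({
    "MAKE": ["BMW", "BMW", "Audi", "Audi", "Audi"],
    "ENGINE_HP": [300, 300, 300, 220, 220],
    "ENGINE_CYLINDERS": [6, 6, 6, 4, 4],
})

# Whole-row comparison: only rows identical in every column are dropped
full_row = cars.drop_duplicates()

# Key-only comparison: rows that merely share HP and cylinders are dropped too
by_key = cars.drop_duplicates(subset=["ENGINE_HP", "ENGINE_CYLINDERS"])

print(len(cars), len(full_row), len(by_key))  # 5 3 2
```

Deduplicating on two key columns removes every car that shares an engine specification with another, which is far more aggressive than removing exact duplicate rows.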


Re: Not sure how I should be writing a delinquent days calculation pro...

Re: Not sure how I should be writing a delinquent days calculation pro...

Re: Not sure how I should be writing a delinquent days calculation pro...

Not sure how I should be writing a delinquent days calculation program...

Not sure how to use loops in SAS.

is there a way to drop duplicate rows while keeping rows that are not ...

Python and SAS returning different values

Re: Dropping duplicate rows

Dropping duplicate rows

Re: is there a way to drop duplicate rows while keeping rows that are ...
