<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Merge loans data set with daily  data sets in SAS Programming</title>
    <link>https://communities.sas.com/t5/SAS-Programming/Merge-loans-data-set-with-daily-data-sets/m-p/823720#M325278</link>
    <description>Honestly, I hate hash tables. Is there another way?</description>
    <pubDate>Sun, 17 Jul 2022 14:42:51 GMT</pubDate>
    <dc:creator>Ronein</dc:creator>
    <dc:date>2022-07-17T14:42:51Z</dc:date>
    <item>
      <title>Merge loans data set with daily  data sets</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Merge-loans-data-set-with-daily-data-sets/m-p/823677#M325248</link>
      <description>Hello&lt;BR /&gt;This is a general question.&lt;BR /&gt;I have a data set ("Have data set") with loans taken by customers over the last 2 years, from July 2020 until now (this table has, for example, 1 million rows). The fields in the data set are:&lt;BR /&gt;Customer_Id, date, amount, interest.&lt;BR /&gt;&lt;BR /&gt;There are also daily data sets with information&lt;BR /&gt;on the total obligation each customer has to the bank. They include the following fields: Customer_Id, date, obligation.&lt;BR /&gt;The daily data sets are published every business day, so over the last 2 years there are around 520 data sets (around 260 business days per year). Each daily data set contains around 2 million rows.&lt;BR /&gt;Please note that the daily data sets are CSV files, so they also need to be imported into SAS data sets in order to work with them.&lt;BR /&gt;The problem is that I cannot import all of the CSV files in one step because of memory limits.&lt;BR /&gt;&lt;BR /&gt;For each row in the loans data set ("Have data set") I want to add the obligation before the loan and the obligation after the loan. It should be done in the following way:&lt;BR /&gt;For the obligation before:&lt;BR /&gt;The obligation one business day before the date of the loan, and if that does not exist, then 2 days before.&lt;BR /&gt;For the obligation after:&lt;BR /&gt;The obligation on the same date as the loan.&lt;BR /&gt;&lt;BR /&gt;There is another data set that contains only one field called date; these are the business dates.&lt;BR /&gt;&lt;BR /&gt;My question:&lt;BR /&gt;What is the most efficient way to merge the "Have data set" with the daily data sets?&lt;BR /&gt;Option 1-&lt;BR /&gt;I thought about using proc append to append all daily data sets and then merge the result with the "Have data set", but I think that is too many rows (3 million × 520)?&lt;BR /&gt;&lt;BR /&gt;Option 2-&lt;BR /&gt;Merge the "Have data set" with each daily data set, so there will be 520 new columns named obligation_YYMMDD. Then keep only the relevant 2&lt;BR /&gt;fields and rename them to obligation_before and obligation_after.&lt;BR /&gt;&lt;BR /&gt;Option 3-&lt;BR /&gt;Split the "Have data set" into multiple data sets by loan date, merge each one with the relevant daily data set, and then append the resulting data sets.&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;What is the best technique to merge the tables to get the desired data set?</description>
      <pubDate>Sun, 17 Jul 2022 05:28:46 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Merge-loans-data-set-with-daily-data-sets/m-p/823677#M325248</guid>
      <dc:creator>Ronein</dc:creator>
      <dc:date>2022-07-17T05:28:46Z</dc:date>
    </item>
    <item>
      <title>Re: Merge loans data set with daily  data sets</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Merge-loans-data-set-with-daily-data-sets/m-p/823683#M325254</link>
      <description>&lt;P&gt;I am not sure why&amp;nbsp; you are saying this: "The problem that I cannot import all of the csv files in one step because too much memory. "&lt;/P&gt;
&lt;P&gt;Unless you have restrictions on disk space, memory shouldn't be an issue, and a data step can read multiple files at one time. I am also a bit concerned by your use of "import": you might have been considering proc import to read these. With that many files, that would almost certainly result in differences between data sets that cause problems when combining data. If you have a proper description of the files, then writing a data step to read them is not difficult.&lt;/P&gt;
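&lt;P&gt;As a minimal sketch of that multi-file read (the path, file pattern, and informats here are assumptions, not your actual layout), one data step with a wildcard can read every matching same-layout CSV in sequence:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;/* hypothetical path and layout - adjust to your files */
data daily_all;
  infile "/data/daily/obligation_*.csv" dsd dlm=',' truncover;
  input customer_id :$12. date :yymmdd10. obligation;
  format date date9.;
run;&lt;/CODE&gt;&lt;/PRE&gt;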
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;There also shouldn't really be an issue with "I thought about using proc append to append all daily data sets and then merge it with "Have data set" but I think that it is too many rows ( 3 million×520)?"&amp;nbsp; If you read enough of this forum you will find questions with larger data sets. It just means that some elapsed time is likely when working with largish data sets.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Proliferation of data sets is seldom conducive to clean code, code maintenance, or even keeping track of where you are in a project.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I have a strong suspicion that your Option 2 with "there will be 520 new columns named obligation_YYMMDD" is a very sub-optimal approach. See all the discussions on this forum regarding "wide" vs. "long" data. You would be writing a lot of code trying to use 520 variables, which gets cumbersome to maintain or understand.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Some dummy example data and the desired result might help.&lt;/P&gt;</description>
      <pubDate>Sun, 17 Jul 2022 08:44:46 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Merge-loans-data-set-with-daily-data-sets/m-p/823683#M325254</guid>
      <dc:creator>ballardw</dc:creator>
      <dc:date>2022-07-17T08:44:46Z</dc:date>
    </item>
    <item>
      <title>Re: Merge loans data set with daily  data sets</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Merge-loans-data-set-with-daily-data-sets/m-p/823685#M325256</link>
      <description>&lt;P&gt;The main issue I see is the concatenation of the 520 csv files into one dataset containing the daily data.&lt;/P&gt;
&lt;P&gt;What do the values in the daily datasets look like? Does the obligation change daily, or will it be constant over a span of days?&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;My approach would be to create one dataset out of the daily files, then merge it with the loan data, and write only one observation per loan once the necessary data is found.&lt;/P&gt;</description>
      <pubDate>Sun, 17 Jul 2022 10:24:53 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Merge-loans-data-set-with-daily-data-sets/m-p/823685#M325256</guid>
      <dc:creator>Kurt_Bremser</dc:creator>
      <dc:date>2022-07-17T10:24:53Z</dc:date>
    </item>
    <item>
      <title>Re: Merge loans data set with daily  data sets</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Merge-loans-data-set-with-daily-data-sets/m-p/823706#M325268</link>
      <description>All obligation daily data sets have the same variables in the same order.&lt;BR /&gt;It is essential to work with all daily data sets because the obligation can change daily.&lt;BR /&gt;The question is how I can union 520 data sets with 2 million rows each...</description>
      <pubDate>Sun, 17 Jul 2022 13:17:43 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Merge-loans-data-set-with-daily-data-sets/m-p/823706#M325268</guid>
      <dc:creator>Ronein</dc:creator>
      <dc:date>2022-07-17T13:17:43Z</dc:date>
    </item>
    <item>
      <title>Re: Merge loans data set with daily  data sets</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Merge-loans-data-set-with-daily-data-sets/m-p/823707#M325269</link>
      <description>&lt;P&gt;From what you describe, your loans table should easily fit into a hash table. If so, then the code below should work and perform reasonably well.&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data loans;
  infile datalines dsd dlm=',' truncover;
  input customer_id $ loan_date:date9.;
  format loan_date date9.;
  datalines;
a,06Jun2022
b,07Jun2022
c,08Jun2022
;

data working_days;
  infile datalines dsd dlm=',' truncover;
  input seq_no wrkd_date:date9.;
  format wrkd_date date9.;
  datalines;
1,02Jun2022
2,03Jun2022
3,06Jun2022
4,07Jun2022
5,08Jun2022
;

/* this ds just for demo - in real use case infile statement in later data step */
data obligations;
  infile datalines dsd dlm=',' truncover;
  input customer_id $ obli_date:date9. obligation;
  format obli_date date9.;
  datalines;
a,03Jun2022,101
a,06Jun2022,101
b,02Jun2022,201
b,03Jun2022,201
b,07Jun2022,203
c,03Jun2022,301
c,06Jun2022,302
c,07Jun2022,303
c,08Jun2022,304
;

data prev_want(keep=customer_id loan_date obli_date obligation /* and the other columns from loans */);
  if _n_=1 then
    do;
      if 0 then set loans;
      dcl hash h_loans(dataset:'loans');
      h_loans.defineKey('customer_id');
      h_loans.defineData(all:'y');
      h_loans.defineDone();
      
      if 0 then set working_days;
      dcl hash h_wrkd_seq(dataset:'working_days');
      h_wrkd_seq.defineKey('wrkd_date');
      h_wrkd_seq.defineData('seq_no');
      h_wrkd_seq.defineDone();

      dcl hash h_wrkd(dataset:'working_days');
      h_wrkd.defineKey('seq_no');
      h_wrkd.defineData('wrkd_date');
      h_wrkd.defineDone();

    end;
  call missing(of _all_);

  /* this data set to make sample code work */
  set obligations;
  /* in real use case infile statement: assumed source file names sort by date */
/*  infile 'path to files/name*.csv';*/
/*  input &amp;lt;columns&amp;gt;;*/

  if h_loans.find()=0 then
    do;
      if 0&amp;lt;=loan_date-obli_date&amp;lt;=6 then 
        do;
          /* obligation for loan date */
          if loan_date=obli_date then 
            do;
              output;
              _rc=h_loans.remove();
            end;
          /* obligation one and two working days in the past */
          else
            do;
              /* get sequence number of working day for loan_date */
              _rc=h_wrkd_seq.find(key:loan_date);
              /* get working day one and two days in the past */
              do i=1 to 2;
                _rc=h_wrkd.find(key:seq_no-i);
                if wrkd_date=obli_date then 
                  do;
                    output;
                    leave;
                  end;
              end;
            end;
        end;
    end;
run;

/* select per customer_id the two most recent obli_dates */
proc sort data=prev_want;
  by customer_id descending obli_date;
run;
data want(drop=_:);
  set prev_want;
  by customer_id;
  if first.customer_id then _n=1;
  else _n+1;
  if _n&amp;lt;=2 then output;
run;

proc print data=want;
run;
&lt;/CODE&gt;&lt;/PRE&gt;
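&lt;P&gt;If the files do have a header row, one common pattern for skipping it per file in a wildcard read is sketched below (using the placeholder path from the commented infile statement above; the EOV= variable is set when the first record of each subsequent file is read, and it must be reset manually):&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data all_days;
  infile 'path to files/name*.csv' dsd dlm=',' truncover eov=newfile;
  input @;                       /* hold the line */
  if _n_=1 or newfile then do;   /* first line of each file = header */
    newfile=0;
    delete;
  end;
  input customer_id :$12. obli_date :date9. obligation;
  format obli_date date9.;
run;&lt;/CODE&gt;&lt;/PRE&gt;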
&lt;P&gt;If your .csv files all have a header row and/or don't follow a naming convention that makes them sort by date, then some extensions to the above code will be necessary. If that's the case for you, then ideally share sample .csv's that match the above sample data.&lt;/P&gt;</description>
      <pubDate>Sun, 17 Jul 2022 15:15:52 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Merge-loans-data-set-with-daily-data-sets/m-p/823707#M325269</guid>
      <dc:creator>Patrick</dc:creator>
      <dc:date>2022-07-17T15:15:52Z</dc:date>
    </item>
    <item>
      <title>Re: Merge loans data set with daily  data sets</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Merge-loans-data-set-with-daily-data-sets/m-p/823709#M325270</link>
      <description>Sorry, I didn't understand this step.&lt;BR /&gt;There are 520 tables of obligations.&lt;BR /&gt;Which table should I read here?&lt;BR /&gt;&lt;BR /&gt;/* this ds just for demo - in real use case infile statement in later data step */&lt;BR /&gt;data obligations;&lt;BR /&gt;  infile datalines dsd dlm=',' truncover;&lt;BR /&gt;  input customer_id $ obli_date:date9. obligation;&lt;BR /&gt;  format obli_date date9.;&lt;BR /&gt;  datalines;&lt;BR /&gt;a,03Jun2022,101&lt;BR /&gt;a,06Jun2022,101&lt;BR /&gt;b,02Jun2022,201&lt;BR /&gt;b,03Jun2022,201&lt;BR /&gt;b,07Jun2022,203&lt;BR /&gt;c,03Jun2022,301&lt;BR /&gt;c,06Jun2022,302&lt;BR /&gt;c,07Jun2022,303&lt;BR /&gt;c,08Jun2022,304&lt;BR /&gt;;</description>
      <pubDate>Sun, 17 Jul 2022 13:34:08 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Merge-loans-data-set-with-daily-data-sets/m-p/823709#M325270</guid>
      <dc:creator>Ronein</dc:creator>
      <dc:date>2022-07-17T13:34:08Z</dc:date>
    </item>
    <item>
      <title>Re: Merge loans data set with daily  data sets</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Merge-loans-data-set-with-daily-data-sets/m-p/823712#M325272</link>
      <description>&lt;P&gt;You can use a * wildcard in the infile statement for SAS to read all the matching .csv in sequence.&lt;/P&gt;
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Patrick_0-1658065188746.png" style="width: 564px;"&gt;&lt;img src="https://communities.sas.com/t5/image/serverpage/image-id/73371i08A4F29F5F53F526/image-dimensions/564x54?v=v2" width="564" height="54" role="button" title="Patrick_0-1658065188746.png" alt="Patrick_0-1658065188746.png" /&gt;&lt;/span&gt;&lt;/P&gt;
&lt;P&gt;IF your .csv files follow a naming convention that sorts them by date - like: &amp;lt;name&amp;gt;_YYYYMMDD.csv - AND there is no header row in the files, then things will work as proposed.&lt;/P&gt;
&lt;P&gt;If the files don't sort by date or there is a header row, then please post (attach) a representative sample. The proposed code will only require a few tweaks to work in such a case.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Sun, 17 Jul 2022 13:43:55 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Merge-loans-data-set-with-daily-data-sets/m-p/823712#M325272</guid>
      <dc:creator>Patrick</dc:creator>
      <dc:date>2022-07-17T13:43:55Z</dc:date>
    </item>
    <item>
      <title>Re: Merge loans data set with daily  data sets</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Merge-loans-data-set-with-daily-data-sets/m-p/823713#M325273</link>
      <description>So, as I understand it, after reading the obligation files I will have a data set that contains more than 1 billion rows?</description>
      <pubDate>Sun, 17 Jul 2022 14:01:37 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Merge-loans-data-set-with-daily-data-sets/m-p/823713#M325273</guid>
      <dc:creator>Ronein</dc:creator>
      <dc:date>2022-07-17T14:01:37Z</dc:date>
    </item>
    <item>
      <title>Re: Merge loans data set with daily  data sets</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Merge-loans-data-set-with-daily-data-sets/m-p/823715#M325275</link>
      <description>&lt;BLOCKQUOTE&gt;&lt;HR /&gt;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/159549"&gt;@Ronein&lt;/a&gt;&amp;nbsp;wrote:&lt;BR /&gt;So as I understand after read the obligation files I will have a data set which will contain more than 1 billion rows???&lt;HR /&gt;&lt;/BLOCKQUOTE&gt;
&lt;P&gt;No, not at all. The data step&amp;nbsp;&lt;EM&gt;data prev_want&lt;/EM&gt;&amp;nbsp;reads all the .csv's in sequence but only outputs "matching" records. If your loans table has a million rows, then you get at most 3 million rows in the output table &lt;EM&gt;prev_want.&lt;BR /&gt;&lt;/EM&gt;&lt;/P&gt;
&lt;P&gt;Because we process the .csv's sequentially from oldest to newest, AND based on your description there might not be a matching record for loan_date minus one business day, the code needs to select matching rows for the previous business days -1 and -2 ...and that's why some post-processing is required. But that runs only on a table with few columns and at most 3 million rows.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;To add to the above: there won't be an obligations table. That table is only required for the sample code to be fully working, because you didn't provide sample data including .csv's for multiple days. In your real code you won't create an obligations table; instead you will use the infile/input syntax provided in the comments.&lt;/P&gt;</description>
      <pubDate>Sun, 17 Jul 2022 14:18:07 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Merge-loans-data-set-with-daily-data-sets/m-p/823715#M325275</guid>
      <dc:creator>Patrick</dc:creator>
      <dc:date>2022-07-17T14:18:07Z</dc:date>
    </item>
    <item>
      <title>Re: Merge loans data set with daily  data sets</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Merge-loans-data-set-with-daily-data-sets/m-p/823719#M325277</link>
      <description>Thank you.&lt;BR /&gt;Are you talking about regular concatenation of data sets using proc append, where the resulting data set will have 1 billion rows? Is that number of rows even possible?</description>
      <pubDate>Sun, 17 Jul 2022 14:41:34 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Merge-loans-data-set-with-daily-data-sets/m-p/823719#M325277</guid>
      <dc:creator>Ronein</dc:creator>
      <dc:date>2022-07-17T14:41:34Z</dc:date>
    </item>
    <item>
      <title>Re: Merge loans data set with daily  data sets</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Merge-loans-data-set-with-daily-data-sets/m-p/823720#M325278</link>
      <description>Honestly, I hate hash tables. Is there another way?</description>
      <pubDate>Sun, 17 Jul 2022 14:42:51 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Merge-loans-data-set-with-daily-data-sets/m-p/823720#M325278</guid>
      <dc:creator>Ronein</dc:creator>
      <dc:date>2022-07-17T14:42:51Z</dc:date>
    </item>
    <item>
      <title>Re: Merge loans data set with daily  data sets</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Merge-loans-data-set-with-daily-data-sets/m-p/823722#M325280</link>
      <description>&lt;BLOCKQUOTE&gt;&lt;HR /&gt;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/159549"&gt;@Ronein&lt;/a&gt;&amp;nbsp;wrote:&lt;BR /&gt;Honestly ,I hate hash table. Is there another way?&lt;HR /&gt;&lt;/BLOCKQUOTE&gt;
&lt;P&gt;Hash tables are a very powerful "tool" when it comes to creating performant code. With the data volumes you're dealing with, performance is important. It may be worth your while to spend a bit of time understanding the code I've shared.&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I won't spend time proposing a different approach only because you "hate hash tables", and I'm rather curious whether someone else can propose something that will perform better without the use of hash tables (with only two columns one could create a format - but with a million customers that's certainly not better than a hash table).&lt;/P&gt;
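&lt;P&gt;For completeness, the format idea would look roughly like the sketch below (dataset and format names are placeholders; it also assumes at most one loan per customer, which is part of why it doesn't scale to this problem):&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;/* sketch: build a character lookup format from a two-column table */
/* assumes one row per customer_id - duplicate keys make proc format fail */
data cntl;
  set loans;
  retain fmtname '$loandt';
  start = customer_id;
  label = put(loan_date, date9.);
run;

proc format cntlin=cntl;
run;

/* usage while reading the .csv's: */
/* loan_date = input(put(customer_id, $loandt.), date9.); */&lt;/CODE&gt;&lt;/PRE&gt;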
&lt;P&gt;Any approach that first would need to create a SAS table with 520*2M rows is imho sub-optimal.&lt;/P&gt;</description>
      <pubDate>Sun, 17 Jul 2022 15:15:06 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Merge-loans-data-set-with-daily-data-sets/m-p/823722#M325280</guid>
      <dc:creator>Patrick</dc:creator>
      <dc:date>2022-07-17T15:15:06Z</dc:date>
    </item>
    <item>
      <title>Re: Merge loans data set with daily  data sets</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Merge-loans-data-set-with-daily-data-sets/m-p/823725#M325282</link>
      <description>&lt;P&gt;I was asking for constant obligation values because these would allow us to reduce the number of observations on import.&lt;/P&gt;
&lt;P&gt;Since they're not, the hash approach suggested by &lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/12447"&gt;@Patrick&lt;/a&gt;&amp;nbsp;is what I would also do.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Sun, 17 Jul 2022 15:11:30 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Merge-loans-data-set-with-daily-data-sets/m-p/823725#M325282</guid>
      <dc:creator>Kurt_Bremser</dc:creator>
      <dc:date>2022-07-17T15:11:30Z</dc:date>
    </item>
    <item>
      <title>Re: Merge loans data set with daily  data sets</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Merge-loans-data-set-with-daily-data-sets/m-p/823726#M325283</link>
      <description>I want to ask again about this step below&lt;BR /&gt;data obligations;&lt;BR /&gt;  infile datalines dsd dlm=',' truncover;&lt;BR /&gt;  input customer_id $ obli_date:date9. obligation;&lt;BR /&gt;  format obli_date date9.;&lt;BR /&gt;  datalines;&lt;BR /&gt;a,03Jun2022,101&lt;BR /&gt;a,06Jun2022,101&lt;BR /&gt;b,02Jun2022,201&lt;BR /&gt;b,03Jun2022,201&lt;BR /&gt;b,07Jun2022,203&lt;BR /&gt;c,03Jun2022,301&lt;BR /&gt;c,06Jun2022,302&lt;BR /&gt;c,07Jun2022,303&lt;BR /&gt;c,08Jun2022,304&lt;BR /&gt;;&lt;BR /&gt;&lt;BR /&gt;This is the step just before creating the hash table.&lt;BR /&gt;How many rows will the obligations data set created here have?&lt;BR /&gt;</description>
      <pubDate>Sun, 17 Jul 2022 15:37:52 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Merge-loans-data-set-with-daily-data-sets/m-p/823726#M325283</guid>
      <dc:creator>Ronein</dc:creator>
      <dc:date>2022-07-17T15:37:52Z</dc:date>
    </item>
    <item>
      <title>Re: Merge loans data set with daily  data sets</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Merge-loans-data-set-with-daily-data-sets/m-p/823729#M325285</link>
      <description>&lt;BLOCKQUOTE&gt;&lt;HR /&gt;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/159549"&gt;@Ronein&lt;/a&gt;&amp;nbsp;wrote:&lt;BR /&gt;&lt;BR /&gt;How many rows will have in obligations data set that was created here?&lt;BR /&gt;&lt;BR /&gt;&lt;HR /&gt;&lt;/BLOCKQUOTE&gt;
&lt;P&gt;How many lines do you count in the DATALINES block?&lt;/P&gt;
&lt;P&gt;Or simply run the step in your SAS session and read the log.&lt;/P&gt;</description>
      <pubDate>Sun, 17 Jul 2022 16:30:12 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Merge-loans-data-set-with-daily-data-sets/m-p/823729#M325285</guid>
      <dc:creator>Kurt_Bremser</dc:creator>
      <dc:date>2022-07-17T16:30:12Z</dc:date>
    </item>
    <item>
      <title>Re: Merge loans data set with daily  data sets</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Merge-loans-data-set-with-daily-data-sets/m-p/823730#M325286</link>
      <description>9 lines&lt;BR /&gt;a,03Jun2022,101&lt;BR /&gt;a,06Jun2022,101&lt;BR /&gt;b,02Jun2022,201&lt;BR /&gt;b,03Jun2022,201&lt;BR /&gt;b,07Jun2022,203&lt;BR /&gt;c,03Jun2022,301&lt;BR /&gt;c,06Jun2022,302&lt;BR /&gt;c,07Jun2022,303&lt;BR /&gt;c,08Jun2022,304&lt;BR /&gt;&lt;BR /&gt;But in real life will it have 520 lines?</description>
      <pubDate>Sun, 17 Jul 2022 16:35:03 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Merge-loans-data-set-with-daily-data-sets/m-p/823730#M325286</guid>
      <dc:creator>Ronein</dc:creator>
      <dc:date>2022-07-17T16:35:03Z</dc:date>
    </item>
    <item>
      <title>Re: Merge loans data set with daily  data sets</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Merge-loans-data-set-with-daily-data-sets/m-p/823731#M325287</link>
      <description>&lt;BLOCKQUOTE&gt;&lt;HR /&gt;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/159549"&gt;@Ronein&lt;/a&gt;&amp;nbsp;wrote:&lt;BR /&gt;9 lines&lt;BR /&gt;a,03Jun2022,101&lt;BR /&gt;a,06Jun2022,101&lt;BR /&gt;b,02Jun2022,201&lt;BR /&gt;b,03Jun2022,201&lt;BR /&gt;b,07Jun2022,203&lt;BR /&gt;c,03Jun2022,301&lt;BR /&gt;c,06Jun2022,302&lt;BR /&gt;c,07Jun2022,303&lt;BR /&gt;c,08Jun2022,304&lt;BR /&gt;&lt;BR /&gt;But in real life will have 520 lines???&lt;HR /&gt;&lt;/BLOCKQUOTE&gt;
&lt;P&gt;Why should it? It's just a "fake" dataset created to illustrate the function of the code presented in the post.&lt;/P&gt;</description>
      <pubDate>Sun, 17 Jul 2022 17:04:09 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Merge-loans-data-set-with-daily-data-sets/m-p/823731#M325287</guid>
      <dc:creator>Kurt_Bremser</dc:creator>
      <dc:date>2022-07-17T17:04:09Z</dc:date>
    </item>
    <item>
      <title>Re: Merge loans data set with daily  data sets</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Merge-loans-data-set-with-daily-data-sets/m-p/823738#M325292</link>
      <description>&lt;BLOCKQUOTE&gt;&lt;HR /&gt;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/159549"&gt;@Ronein&lt;/a&gt;&amp;nbsp;wrote:&lt;BR /&gt;Honestly ,I hate hash table. Is there another way?&lt;HR /&gt;&lt;/BLOCKQUOTE&gt;
&lt;P&gt;HASH tables are not really that hard; they are just "foreign" to the normal SAS way of working.&lt;/P&gt;
&lt;P&gt;The advantage they have is that you can index into them by things like CUSTOMER_ID, which is probably not a simple integer.&lt;/P&gt;
&lt;P&gt;In this case if the set of loans is small enough to fit in memory then you can find the obligations you are looking for with a single pass through the CSV file without having to create a giant SAS dataset.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;The basic idea is to read a line from the CSV file and then use the HASH() object to decide if you need to keep that obligation amount.&amp;nbsp; When you get to the end of the data you can then write the HASH() object back to a dataset.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;You can use your list of business days in two ways.&amp;nbsp; One, you can use it to help you figure out which date is the previous business day (or the business day before that).&amp;nbsp; Two, you can use it to tell you which CSV files you need to read.&amp;nbsp; I assume you can figure out the name of the CSV file based on the date.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;So let's create example LOANS and BDAYS datasets and also a series of CSV files. Click the SPOILER tag to see the code.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;LI-SPOILER&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data loans;
  input customer_id $ date :yymmdd. amount;
  format date yymmdd10.;
datalines;
a 2022-06-06 100
b 2022-06-07 200
c 2022-06-08 300
;

data bdays;
  input date :yymmdd.;
  format date yymmdd10.;
datalines;
2022-06-02
2022-06-03
2022-06-06
2022-06-07
2022-06-08
;

data obligations;
  input customer_id $ date :yymmdd. obligation;
  format date yymmdd10.;
datalines;
b 2022-06-02 201
a 2022-06-03 101
b 2022-06-03 201
c 2022-06-03 301
a 2022-06-06 101
c 2022-06-06 302
b 2022-06-07 203
c 2022-06-07 303
c 2022-06-08 304
;

%let path=%sysfunc(pathname(work));
proc sort data=obligations;
  by date customer_id ;
run;

data _null_;
   set obligations end=eof;
   filename=cats("&amp;amp;path/balance_",put(date,yymmddn8.),'.csv');
   file csv filevar=filename dsd ;
   put customer_id date obligation ;
   if _n_=1 then call symputx('start',date);
   if eof then call symputx('end',date);
run;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;/LI-SPOILER&gt;
&lt;P&gt;Now let's see how to use those datasets and files to make what you want.&lt;/P&gt;
&lt;P&gt;First we need to know the path where to find the CSV files.&amp;nbsp; Also since this code is going to use a temporary array to figure out how to map from current business day to the next business day it will help to have a lower and upper bound on the range of dates.&amp;nbsp;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;%let start=%sysfunc(mdy(6,2,2022)); /* 02Jun2022 - MDY() takes month,day,year */
%let end=%sysfunc(mdy(6,8,2022));   /* 08Jun2022 */
%let path=%sysfunc(pathname(work));&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;Now here is the data step to read the obligations files and collect up to three obligation values per loan: the value on the loan date, on the previous business date, and on the business date before that.&lt;/P&gt;
&lt;P&gt;First it loads the business days into a temporary array that maps a date to the next business date.&lt;/P&gt;
&lt;P&gt;Second it loads the loans into a hash that stores the customer, date, amount, and the three obligations.&lt;/P&gt;
&lt;P&gt;Third it reads the list of dates and for each date reads the corresponding CSV file.&lt;/P&gt;
&lt;P&gt;For each record in the CSV file it checks whether that customer+date matches a loan.&amp;nbsp; If so, it stores the obligation in the hash.&lt;/P&gt;
&lt;P&gt;Then it checks whether the next business day is a loan date for this customer and, if so, stores the obligation.&amp;nbsp; And it repeats the look back (ahead) once more.&lt;/P&gt;
&lt;P&gt;Finally, at the end, it writes the hash to a dataset.&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data _null_;
  if 0 then set bdays loans ;
  if _n_=1 then do;
    array nday [&amp;amp;start:&amp;amp;end] _temporary_;
    do while(not eof);
      set bdays end=eof;
      lag_date=lag(date);
      if not missing(lag_date) then nday[lag_date]=date;
    end;
    declare hash h(ordered:'yes');
    h.definekey('customer_id','date');
    h.definedata('customer_id','date','amount','owe1','owe2','owe3');
    h.definedone();
    do while(not eof1);
      set loans end=eof1;
      h.add();
    end;
  end;

  set bdays end=eof2;
  filename = cats("&amp;amp;path/balance_",put(date,yymmddn8.),'.csv');
  infile csv filevar=filename dsd truncover /*firstobs=2*/ end=eof3;
  do while(not eof3);
    input customer_id date :yymmdd. obligation ;
    if 0=h.find() then do; owe1=obligation; rc=h.replace(); end;
    date=nday[date];
    if not missing(date) then do;
      if 0=h.find() then do; owe2=obligation; rc=h.replace(); end;
      date=nday[date];
      if not missing(date) then do;
        if 0=h.find() then do; owe3=obligation; rc=h.replace(); end;
      end;
    end;
  end;

  if eof2 then rc=h.output(dataset:'loan_obligations');
run;
&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;Results:&lt;/P&gt;
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Tom_0-1658082657752.png" style="width: 999px;"&gt;&lt;img src="https://communities.sas.com/t5/image/serverpage/image-id/73376iA31D9E6751F34CD1/image-size/large?v=v2&amp;amp;px=999" role="button" title="Tom_0-1658082657752.png" alt="Tom_0-1658082657752.png" /&gt;&lt;/span&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Sun, 17 Jul 2022 18:31:10 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Merge-loans-data-set-with-daily-data-sets/m-p/823738#M325292</guid>
      <dc:creator>Tom</dc:creator>
      <dc:date>2022-07-17T18:31:10Z</dc:date>
    </item>
  </channel>
</rss>

