Ronein
Meteorite | Level 14
Hello,
This is a general question.
I have a data set ("Have data set") with loans taken by customers over the last 2 years, from July 2020 until now (this table has, for example, 1 million rows). The fields in the data set are:
Customer_Id, date, amount, interest.

There are also daily data sets with information on the total obligation each customer has to the bank. They include the following fields: Customer_Id, date, obligation.
The daily data sets are published every business day, so over the last 2 years there are around 520 data sets (roughly 260 business days per year). Each daily data set contains around 2 million rows.
Please note that the daily data sets are CSV files, so they also need to be imported into SAS data sets in order to work with them.
The problem is that I cannot import all of the CSV files in one step because it takes too much memory.

For each row in the loans data set ("Have data set") I want to add the obligation before the loan and the obligation after the loan, as follows:
For obligation before:
The obligation one business day before the date of the loan, and if that does not exist, then 2 business days before.
For obligation after:
The obligation on the same date as the loan.
For example, for a loan taken on a Wednesday, obligation_before would be Tuesday's value (or Monday's if Tuesday is missing) and obligation_after would be Wednesday's value.

There is another data set that contains only one field, called date; it holds the list of business dates.

My question:
What is the most efficient way to merge the "Have data set" with the daily data sets?
Option 1:
I thought about using proc append to append all the daily data sets and then merging the result with the "Have data set", but I think that is too many rows (2 million × 520)?

Option 2:
Merge the "Have data set" with each daily data set, so there will be 520 new columns named obligation_YYMMDD.
Then keep only the relevant 2 fields and rename them to obligation_before and obligation_after.

Option 3:
Split the "Have data set" into multiple data sets by date of loan, merge each of them with the relevant daily data set, and then append the resulting data sets.


What is the best technique to merge the tables to get the desired data set?
17 REPLIES
ballardw
Super User

I am not sure why you are saying this: "The problem is that I cannot import all of the CSV files in one step because it takes too much memory."

Unless you have restrictions on disk space, memory shouldn't be an issue, and a data step can read multiple files at one time. I am also a bit concerned by your use of "import": you might have been considering proc import to read these files. With that many files, proc import would almost certainly produce differences between the data sets that cause problems when combining them. If you have a proper description of the files then writing a data step to read them is not difficult.
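
For illustration, here is a minimal sketch of reading one daily file with a data step instead of proc import. The path, the file name, the lengths and informats, and the assumption of a header row are all guesses about what the files look like:

/* minimal sketch - adjust path, informats and lengths to the actual files */
data daily_obligations;
  infile '/path/to/files/obligations_20220708.csv'
         dsd dlm=',' truncover firstobs=2;  /* firstobs=2 assumes a header row */
  input customer_id :$20.
        obli_date   :date9.
        obligation;
  format obli_date date9.;
run;

Because every variable gets an explicit informat, all 520 files produce identically structured data sets - something proc import cannot guarantee.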

 

There also shouldn't really be an issue with "I thought about using proc append to append all the daily data sets and then merging the result with the "Have data set", but I think that is too many rows (2 million × 520)?" If you read enough of this forum you will find questions with larger data sets. It just means that there will likely be some elapsed time when working with largish data sets.

 

Proliferation of data sets is seldom conducive to clean code, code maintenance, or even keeping track of where you are in a project.

 

I have a strong suspicion that your Option 2, with "there will be 520 new columns named obligation_YYMMDD", is a very sub-optimal approach. See all the discussions on this forum regarding "wide" vs "long" data. You would be writing a lot of code trying to use 520 variables, which gets cumbersome to maintain and understand.

 

 

Some dummy example data and the desired result might help.

Kurt_Bremser
Super User

The main issue I see is the concatenation of the 520 CSV files into one dataset containing the daily data.

What do the values in the daily datasets look like? Does the obligation change daily, or will it be constant over a span of days?

 

My approach would be to create one dataset out of the daily files, then merge it with the loan data, and write only one observation per loan once the necessary data is found.
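
A rough sketch of that idea in SQL, assuming the daily files have already been concatenated into a hypothetical table all_obligations with the column names used in the sample data further down:

proc sql;
  create table loan_obligations as
  select l.*,
         o.obli_date,
         o.obligation
  from loans as l
       inner join all_obligations as o
       on  l.customer_id = o.customer_id
       and o.obli_date between l.loan_date - 6 and l.loan_date;
quit;

The 6-day calendar window covers two business days back even across a weekend; selecting the obligation_before and obligation_after rows per loan would still be a post-processing step.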

Ronein
Meteorite | Level 14
All the obligation daily data sets have the same variables in the same order.
It is essential to work with all the daily data sets because the obligation can change daily.
The question is how I can union 520 data sets with 2 million rows each...
Ronein
Meteorite | Level 14
Thank you.
Are you talking about regular concatenation of the data sets using proc append, where the resulting data set will have 1 billion rows? Is that number of rows even possible?
Kurt_Bremser
Super User

I was asking about constant obligation values because these would allow us to reduce the number of observations on import.

Since they're not constant, the hash approach suggested by @Patrick is what I would also do.

 

Patrick
Opal | Level 21

From what you describe, your loans table should easily fit into a hash table. If so, then the code below should work and perform reasonably well.

data loans;
  infile datalines dsd dlm=',' truncover;
  input customer_id $ loan_date:date9.;
  format loan_date date9.;
  datalines;
a,06Jun2022
b,07Jun2022
c,08Jun2022
;

data working_days;
  infile datalines dsd dlm=',' truncover;
  input seq_no wrkd_date:date9.;
  format wrkd_date date9.;
  datalines;
1,02Jun2022
2,03Jun2022
3,06Jun2022
4,07Jun2022
5,08Jun2022
;

/* this ds just for demo - in real use case infile statement in later data step */
data obligations;
  infile datalines dsd dlm=',' truncover;
  input customer_id $ obli_date:date9. obligation;
  format obli_date date9.;
  datalines;
a,03Jun2022,101
a,06Jun2022,101
b,02Jun2022,201
b,03Jun2022,201
b,07Jun2022,203
c,03Jun2022,301
c,06Jun2022,302
c,07Jun2022,303
c,08Jun2022,304
;

data prev_want(keep=customer_id loan_date obli_date obligation /* and the other columns from loans */);
  if _n_=1 then
    do;
      if 0 then set loans;
      dcl hash h_loans(dataset:'loans');
      h_loans.defineKey('customer_id');
      h_loans.defineData(all:'y');
      h_loans.defineDone();
      
      if 0 then set working_days;
      dcl hash h_wrkd_seq(dataset:'working_days');
      h_wrkd_seq.defineKey('wrkd_date');
      h_wrkd_seq.defineData('seq_no');
      h_wrkd_seq.defineDone();

      dcl hash h_wrkd(dataset:'working_days');
      h_wrkd.defineKey('seq_no');
      h_wrkd.defineData('wrkd_date');
      h_wrkd.defineDone();

    end;
  call missing(of _all_);

  /* this data set to make sample code work */
  set obligations;
  /* in real use case infile statement: assumed source file names sort by date */
/*  infile 'path to files/name*.csv';*/
/*  input <columns>;*/

  if h_loans.find()=0 then
    do;
      if 0<=loan_date-obli_date<=6 then 
        do;
          /* obligation for loan date */
          if loan_date=obli_date then 
            do;
              output;
              _rc=h_loans.remove();
            end;
          /* obligation one and two working days in the past */
          else
            do;
              /* get sequence number of working day for loan_date */
              _rc=h_wrkd_seq.find(key:loan_date);
              /* get working day one and two days in the past */
              do i=1 to 2;
                _rc=h_wrkd.find(key:seq_no-i);
                if wrkd_date=obli_date then 
                  do;
                    output;
                    leave;
                  end;
              end;
            end;
        end;
    end;
run;

/* select per customer_id the two most recent obli_dates */
proc sort data=prev_want;
  by customer_id descending obli_date;
run;
data want(drop=_:);
  set prev_want;
  by customer_id;
  if first.customer_id then _n=1;
  else _n+1;
  if _n<=2 then output;
run;

proc print data=want;
run;

If your .csv files all have a header row and/or don't follow a naming convention that makes them sort by date, then some extensions to the proposed code will be necessary. If that's the case for you, then ideally share sample .csv files that match the sample data above.

Ronein
Meteorite | Level 14
Sorry, I didn't understand this step.
There are 520 obligation tables.
Which of them should I read here?

/* this ds just for demo - in real use case infile statement in later data step */
data obligations;
  infile datalines dsd dlm=',' truncover;
  input customer_id $ obli_date:date9. obligation;
  format obli_date date9.;
  datalines;
a,03Jun2022,101
a,06Jun2022,101
b,02Jun2022,201
b,03Jun2022,201
b,07Jun2022,203
c,03Jun2022,301
c,06Jun2022,302
c,07Jun2022,303
c,08Jun2022,304
;
Patrick
Opal | Level 21

You can use a * wildcard in the INFILE statement for SAS to read all the matching .csv files in sequence.

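In the prev_want data step above, the set obligations; line would then be replaced by something like this (a sketch only; the path, the file name pattern and the informats are assumptions):

/* replaces "set obligations;" in the prev_want data step above */
infile '/path/to/files/obligations_*.csv' dsd dlm=',' truncover;
input customer_id :$20.
      obli_date   :date9.
      obligation;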

IF your .csv files follow a naming convention that sorts them by date - like <name>_YYYYMMDD.csv - AND there is no header row in the files, then things will work as proposed.

If the files don't sort by date, or there is a header row, then please post (attach) a representative sample. The proposed code will only require a few tweaks to work in such a case.
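
One of those tweaks could look like the following sketch (untested against your files): the EOV= variable is set to 1 when the first record of each subsequent file in the wildcard list is read, so together with _N_=1 it flags every header line. The same lines work just as well inside the prev_want step:

data obligations_all;
  infile '/path/to/files/obligations_*.csv' dsd dlm=',' truncover eov=new_file;
  input @;                   /* hold the line so it can be inspected first      */
  if _n_ = 1 or new_file then
    do;                      /* first line overall, or first line of a new file */
      new_file = 0;          /* EOV is not reset automatically                  */
      delete;                /* drop the header record                          */
    end;
  input customer_id :$20.
        obli_date   :date9.
        obligation;
  format obli_date date9.;
run;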

 

Ronein
Meteorite | Level 14
So as I understand it, after reading the obligation files I will have a data set containing more than 1 billion rows???
Patrick
Opal | Level 21

@Ronein wrote:
So as I understand it, after reading the obligation files I will have a data set containing more than 1 billion rows???

No, not at all. The data step that creates prev_want reads all the .csv files in sequence but only outputs "matching" records. If your loans table has a million rows, then you get at most 3 million rows in the output table prev_want.

Because we process the .csv files sequentially from oldest to newest, and based on your description there is a possibility that for loan_date minus one business day there is no matching record, the code needs to select matching rows for the previous business days -1 and -2 ... and that's why some post-processing is required. But that then runs only on a table with few columns and at most 3 million rows.

 

To add to the above: there won't be an obligations table. That table is only required for the sample code to be fully working, because you didn't provide sample data including .csv files for multiple days. In your real code you won't create an obligations table; instead you will use the infile/input syntax provided in the comments.

Ronein
Meteorite | Level 14
Honestly, I hate hash tables. Is there another way?
Patrick
Opal | Level 21

@Ronein wrote:
Honestly, I hate hash tables. Is there another way?

Hash tables are a very powerful "tool" when it comes to writing performant code. With the data volumes you're dealing with, performance is important. It may be worth your while to spend a bit of time understanding the code I've shared.

I won't spend time proposing a different approach only because you "hate hash tables", and I'm rather curious whether someone else can propose something that will perform better without the use of hash tables (with only two columns one could create a format - but with a million customers that's certainly not better than a hash table).

Any approach that first needs to create a SAS table with 520 × 2M rows is IMHO sub-optimal.

Ronein
Meteorite | Level 14
I want to ask again about this step below:
data obligations;
  infile datalines dsd dlm=',' truncover;
  input customer_id $ obli_date:date9. obligation;
  format obli_date date9.;
  datalines;
a,03Jun2022,101
a,06Jun2022,101
b,02Jun2022,201
b,03Jun2022,201
b,07Jun2022,203
c,03Jun2022,301
c,06Jun2022,302
c,07Jun2022,303
c,08Jun2022,304
;

This is the step just before creating the hash table.
How many rows will the obligations data set created here have?
Kurt_Bremser
Super User

@Ronein wrote:

How many rows will the obligations data set created here have?


How many lines do you count in the DATALINES block?

Or simply run the step in your SAS session and read the log.

