I am not sure why you are saying this: "The problem that I cannot import all of the csv files in one step because too much memory. "
Unless you have restrictions on disk space, memory shouldn't be an issue, and a data step can read multiple files at one time. I am also a bit concerned by your use of "import", since it suggests you might have been considering PROC IMPORT to read these files. With that many files, PROC IMPORT would almost certainly produce differences between the data sets that cause problems when combining them. If you have a proper description of the files, then writing a data step to read them is not difficult.
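For example, a minimal sketch of such a data step for a single daily file (the path, the column names, and the informats are all assumptions; firstobs=2 assumes a header row):
/* sketch only - path, column names and informats are assumptions */
data daily;
  infile '/path/to/obligation_20220606.csv' dsd dlm=',' truncover firstobs=2;
  input customer_id :$8. obli_date :yymmdd10. obligation;
  format obli_date date9.;
run;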
There also shouldn't really be an issue with "I thought about using proc append to append all daily data sets and then merge it with "Have data set" but I think that it is too many rows ( 3 million×520)?" If you read enough of this forum you will find questions with larger data sets. It just means that there will likely be some elapsed time when working with largish data sets.
Proliferation of data sets is seldom conducive to clean code, maintenance of code or even keeping track of where you are in a project.
I have a strong suspicion that your Option 2 with "there will have 520 new columns with obligation_YYMMDD" is a very sub-optimal approach. See all the discussions on this forum regarding "wide" vs "long" data structures. You would be writing a lot of code trying to use 520 variables, which gets cumbersome to maintain and understand.
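For illustration, if you did end up with such a wide table, this is roughly what it takes to get it back into a long layout (a sketch only: wide_have, the obligation_YYMMDD column naming, and one row per customer_id are assumptions):
/* sketch only - wide_have and the obligation_YYMMDD columns are assumptions */
proc sort data=wide_have;
  by customer_id;
run;
proc transpose data=wide_have
               out=long_want(rename=(col1=obligation))
               name=_colname;
  by customer_id;
  var obligation_:;
run;
data long_want;
  set long_want;
  /* recover the date from the column name, e.g. obligation_220603 */
  obli_date=input(scan(_colname,-1,'_'), yymmdd6.);
  format obli_date date9.;
  drop _colname;
run;
The long table with columns customer_id, obli_date, and obligation is the shape everything below works with.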
Some dummy example data and the desired result might help.
The main issue I see is the concatenation of the 520 csv files into one dataset containing the daily data.
What do the values in the daily datasets look like? Does the obligation change daily, or will it be constant over a span of days?
My approach would be to create one dataset out of the daily files, then merge it with the loan data, and write only one observation per loan once the necessary data is found.
I was asking for constant obligation values because these would allow us to reduce the number of observations on import.
Since they're not, the hash approach suggested by @Patrick is what I would also do.
From what you describe, your loans table should easily fit into a hash table. If so, the code below should work and perform reasonably well.
data loans;
infile datalines dsd dlm=',' truncover;
input customer_id $ loan_date:date9.;
format loan_date date9.;
datalines;
a,06Jun2022
b,07Jun2022
c,08Jun2022
;
data working_days;
infile datalines dsd dlm=',' truncover;
input seq_no wrkd_date:date9.;
format wrkd_date date9.;
datalines;
1,02Jun2022
2,03Jun2022
3,06Jun2022
4,07Jun2022
5,08Jun2022
;
/* this ds just for demo - in real use case infile statement in later data step */
data obligations;
infile datalines dsd dlm=',' truncover;
input customer_id $ obli_date:date9. obligation;
format obli_date date9.;
datalines;
a,03Jun2022,101
a,06Jun2022,101
b,02Jun2022,201
b,03Jun2022,201
b,07Jun2022,203
c,03Jun2022,301
c,06Jun2022,302
c,07Jun2022,303
c,08Jun2022,304
;
data prev_want(keep=customer_id loan_date obli_date obligation /* and the other columns from loans */);
  if _n_=1 then
    do;
      /* loans lookup keyed by customer_id */
      if 0 then set loans;
      dcl hash h_loans(dataset:'loans');
      h_loans.defineKey('customer_id');
      h_loans.defineData(all:'y');
      h_loans.defineDone();
      /* working-day calendar: date -> sequence number */
      if 0 then set working_days;
      dcl hash h_wrkd_seq(dataset:'working_days');
      h_wrkd_seq.defineKey('wrkd_date');
      h_wrkd_seq.defineData('seq_no');
      h_wrkd_seq.defineDone();
      /* working-day calendar: sequence number -> date */
      dcl hash h_wrkd(dataset:'working_days');
      h_wrkd.defineKey('seq_no');
      h_wrkd.defineData('wrkd_date');
      h_wrkd.defineDone();
    end;
  call missing(of _all_);

  /* this data set just to make the sample code work */
  set obligations;
  /* in real use case infile statement: assumed source file names sort by date */
  /* infile 'path to files/name*.csv';*/
  /* input <columns>;*/

  if h_loans.find()=0 then
    do;
      if 0<=loan_date-obli_date<=6 then
        do;
          /* obligation for loan date */
          if loan_date=obli_date then
            do;
              output;
              _rc=h_loans.remove();
            end;
          /* obligation one and two working days in the past */
          else
            do;
              /* get sequence number of working day for loan_date */
              _rc=h_wrkd_seq.find(key:loan_date);
              /* get working day one and two days in the past */
              do i=1 to 2;
                _rc=h_wrkd.find(key:seq_no-i);
                if wrkd_date=obli_date then
                  do;
                    output;
                    leave;
                  end;
              end;
            end;
        end;
    end;
run;

/* select per customer_id the two most recent obli_dates */
proc sort data=prev_want;
  by customer_id descending obli_date;
run;

data want(drop=_:);
  set prev_want;
  by customer_id;
  if first.customer_id then _n=1;
  else _n+1;
  if _n<=2 then output;
run;

proc print data=want;
run;
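For reference, with the sample data above the PROC PRINT should show two rows per customer (the obligation on the loan date itself plus the most recent earlier working-day match), roughly:
customer_id    loan_date    obli_date    obligation
a              06JUN2022    06JUN2022           101
a              06JUN2022    03JUN2022           101
b              07JUN2022    07JUN2022           203
b              07JUN2022    03JUN2022           201
c              08JUN2022    08JUN2022           304
c              08JUN2022    07JUN2022           303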
If your .csv files all have a header row and/or don't follow a naming convention that makes them sort by date, then some extensions to the proposed code will be necessary. If that's the case for you, then ideally share sample .csv files that match the sample data above.
You can use a * wildcard in the INFILE statement for SAS to read all the matching .csv files in sequence.
IF your .csv files follow a naming convention that sorts them by date - like <name>_YYYYMMDD.csv - AND there is no header row in the files, then things will work as proposed.
If the files don't sort by date or there is a header row, then please post (attach) a representative sample. The proposed code will only require a few tweaks to work in such a case.
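For illustration, a sketch of such a wildcard read that also skips a header row in every file (the path, the file name pattern, the column names, and the informats are assumptions):
/* sketch only - path, file name pattern, column names and informats are assumptions */
data all_obligations;
  infile '/path/to/obligation_*.csv' dsd dlm=',' truncover eov=new_file;
  input @;
  /* eov flags the first line of every file after the first; _n_=1 covers the first file */
  if _n_=1 or new_file then
    do;
      new_file=0;
      delete;
    end;
  input customer_id :$8. obli_date :yymmdd10. obligation;
  format obli_date date9.;
run;
If there is no header row you can drop the input @/delete logic and keep just the wildcard in the infile statement.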
@Ronein wrote:
So as I understand after read the obligation files I will have a data set which will contain more than 1 billion rows???
No, not at all. The data step creating prev_want reads all the .csv files in sequence but only outputs "matching" records. If your loans table has a million rows, then you get at most 3 million rows in the output table prev_want.
Because we process the .csv files sequentially from oldest to newest, and because, based on your description, there might not be a matching record for loan_date minus one business day, the code needs to select matching rows for the previous business days -1 and -2 ...and that's why some post-processing is required. But that then only happens on a table with few columns and at most 3 million rows.
To add to the above: there won't be an obligations table. That table is only required to make the sample code fully working, because you didn't provide sample data including .csv files for multiple days. In your real code you won't create an obligations table; you will use the infile/input syntax provided in the comments.
@Ronein wrote:
Honestly ,I hate hash table. Is there another way?
Hash tables are a very powerful "tool" when it comes to creating performant code. With the data volumes you're dealing with, performance is important. It may be worth your while to spend a bit of time understanding the code I've shared.
I won't spend time proposing a different approach only because you "hate hash tables", and I'm rather curious whether someone else can propose something that performs better without the use of hash tables (with only two columns one could create a format, sketched below, but with a million customers that's certainly not better than a hash table).
Any approach that first needs to create a SAS table with 520*2M rows is imho sub-optimal.
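For completeness, the format alternative mentioned above would look something like this (a sketch using the loans sample data; note the extra character conversions, one reason it doesn't beat a hash here):
/* sketch only - build a customer_id -> loan_date lookup format from loans */
data loans_fmt;
  set loans;
  retain fmtname 'loanf' type 'C';
  start=customer_id;
  label=put(loan_date, date9.);  /* cntlin labels must be character */
  keep fmtname type start label;
run;
proc format cntlin=loans_fmt;
run;
/* usage while reading the daily files; ?? suppresses notes for unmatched ids */
/* loan_date=input(put(customer_id, $loanf.), ?? date9.); */
An unmatched customer_id falls through as its own value, so the input() with the ?? modifier just returns a missing date.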
@Ronein wrote:
How many rows will have in obligations data set that was created here?
How many lines do you count in the DATALINES block?
Or simply run the step in your SAS session and read the log.