Hi everybody!
I have two large datasets that lies in the companys database and I want to join these to another dataset that I have from before in SAS.
These two large datasets contains information about the same account on different dates.
What I want to do is join the latest record of a account from both datasets to the first dataset, but I find that the ways I have done this to now is very inefficient and takes forever.
Is there a good way to do this efficient in proc sql?
This is the way I have tried to this for only one dataset is something like this, but this is very slow:
proc sql;
create table Loans as
select
a.*,
b.*
FROM customer_information as a inner join
( Select *
From Business.LOAN_INFORMATION
Group by ACCOUNTNR
Having DATE_OF_RECORD=max(DATE_OF_RECORD) ) as b
on a.ACCOUNTNR = b.ACCOUNTNR;
;
quit;
Let's say you have three datasets A, B, and C. A is sorted by ID/DATE_FROM_A, B is sorted by ID/DATE_FROM_B, and C by ID/DATE_FROM_C.
Youi want the latest obs from A and from B, but all the obs from C:
data want;
set a (in=ina keep=id)
b (in=inb keep=id)
c (in=inc) ;
by id ;
if first.id then call missing(of _all_);
if ina then set a;
if inb then set b;
if inc;
run;
Just to be a devil's advocate, what definition of "most efficient" are you using?
Several that come to mind as possibilities:
Run using the least CPU
Run using the least memory/disk space/network bandwidth
Take the least time to program
Take the fewest lines of code
There are others.
You may want to consider that if your time programming the solution results in $XXXX of man-hour costs you may have spent way more than additional CPU Disk/Memory impacts for a "small" process, small meaning few records and/or steps needed. For very large data sets that spend a lot of time pushing things around a network you may want to consider that impact more than nice code. Trade-offs abound with this sort of problem. Such as how large is "large". We see people think their data is large with a few thousand records and 50 variables. Then there are the terabyte sized datasets...
Sometimes the fewest lines of code may be the most difficult to understand when you look at them later.
Poor choice of word from my part. I was thinking mostly of least time to run.
And I understand your point.
I have found that the effort to find the how to write the program I want is not something I prioritize for now, but if there is a nice and effective way to do this, I'm very much interested for later projects.
How many rows are there in each table? Is the company database an external RDBMS? If you are joining between a SAS table and an RDBMS then either SAS has to read all of the RDBMS down to SAS and join there, or load the SAS table into the RDBMS and join there. Typically you should load the smaller table into the larger table's environment for best performance.
Good news: We've extended SAS Hackathon registration until Sept. 12, so you still have time to be part of our biggest event yet – our five-year anniversary!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.