01-07-2014 05:51 AM
Hey SAS Folks,
Happy new year!
The following archiving joins code is running awfully slowly and taking ages to finish, so I am seeking an alternative. Can I get help with a more efficient, faster approach, perhaps hashes?
The input datasets are pretty large, with over 15 million records and 20+ variables.
PROC SQL;
CREATE TABLE output1 AS
SELECT A.*, B.balance AS bal_1
FROM table1 A
LEFT JOIN table2 B
ON A.var1 = B.var1 AND A.var2 = B.var2 AND A.var3 = B.var3;

CREATE TABLE output2 AS
SELECT A.*
FROM output1 A
LEFT JOIN table3 B
ON A.var1 = B.var1 AND A.var2 = B.var2;

CREATE TABLE output3 AS
SELECT A.*, B.balance AS bal_2
FROM output2 A
LEFT JOIN table4 B
ON A.var1 = B.var1 AND A.var2 = B.var2 AND A.var3 = B.var3 AND A.var4 = B.var4;

CREATE TABLE output4 AS
SELECT A.*, B.balance AS bal_3
FROM output3 A
LEFT JOIN table5 B
ON A.var1 = B.var1 AND A.var2 = B.var2 AND A.var3 = B.var3
AND A.var4 BETWEEN B.var5 AND B.var6
AND A.var6 BETWEEN B.var7 AND B.var8
AND A.var7 BETWEEN B.var9 AND B.var11;
QUIT;
I'd sincerely appreciate any help. Thanks.
01-07-2014 07:16 AM
From looking at your original SQL code, the reason it's taking an awfully long time is that your entire Table1 (A) is being read four times!
Without knowing how big the other tables are (Table2-Table5), hashes may not do the trick. On the other hand, if these tables are properly indexed, then rewriting the query in the following way could yield better performance:
CREATE TABLE output4 AS
SELECT T1.*,
T2.balance AS bal_1,
T4.balance AS bal_2,
T5.balance AS bal_3
FROM table1 T1
LEFT JOIN table2 T2
ON T1.var1 = T2.var1
AND T1.var2 = T2.var2
AND T1.var3 = T2.var3
LEFT JOIN table3 T3
ON T1.var1 = T3.var1
AND T1.var2 = T3.var2
LEFT JOIN table4 T4
ON T1.var1 = T4.var1
AND T1.var2 = T4.var2
AND T1.var3 = T4.var3
AND T1.var4 = T4.var4
LEFT JOIN table5 T5
ON T1.var1 = T5.var1
AND T1.var2 = T5.var2
AND T1.var3 = T5.var3
AND T1.var4 BETWEEN T5.var5 AND T5.var6
AND T1.var6 BETWEEN T5.var7 AND T5.var8
AND T1.var7 BETWEEN T5.var9 AND T5.var11;
Note: Table1 in this query is only read once!
Hope this helps,
01-07-2014 10:07 AM
I agree with Ahmed. The rule is to reduce full table scans as much as possible: you want to read each table once, and read only the records you need. Also, for larger tables, indexes can provide an amazing performance boost. They really don't take as much space as some people seem to think. If your code is run frequently, create an index on the join variable(s) of each table. Mike Raithel wrote a great book named "The Complete Guide to SAS Indexes". You can also use Lex Jansen's web site to find a paper on the subject. Good luck!
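To make that concrete (the dataset and key names below are just placeholders taken from the example in this thread), a composite index on the equi-join keys can be built with PROC DATASETS:

```sas
/* Sketch only: table2 and the key names come from the example above */
proc datasets library=work nolist;
   modify table2;
   /* Composite index on the three keys used in the equi-join */
   index create keyidx = (var1 var2 var3);
quit;
```

With the index in place, PROC SQL can use it for the join lookups instead of scanning table2 in full; running with `options msglevel=i;` makes the log report whether an index was actually used.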
01-07-2014 10:16 AM
I indeed agree with both of you, and I have done that too, which certainly does help in reducing the time. I appreciate it. However, it's still taking some time, and my intention is to optimise further. Right now, I am going through a paper by Paul Dorfman on hashes. I have plenty of memory to load very large tables into a hash object, so I want to take a shot with hashes. Since my understanding is taking a while, I am kind of slow.
Many thanks though,
Charlotte from England
01-07-2014 10:33 AM
Also, what SAS version/OS are you running, and how much memory are we talking about?
If you have a monster computer stuck on 32-bit SAS 9.2, you still won't be able to use all that memory for hashing.
01-07-2014 10:29 AM
Hi, I will look into it. Great thinking! Thanks! Any in-memory processing ideas get me very excited, particularly for this solution.
Cheers from England,
01-07-2014 10:30 AM
Maybe this is an issue with anonymizing your variable names, but the second query
CREATE TABLE output2 AS
FROM output1 A
actually does nothing but copy output1 into output2. Assuming the variables kept are different and actually add data to output1, then you could most definitely improve the process by combining output1-3 within a single data step (and thus a single read of the data), using hash objects to get the related column(s) from table B. It shouldn't be too big an issue memory-wise, since the hashes would only have 1-3 keys and 1-2 columns based on your example. However, if your real case involves significantly more columns, you might run out of memory using a hash solution.
I can provide a hash example for output1-3 after you correct the example for output2 that I just pointed out.
However, output4 is not hashable in a strict sense because your matching rules allow ranges, and sadly, odds are that this is by far the slowest join. If your ranges have no overlap in B, it could potentially be achieved through a hash of hashes in an extremely efficient way, but only if you know that's the case with your data. Range matching means doing a fuzzy lookup in a hash table, and since the hash object is not designed for that, it becomes far more tedious to accomplish.
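Just to illustrate why range matching is tedious with a hash (a rough sketch only, using the names from the example and assuming a single range key with non-overlapping ranges in table5): you can load the ranges into an ordered hash and scan them with a hash iterator, but that scan is linear rather than a direct lookup. A real solution would first partition the ranges by the equi-join keys, which is the hash-of-hashes idea.

```sas
data matched;
   if 0 then set table1 table5;           /* set up variable attributes */
   if _n_ = 1 then do;
      /* Ordered hash of the ranges in table5, keyed by the lower bound */
      declare hash h5(dataset: 'table5', ordered: 'a');
      h5.defineKey('var5');               /* lower bound */
      h5.defineData('var5', 'var6', 'balance');
      h5.defineDone();
      declare hiter i5('h5');
   end;
   set table1;
   bal_3 = .;
   /* Walk the ranges in ascending order of lower bound */
   rc = i5.first();
   do while (rc = 0);
      if var5 <= var4 <= var6 then do;    /* var4 falls inside this range */
         bal_3 = balance;
         rc = 1;                          /* stop the scan */
      end;
      else if var4 < var5 then rc = 1;    /* past all candidate ranges */
      else rc = i5.next();
   end;
   drop rc var5 var6 balance;
run;
```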
01-07-2014 11:37 AM
You can try this, and see if it will work for you:
DATA output4 (DROP=balance t5_var5 t5_var6 t5_var7 t5_var8 t5_var9 t5_var11);
if 0 then
SET table1 table2 table3 table4 table5; /* set up all variable attributes, avoiding LENGTH statements */
LENGTH bal_1 bal_2 8;
if _n_ = 1 then do;
declare hash ht2(dataset: 'table2', hashexp: 8); /* load Table2 into memory */
ht2.defineKey('var1', 'var2', 'var3');
ht2.defineData('balance');
ht2.defineDone();
declare hash ht4(dataset: 'table4', hashexp: 8); /* load Table4 into memory */
ht4.defineKey('var1', 'var2', 'var3', 'var4');
ht4.defineData('balance');
ht4.defineDone();
end;
/* Assuming table1 and table5 are both sorted (or indexed) by (var1, var2, var3) */
MERGE table1 (IN=t1)
      table5 (KEEP=var1 var2 var3 var5 var6 var7 var8 var9 var11 balance
              RENAME=(balance=bal_3 var5=t5_var5 var6=t5_var6
                      var7=t5_var7 var8=t5_var8 var9=t5_var9 var11=t5_var11));
BY var1 var2 var3;
if t1; /* keep all records from Table1 (left join) */
call missing(bal_1, bal_2); /* initialize */
if ht2.find() = 0 then bal_1 = balance; /* get balance from Table2 */
if ht4.find() = 0 then bal_2 = balance; /* get balance from Table4 */
/* Set bal_3 to missing if the range conditions are not met */
if not ((t5_var5 <= var4 <= t5_var6) and
        (t5_var7 <= var6 <= t5_var8) and
        (t5_var9 <= var7 <= t5_var11)) then bal_3 = .;
RUN;
01-07-2014 12:49 PM
Thank you so much for the very prompt response and your gracious effort in helping me with a solution, and more importantly for your valuable time.
Ahmed, I will certainly try your solution as soon as I step into the office tomorrow and let you know, because I have just come back home after the day's work (it's 6pm here). I sincerely appreciate it.
Vincent, first off I would like to apologise for not having provided better simulated variable information. Basically, I am using SAS Enterprise Guide (the latest version, if I am not wrong) in a multi-user environment at a financial services company. I don't quite know the exact configuration and its technicalities, but we certainly enjoy an enormous cushion in terms of memory. This part of the SAS script archives data using table1 as the master input table, to which new variables are added from the other tables on left joins. Each join adds certain important variables, such as customer balance information or account renewals, for instance. Technically, we are joining to fetch new variables from different fact tables (each holding different information related to the entity) onto the master, and hence you see those multiple left joins one after another, where the output of each goes in as the input to the next. Finally, the resultant large dataset holds all the necessary (appended) information and becomes the base for further stat analysis, deciling, scoring and what not.
Of course, as you noticed, the join keys for each step are not necessarily the same, and I guess that is perhaps the pain: a.id=b.id, a.market_id=b.market_id, a.date=b.date, and in some cases BETWEEN conditions et al. Hmmm, a new variable added by one of the joining datasets (e.g. table B) can also happen to become a join key for the next join.
The SQL joins, one after the other, do the job, but by the time they finish I could probably enjoy a coffee break and come back. If I am still not explaining this well, I will try to present some simulated code as soon as I am at the office tomorrow. Also, please let me know if I can provide more information to make it easier for you.
Do I make sense, or am I sounding crazy?
Thank you for your VALUABLE time,
Cheers from England,
01-07-2014 01:55 PM
The memory that a single EG client can utilize during a session is capped through the config parameter -MEMSIZE. The default installation for clients is to have only 2G. This is to prevent a single EG client's task from consuming such a large chunk of memory that all of the ongoing processes crash with an out-of-memory error (that is, crashing your own as well as all other server users' jobs).
I have little to no knowledge of server administration, but here at StatCan, I was told by the chief admin for our EG servers (and the upcoming grid servers) that it is reasonably easy to expand the -MEMSIZE option for a single user, either permanently or short term, as well as to possibly dedicate a time window where other processes are queued rather than run concurrently, to ensure no server crash occurs when a single large process is run in memory. However, it does add some considerations to your problem as far as balancing your requirements against those of the business as a whole.
Nonetheless, with the standard 2G MEMSIZE option for a client, it is possible to achieve a lot through hashing, just maybe not all that you want.
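You can check what your own session is actually allowed from within EG (standard SAS options, nothing site-specific):

```sas
/* Report the memory-related limits in effect for this session,
   including MEMSIZE, to the log */
proc options group=memory;
run;
```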
I will point out that Ahmed's solution, whilst syntactically correct, assumes no multidata. That is, it assumes there is never more than one occurrence of a unique key combination in the different datasets. Don't get me wrong, it appears possible to work around this limitation with reasonably simple modifications, but just bear that in mind when you test.
To give you a rough estimate of the memory requirements:
(15M records) * (n variables) * (8 bytes/variable) / (1024^3 bytes/GB) ≈ n * 0.11 GB
With this many records, it can skyrocket if you have to keep, say, a 200-byte string, but under normal install conditions, provided you don't replicate all the keys for every data point (multiple hashes all with the same keys) but rather create a hash of hashes, you could possibly even fit the very last join, saving you two PROC SORTs (one each on table1 and table5) to get the merge through.
01-08-2014 06:55 AM
You are absolutely correct. Also, what I want to achieve is difficult considering the number of joins could rise in the future, as the archiving continues to append more history datasets during every quarterly refresh (or that's what I have been told*).
Ahmed's help is very useful, as I could make it work by tweaking it a little to the needs of the real situation, but then I realised it won't be just those tables as at present; going forward there will be many more. I do thank you, Ahmed and Vincent, for such help in quick time and for explaining it inside out.
My intent was to experiment with something fancy yet effective, but I guess my approach isn't bingo, for the simple reason that making it complex doesn't always help, particularly in this situation. /* Hashes of hashes are beyond my skill to compete with my own self at this point, considering I started working with SAS just 6 months ago. But I hope I will get to learn at a faster rate. */
My boss at work isn't too concerned about the long time the SQL takes to execute; however, it was my thought to make it better.
May I ask: had you been in my situation, what would have been your approach? Would you have stuck with the present process, or can you suggest an alternative at your own convenience, perhaps an advanced lookup in a combination of steps? That would be good too.
Nevertheless, I know the value of your time, so it's all good. Thanks very much indeed.
I hope you are not freezing in Canada. Have a great day.
Cheers from England,
01-08-2014 08:05 AM
Funny enough, I was in a similar situation to yours. I joined a company as a SAS contractor, and they had several processes that did exactly what you are trying to do: lots of table joins involving tables of 100+ million medical records. The original code had lots of SORT + MERGE steps, and each process was taking several hours on a very powerful UNIX server!
I just couldn't accept leaving it the way it was, so after analyzing the jobs and understanding the characteristics of the tables and their entity relationships (1-0, 1-1, 1-many, many-many), I was able to:
- Get rid of all the SORT steps
- Add required Indexes
- Perform all my joins in a single data step, using hashes, merges and formats
Basically, this reduced the number of times I read my 100+ million record tables.
Don't get discouraged; you are doing the right thing, and your initial pain and uphill learning path will pay huge dividends down the line. I have been using and practicing SAS for 17 years, and I'm still learning something new every single day.
Keep up the good work, and never accept the norm ;-)
01-08-2014 08:22 AM
For what it's worth, I've only been a SAS user for a year and have been into hash objects for probably 3-4 months, and I have a math/stat/actuarial background, not IT. I am naturally fascinated by programming, though, and had a good grasp of object-oriented programming from a mandatory C++ class in college. Regardless, it is not as big a monster as it appears. Hash of hashes looks like one at first, but with the example from "Black Belt Hashigana", it kicked in fairly quickly (I can retrieve the SGF paper link later if a Google search doesn't get you there; I think it's also by P. Dorfman).
The utility of hash of hashes when you are getting memory-constrained is twofold: first, you can sort the inner hashes differently from the outer hash. Second, you can key certain variables by a subset of the keys (var1-var3) and another set of data by a larger set that includes the outer keys (var1-var4). It can also help you micromanage memory usage for variables that are repeated a lot.
The bigger issue with hash of hashes, especially as a first dip, is that most companies haven't quite moved towards this newer tool, so the skill is not nearly as transferable. If it is meant to be a chunk of a much larger process with many other people working on it, you might find yourself having to explain it over and over again to each and every one of them. Not to mention that if you want to pass the code to someone else, or quit, or leave for an extended vacation or illness, whoever picks it up might have a lengthy learning curve ahead.
Nonetheless, in an attempt to answer the question "what would you have done?": I would definitely have worked towards a more efficient solution in my spare time if this is a job that is run very often, both to show initiative and creativity to my bosses. I believe it's a good opportunity to get visibility, even more so in a fairly new work environment (I assume, since you've only been using SAS for 6 months, that you've only been at this particular job for the same time). Whether and to what extent I would have used hash objects is a different question that I can't quite answer without better knowledge of the data itself. Size is one thing, but the presence of multiple instances of a given key also has an impact, as do the actual join keys and their order.
For example, if you have 15M records but a final INNER JOIN (instead of left) would yield only a few hundred thousand data points, hashes could be used with great efficiency by simply subsetting the data in one pass and throwing the subset into hash objects afterwards, because then memory would be a non-issue. Similarly, it may be possible to load only a subset of each table into hashes and then clear them for each value of a new BY variable (e.g. imagine your data is presorted by year because it's a longitudinal database of transactions of some sort, and one of your matching keys is the year; you could load only subsets of data as needed into hashes over a single pass on the data and gain huge efficiency without getting into memory allocation issues).
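That last idea might look roughly like this. It is a hypothetical sketch, not your actual tables: it assumes a dataset named master sorted by year, and a dataset named lookup with (year, id, balance) and one row per (year, id) pair.

```sas
data result;
   if 0 then set master lookup;   /* set up variable attributes */
   declare hash h;
   set master;
   by year;
   if first.year then do;
      if _n_ > 1 then h.delete();  /* free the previous year's slice */
      /* (Re)build the hash from only the current year's rows of lookup */
      h = _new_ hash(dataset: cats('lookup(where=(year=', year, '))'));
      h.defineKey('id');
      h.defineData('balance');
      h.defineDone();
   end;
   if h.find() ne 0 then balance = .;   /* left-join semantics */
run;
```

At any moment only one year's worth of lookup rows is in memory, so the peak memory footprint is bounded by the largest BY group rather than the whole table.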
Anyway, good luck!
01-08-2014 10:37 AM
Charlotte, I'd rework the code to read table1 only once to get an immediate performance improvement. Next, I'd index the source tables on the join key(s). Together, I think that will make a significant difference in performance.
The basic uses of SAS hash objects are very simple to understand and code. I'm glad to hear that you're not making the mistake of ignoring the newer techniques that have come out in the past couple of years. I work with too many people who have let their SAS skill set become stale, and so we as a company are missing out on really leveraging the software's capabilities and maximizing our computing power.