08-08-2014 01:27 AM
I have several large datasets, and try to merge them together. But since they are too big, it will take a long time. I wonder if there are some alternative efficient ways to do this. For instance, I have datasets A(a1 a2 .... a20), B(a1 a2 b1 b2), and C(b1 c1), with variables in (). I want to merge A with B via a1 and a2, and merge C with b1. Eventually I need the final dataset D with variables (a1-a10, b1,b2,c1). But A has over 50m observations and 20 variables. I don't need variables a3-a10 when merge with B and C. If I include a3-a20 in the merge it will take over 30 minutes to merge, which is not desired. If I only keep a1 and a2, it will take less time, but I have to merge it back later.
Is there any efficient way to do this? I heard of some hash thing, but don't know if it's the right way here.
I try to learn both data merge and proc sql methods and want to compare them.
08-08-2014 02:13 AM
The hash method works only works up to the point where the dataset you want to put into the hash exceeds your available memory. So of you want to merge 50m records with 20k, the hash method may be your best choice.
Unless datasets B and C are "small" enough to fit into memory I suggest the following:
(assuming that A is the largest data set)
During generation of dataset A, have it sorted by a1 and a2 as a final step.
Merge B and C first, sort the result by a1 and a2, then do the final merge with A.
Don't do it in one proc sql, that will most probably yield the worst performance.
If you can, try to have different physical volumes for A, the intermediate result of (B,C) and the final data set. This will spread the read/write load, prevent disk contention and allow the system to read/write sequentially during the "big" merge.
30 min for a 50m merge is not _that_ long, by the way.
08-08-2014 02:24 AM
Thank you for your answer. The size of A with 50m observations is around 16GB. My memory is only 8GB. Does that mean I cannot use hash method? B and C are small enough. But in my actual example, I cannot merge B and C first, they both need to merge with A first. I made a mistake in my description in the example. 30 min is not desired for me as I have several merges that will total several hours. Let's why I try to find a better way to merge.
08-08-2014 02:34 AM
From your description, you can't merge C with A because of a lack of corresponding index variables (b1 is not on A).
What keeps you from using my method? Are there additional conditions to be met?
16GB is clearly too much for the hash method, but with the hash method you usually read the smaller data set into the hash and then work sequentially through the larger data set. So it _may_ be possible to put B and C into two hash objects.
08-08-2014 03:48 AM
You can merge A with B and B with C but not A with C. bcs a and c are not contain same by variable.
If you wish to merge A with B and B with C then you can use below SQL code.
create table d as
select a.*, b.b1, c.c1
from a_1 as a, b_1 as b, c_1 as c
where a.a1=b.a1 and b.b1=c.b1;