11-17-2013 08:08 AM
Hey SAS Folks,
I am in need of some rather efficient logic, perhaps using a hash object rather than typical joins, for my problem. I have Dataset A and Dataset B with 100 and 80 variables respectively. I need to combine A and B with a left join under the following conditions:
1. variable1 of A = variable1 of B
2. variable2 of A = variable4 of B
3. variable5 of A = variable5 of B
4. pick only the last 20 variables from Dataset B
Let's call the resultant dataset AB.
Now, I need to left join dataset AB with Dataset C on conditions similar to the above and create the joined dataset ABC. The process then repeats: ABC is combined with Dataset D, again with a left join, to get the result ABCD, with new join and filter conditions of the same kind as above. So basically the dataset keeps growing, if I have explained it well enough!!
As you can see, it's easy to keep writing join query after join query, which is highly I/O intensive and inefficient. I would appreciate a solution, which I believe can be done with a hash object or perhaps some other method, that gives the most efficient result, because each of my datasets is around 20 million records.
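For what it's worth, here is a minimal sketch of how one step of such a left join could look with a hash object. All variable names are placeholders: I am assuming the three join keys are variable1/variable2/variable5 on the A side matching variable1/variable4/variable5 on the B side (as in the list above), and that the last 20 variables of B are named b_var61-b_var80 - substitute your real names, and note that this assumes A has no variable of its own named variable4 (otherwise rename first).

```sas
/* Sketch: left-join A against B via a hash object, keeping only the    */
/* last 20 variables of B. b_var61-b_var80 are placeholder names.       */
data AB;
  /* Define host variables for B's data portion without reading B */
  if 0 then set B(keep=variable4 b_var61-b_var80);
  if _n_ = 1 then do;
    /* Load only the join keys and the 20 wanted variables into memory */
    declare hash hB(dataset:
      "B(keep=variable1 variable4 variable5 b_var61-b_var80)");
    hB.defineKey("variable1", "variable4", "variable5");
    hB.defineData(all: "yes");
    hB.defineDone();
  end;
  set A;
  /* Map A's key name onto B's before the lookup (A.variable2 = B.variable4) */
  variable4 = variable2;
  /* Left join: on no match, keep the A row with missing B variables */
  if hB.find() ne 0 then call missing(of b_var61-b_var80);
run;
```

The same pattern would then repeat for C and D (a new hash per step, or several hash objects declared in one data step if memory allows, which is where the hash approach can save passes over the growing table).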
Any help would be most appreciated,
Charlotte from England
11-17-2013 11:26 AM
The problem is still not well defined in:
- The limitations you have: hardware, machine, logical limitations, the operating system you are using.
Some hardware with multiple processors (you have that) can be usefully exploited by some SAS versions. Which do you use?
How much internal memory do you have? Using hash objects implies building a copy of the dataset in internal memory.
- Why is it not possible to do the join for all datasets ABCD first and limit the variables while joining?
- Are all variables of A kept while you are limiting them in B?
- As you are mentioning I/O, I would guess the datasets with 20M records are about 2 GB in size?
You are looking for turnaround (elapsed) time optimization?
Coding the joins with SQL is easy coding. The processes can multithread effectively using multiple cores, but SQL will possibly do random I/O (slow, much overhead).
Coding the joins in a SAS data step (possibly with a hash) is more work but gives more control.
With all datasets ordered, a MERGE statement with some additional logic can process the sequential data effectively.
If you have all data on different DASD, or virtualized with a large read-ahead cache, this will perform well.
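To illustrate the sorted-merge alternative mentioned above, here is a sketch of the first step. Variable names are placeholders taken from the question (B's last 20 variables assumed to be b_var61-b_var80); B's variable4 is renamed so the BY keys line up, which assumes B has no variable2 of its own. MERGE handles 1:1 and 1:many cleanly, but not many:many.

```sas
/* Sketch: sort both inputs on the join keys, then left-join with MERGE. */
proc sort data=A;
  by variable1 variable2 variable5;
run;

proc sort data=B(keep=variable1 variable4 variable5 b_var61-b_var80
                 rename=(variable4=variable2))
          out=B_sorted;
  by variable1 variable2 variable5;
run;

data AB;
  merge A(in=inA) B_sorted;
  by variable1 variable2 variable5;
  if inA;   /* left join: keep every row of A */
run;
```

The cost is the sorts, but the merge itself is a single sequential pass, which is where the read-ahead caching pays off.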
What have you done so far?
11-17-2013 04:26 PM
In addition to what Jaap already said:
Hash tables get loaded into memory. Considering the size of your tables, do you have enough memory for this?
"ABCD with new join and filter conditions..."
If you can't use hash tables then it's either MERGE or SQL, possibly using indexes. Which approach is most efficient will largely depend on your actual data, and we would need a good understanding of it to come up with a valid proposition.
You say that the result table will be growing. Is this simply because you're adding variables or will there also be additional rows (so 1:many relationships between the tables)?
Maybe you could give us some sample data, e.g. by posting 3 data steps creating data representative of what you're actually dealing with.
As I understand you the join condition varies (different variables and different logic) between the tables.