01-28-2015 06:23 AM
I have 4 datasets, and the number of datasets is bound to grow. My objective is to join or merge them into one in an efficient way, using the keys below:
Dataset 1 has 1.1 million records; the key to be used in the join is Var1.
Dataset 2 has 1.5 million records; the key to be used in the join is Var3.
Dataset 3 has 1.3 million records; the key to be used in the join is Var5.
Dataset 4 has 1.7 million records; the key to be used in the join is Var8.
1. I want to avoid any possible Cartesian product
2. Avoiding a sort would be cool
3. The output (the joined dataset) should contain, at a minimum, all records from Dataset 1
4. The datasets are bound to grow in size, so scalability to handle large datasets will help
I'd appreciate a demonstrated example.
Please and Thanks,
01-28-2015 06:44 AM
Optimization is a large subject, so you can get some suggestions here, learn from the literature, etc. But you need to do the actual testing and ongoing maintenance with your specific data.
So, keeping all records from dataset1 means you need to use a SQL left join, or a data step MERGE.
The data step cannot really be optimized: the inputs need to be sorted on the BY variables, so if the source data is not already in that sort order, it is quite expensive to sort it every time.
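As a sketch only (dataset and variable names taken from the original post, untested against real data), the left join keeping every Dataset1 row could look like this:

```sas
/* Left-join sketch: every row of Dataset1 is kept, and each     */
/* lookup table is matched on its own key against var1.          */
/* One-to-one matches on the keys avoid any Cartesian blow-up.   */
proc sql;
  create table joined as
  select a.*, b.Category2, c.Category3, d.Category4
  from Dataset1 a
    left join Dataset2 b on a.var1 = b.var3
    left join Dataset3 c on a.var1 = c.var5
    left join Dataset4 d on a.var1 = d.var8;
quit;
```

Note that PROC SQL may still sort or hash internally; the point is that you do not have to pre-sort the inputs yourself.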
How often do you need to do this?
How often are the source tables updated?
Perhaps you could work with some kind of change data capture technique?
You could also look at the storage of the source data, if that is in your power. Is it in SAS? Moving the data to SPDE can speed up table scans.
If your situation works best with SQL, try to index the join columns. Use the _METHOD option on PROC SQL together with OPTIONS MSGLEVEL=I; to examine the SQL planner's behaviour.
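A minimal sketch of that combination (names borrowed from the original post; whether the planner actually uses the indexes depends on your data):

```sas
/* Create simple indexes on the join keys of the lookup tables. */
proc datasets library=work nolist;
  modify Dataset2; index create var3;
  modify Dataset3; index create var5;
  modify Dataset4; index create var8;
quit;

options msglevel=i;   /* log notes on index usage              */
proc sql _method;     /* log the planner's chosen join methods */
  create table joined as
  select a.*, b.Category2
  from Dataset1 a left join Dataset2 b on a.var1 = b.var3;
quit;
```

The log then shows codes such as sqxjhsh (hash join) or sqxjndx (index join), which tells you whether the index was picked up.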
You name four different key variables. Do they hold the same values but under different names, or are there other join keys?
If all the key variables are the "same", you could still benefit from a data step MERGE; otherwise SQL is my best guess.
Please attach small amount of example data to visualize your situation.
01-28-2015 07:07 AM
To clarify your question, you should post some data and the output you need, so that people can better understand what you are talking about.
data Dataset1;
  infile cards truncover;
  input var1 Age Buy;
  cards;
1 45
2 50
3 70
4 80
5 33
6 20
;
data Dataset2;
  input var3 Category2 & $40.;
  cards;
3 Frozen food2
;
run;
data Dataset3;
  input var5 Category3 & $40.;
  cards;
5 Frozen food3
;
run;
data Dataset4;
  input var8 Category4 & $40.;
  cards;
6 Frozen food4
;
run;
data want;
  if _n_ eq 1 then do;
    if 0 then set Dataset2;
    declare hash h2(dataset:'Dataset2',hashexp:20);
    h2.definekey('var3');
    h2.definedata('Category2');
    h2.definedone();
    if 0 then set Dataset3;
    declare hash h3(dataset:'Dataset3',hashexp:20);
    h3.definekey('var5');
    h3.definedata('Category3');
    h3.definedone();
    if 0 then set Dataset4;
    declare hash h4(dataset:'Dataset4',hashexp:20);
    h4.definekey('var8');
    h4.definedata('Category4');
    h4.definedone();
  end;
  call missing(of _all_);
  set Dataset1;
  rc=h2.find(key:var1);
  rc=h3.find(key:var1);
  rc=h4.find(key:var1);
  drop rc var3 var5 var8;
run;
Message was edited by: xia keshan
01-28-2015 07:17 AM
You must be careful when using hash tables, since they are loaded into RAM, and the input data sets seem quite large; that could trigger out-of-memory conditions.
01-28-2015 11:05 AM
Not when the smallest of the tables has over a million records? For larger data, it would have to be SQL with a combination of indexes, I think, or some other method. Hash objects are relatively new, and their performance has mostly been tested on relatively small tables (in the papers I have read).
01-28-2015 07:44 AM
Where is the data actually stored? If you get these from a database/warehouse, you may be better off doing the processing there and getting an output file.
Also, I completely agree with Linus Hjorth: optimization is a big subject with plenty of options. First off, once you have done the full join once, can you not just extract the data that differs from the previous set (i.e. incremental updating)? That would save you time on subsequent runs.
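A rough sketch of that incremental idea, assuming the source tables carry a change-tracking timestamp (here called last_modified, which is an assumption, not something from the original post):

```sas
/* Incremental-extract sketch: pull only rows changed since the  */
/* last run, instead of re-joining all millions of records.      */
/* The cutoff value and the last_modified column are assumed.    */
%let cutoff = '28JAN2015:00:00:00'dt;   /* timestamp of last run */

data Dataset1_delta;
  set Dataset1;
  where last_modified > &cutoff;
run;
```

You would then join only the delta tables and append the result to the previously joined output.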
01-28-2015 11:18 AM
I would go with Xia Keshan approach (Using Hash Objects) if all the tables were stored in SAS tables/data sets.
You can control the memory allocated to your SAS job/session/process using the -MEMSIZE xxxM option on the invocation command line.
SAS comes with a default setting specified in the *.cfg (SAS configuration) file, and using -MEMSIZE will override that default.
Nowadays, the minimum requirement for installing SAS v9.x is 4 GB RAM per CPU/core, and that's plenty of RAM!
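To check what ceiling your session actually has (the invocation path below is illustrative only):

```sas
/* Invocation sketch: sas -memsize 4G -sysin myjob.sas           */
/* Inside a running session, report the current MEMSIZE setting  */
/* to the log:                                                   */
proc options option=memsize value;
run;
```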
Otherwise, if your tables are scattered around and stored in mixed environments (SAS, database, flat files, etc.), then I would recommend you standardize your storage platform first, then use the best manipulation technique(s).
Just my 2 cents,
01-28-2015 11:22 AM
4 tables with millions of records, growing... and we don't know the record length.
MEMSIZE is one thing, but the physical memory needs to be there, whatever your MEMSIZE setting says.
We need more info from the OP before we can make progress.
01-29-2015 05:01 AM
You made me figure out a new fast method, an old-school way: SET + BY.
Hash vs. SET. I don't know which would win, but I'd put money on old school.
data Dataset1;
  infile cards truncover;
  input var1 Age Buy;
  cards;
1 45
2 50
3 70
4 80
5 33
6 20
;
data Dataset2;
  input var3 Category2 & $40.;
  cards;
3 Frozen food2
;
run;
data Dataset3;
  input var5 Category3 & $40.;
  cards;
5 Frozen food3
;
run;
data Dataset4;
  input var8 Category4 & $40.;
  cards;
6 Frozen food4
;
run;
data want;
  set Dataset4(rename=(var8=var1))
      Dataset3(rename=(var5=var1))
      Dataset2(rename=(var3=var1))
      Dataset1(in=ina);
  by var1;
  length c2 c3 c4 $ 40;  /* widened from $ 20 to match the $40 category values */
  retain c2 c3 c4;
  if first.var1 then call missing(c2,c3,c4);
  c2=coalescec(c2,Category2);
  c3=coalescec(c3,Category3);
  c4=coalescec(c4,Category4);
  if last.var1 and ina;
  drop Category2 Category3 Category4;
run;
01-28-2015 11:40 AM
The information you have provided is not sufficient to decide on an approach that takes the least run time and is capable of growing over time.
First, what is the data type of the key variables (var1, var3, var5 and var8)? What are their maximum and minimum values across your 4 data sets (this has relevance to using ARRAYS)?
My advice is to provide sample data sets (stating the data types) and show the output data set that you derive from them.