03-21-2012 08:21 PM
I have come across a situation where I need to merge two datasets, each with 40 million observations.
The details are:
1. Dataset1 contains lot_id and unit_id, plus some other variables. Each lot_id is mapped to more than one unit_id.
2. Dataset2 contains unit_ids and their values.
The task is to obtain the total value for each lot_id by merging the two datasets (by unit_id).
Both datasets are already indexed/sorted by unit_id, which is unique in both datasets.
I only need to select a few variables from each dataset.
At the moment I use the PROC SQL code below:
proc sql;
   create table output as
   select A.lot_id, sum(B.value) as value
   from Dataset1 as A, Dataset2 as B
   where A.unit_id = B.unit_id
   group by A.lot_id;
quit;
I know I could achieve the same result using a MERGE statement and then FIRST. / LAST. processing to obtain the sum of values for each lot_id.
My question is:
What is the most time-efficient way to merge large datasets?
In the near future I will need to perform a similar merge on datasets with about 10 billion observations.
If anyone can point me to a much more time-efficient approach, that would be a great help.
Many thanks in advance
03-22-2012 05:01 AM
OMG, 10 billion! What kind of dataset is that?
If I were you, a hash table would be my first choice.
Given how large your table is, I would split it into many small tables, each containing only one lot_id, then process them one by one.
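A minimal sketch of the hash approach, using the variable names from the original post. It assumes Dataset1 is sorted or grouped by lot_id (the post says both tables are sorted by unit_id, so a PROC SORT by lot_id may be needed first) and that Dataset2's unit_id/value pairs fit in memory:

```sas
data lot_totals (keep=lot_id total_value);
   if _n_ = 1 then do;
      /* define the host variable VALUE with the correct type */
      if 0 then set Dataset2 (keep=unit_id value);
      /* load the lookup table into memory once */
      declare hash h (dataset: 'Dataset2 (keep=unit_id value)');
      h.defineKey('unit_id');
      h.defineData('value');
      h.defineDone();
   end;
   total_value = 0;
   /* read one whole lot per DATA step iteration, summing as we go */
   do until (last.lot_id);
      set Dataset1 (keep=lot_id unit_id);
      by lot_id;
      if h.find() = 0 then total_value + value;
   end;
run;
```

One row per lot_id is output at the bottom of each iteration; additional Dataset1 variables (such as a category) can simply be added to the KEEP= lists, which is harder to do with the format approach.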
03-22-2012 06:32 AM
If the data is that large, you might want to leverage the underlying database. Is the data stored in Oracle, Teradata, or SPDS? Use SAS integrated functions, pass-through SQL, or execute an in-database function.
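For illustration, a pass-through sketch for the join in the original post, assuming the tables lived in Oracle. The connection options and table names are placeholders; the point is that the join and GROUP BY run inside the database, and only the small summary comes back to SAS:

```sas
proc sql;
   /* connection options are placeholders */
   connect to oracle (user=myuser password=mypass path=mydb);
   create table output as
   select * from connection to oracle (
      /* this inner query is executed by the database, not SAS */
      select A.lot_id, sum(B.value) as value
      from Dataset1 A, Dataset2 B
      where A.unit_id = B.unit_id
      group by A.lot_id
   );
   disconnect from oracle;
quit;
```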
03-22-2012 09:43 AM
Definitely worth a try: Create a format from one data set, then apply the format when summarizing the second data set. For example:
data create_format;
   retain fmtname '$lots';
   set dataset1 (keep=lot_id unit_id);
   rename lot_id=label unit_id=start;
run;

proc format cntlin=create_format;
run;

proc summary data=dataset2 NWAY;
   class unit_id;
   format unit_id $lots.;
   var value;
   output out=totals (keep=unit_id value) sum=;
run;
In the output data set TOTALS, there are just two variables, and VALUE is the sum of all VALUEs within each formatted (lot-level) group. The $LOTS format is still required to convert the UNIT_ID into the LOT_ID, for example:
lot_id = put(unit_id, $lots.);
Good luck. Let us know what you find out in terms of speed.
03-22-2012 06:28 PM
Many thanks for all your replies. At the moment I am concentrating on the 40-million-row merge, where my PROC SQL step took just over an hour to complete. I have to repeat the same task for 12 different months, which may take more than 15 hours.
(I am not worried about the 10-billion merge at this point, because there I will be merging a 20-million-row dataset with a 10-billion-row one.)
Thanks to Ksharp; I will try hash tables, as I need to bring a couple more variables from Dataset1 (the lot_ids belong to categories, and I need that information for the analysis).
However, the PROC FORMAT technique suggested by Astounding looks simple and effective to me.
But I don't know how to bring additional variables (e.g., category) from Dataset1 this way without another iteration of PROC FORMAT.
Anyway, I will post the outcomes once I have finished my trials.
To jcbell: The data is on a SAS server, and I don't know how to 'leverage the underlying database'.
03-22-2012 06:55 PM
It's possible to add a few more variables to a single format. But you have to watch the details. Here's how you would begin.
data create_format;
   retain fmtname '$lots';
   length label $ 200;
   set dataset1 (keep=lot_id unit_id plus a few more);
   start = unit_id;
   label = lot_id;
   substr(label, 31) = plus;
   substr(label, 51) = a;
   substr(label, 71) = few;
   substr(label, 91) = more;
   keep start label fmtname;
run;
The idea is to pack additional information into the format, in known positions. If you are adding numeric pieces, you should convert them to character using the PUT function. Once that is done, you need PROC SUMMARY to use only the beginning of the formatted value, not the whole string that includes the added information. So, run PROC FORMAT as before, and then:
proc summary data=dataset2 NWAY;
   class unit_id;
   format unit_id $lots20.;
   var value;
   output out=totals (keep=unit_id value) sum=;
run;
In this step, the only difference is specifying $LOTS20., which tells SAS how many characters of the $LOTS format to apply. Then in the DATA step, get all the related information:
data final;
   set totals;
   length long_string $ 200 lot_id plus a few more $ 20;
   long_string = put(unit_id, $lots200.);
   lot_id = substr(long_string, 1, 20);
   plus = substr(long_string, 31, 20);
   a = substr(long_string, 51, 20);
   few = substr(long_string, 71, 20);
   more = substr(long_string, 91, 20);
   drop long_string unit_id;
run;
Of course, you have to adjust the length you assign to each variable, depending on what is actually needed.
I suspect that this will be MUCH faster than anything that has to join the data sets, but you have to use tools that you feel comfortable with. Good luck.
03-22-2012 07:33 PM
If you are going to try the lookup format approach suggested by Astounding then you need to be aware that these are loaded into memory and so the size of the format is limited by your available memory.
This SAS note explains in more detail: http://support.sas.com/kb/30/997.html
I have successfully used formats with a million or so discrete values but I suspect you may run into memory limits with a 40 million unique value lookup, and most certainly with 10 billion! Only testing will confirm what your maximum is.
I like Ksharp's idea of splitting the problem by LOT_ID as a fast lookup by hash (which also uses memory) or format would probably work OK. You can then combine the results for all grouped LOT_IDs afterwards.
What is the average size of your LOT_IDs (how many UNIT_IDs in each)?
03-22-2012 10:11 PM
I would favor jcbell's approach. Depending on where your tables are, most of the time might be spent on data transit. Also note that most DBMSs will use hashing for large merge operations when it is deemed beneficial.
03-22-2012 11:50 PM
Performance is fickle, so making broad generalizations is difficult. I deal with datasets of these sizes regularly (from billions to hundreds of billions of rows, mostly web event data), and I have encountered situations that contradict all of the following, but they are usually the best starting points. I make these points with the starting assumption that there is no highly parallel DBMS available (Oracle Exadata, Teradata, Netezza, SPDS, Aster nCluster, Greenplum, Vertica, etc.), as in the vast majority of cases such a system would beat anything a single-threaded SAS DATA step ever could.
An index will typically only benefit a merge if the small table makes up under 20-30% of the larger table.
Hash tables will be very fast; however, loading a file of these sizes into one is highly unlikely unless you are dealing with tables containing very few variables on a machine with lots of RAM (my servers have 512GB and there are still many tables too large).
Formats are also a good approach for large-table merges. As mentioned, you can use one to go from an id to a variable; a variation that has not been pointed out is creating a format from the merge key to a record number and then using SET with POINT=. Sometimes this is very quick. It is similar to utilizing an index, and it breaks down as the percentage of records you are reading grows; there are also memory limits for formats, as has been pointed out.
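The key-to-record-number variation could be sketched like this, with the format name $UNITOBS chosen for illustration. An unmatched unit_id falls through the format unchanged, so the ?? modifier on INPUT quietly yields a missing pointer and the row is skipped:

```sas
/* map each unit_id to its observation number in Dataset2 */
data obsfmt (keep=fmtname start label);
   retain fmtname '$unitobs';
   set dataset2 (keep=unit_id);
   start = unit_id;
   label = put(_n_, 12.);   /* record number stored as the label */
run;

proc format cntlin=obsfmt;
run;

/* direct-access lookup: jump straight to the matching record */
data merged (drop=ptr);
   set dataset1 (keep=lot_id unit_id);
   ptr = input(put(unit_id, $unitobs.), ?? 12.);
   if ptr > 0 then set dataset2 (keep=value) point=ptr;
run;
```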
An option that has not been mentioned yet is SPDE. If you have the I/O subsystem available to use it, you can sometimes see large performance increases, because it allows multi-threaded I/O to the DATA step, moving the burden of the task more onto RAM and CPU. Index reads are also multi-threaded, which can help their performance as well. SPDE additionally allows indexes and data to be held in separate locations where the I/O systems can be optimized for the different types of usage: indexes placed on storage meant for intense IOPS and random access, data on disks geared for sequential read throughput. Also, when dealing with tables over 2 billion rows on a 32-bit operating system, you need to use SPDE anyway...
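A minimal SPDE library sketch; all paths and sizes are placeholders to be tuned for the actual hardware:

```sas
/* metadata, data, and index areas on separate I/O paths */
libname bigdata spde '/spde/meta'
   datapath=('/fastdisk1/data' '/fastdisk2/data')   /* sequential-read disks */
   indexpath=('/ssd/index')                         /* random-access storage */
   partsize=2g;                                     /* data partition size */

/* copy the lookup table into the SPDE library with an index */
data bigdata.dataset2 (index=(unit_id));
   set work.dataset2;
run;
```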
And finally, the best of these options is always to split files of this size into logical groups of smaller files, perform the tasks in parallel, and then set the results back together at the end if necessary.
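A rough sketch of the split-and-recombine idea applied to the original query, assuming a numeric lot_id so that MOD can assign buckets (for a character key, a hash of the value would serve the same purpose). Each bucket could also be run as a separate batch job to get true parallelism:

```sas
%macro by_bucket(n=12);
   %do i = 1 %to &n;
      proc sql;
         create table totals&i as
         select A.lot_id, sum(B.value) as value
         from dataset1 as A, dataset2 as B
         where A.unit_id = B.unit_id
           and mod(A.lot_id, &n) = %eval(&i - 1)  /* this bucket only */
         group by A.lot_id;
      quit;
   %end;
   /* stack the per-bucket summaries; lot_ids never span buckets */
   data all_totals;
      set
         %do i = 1 %to &n;
            totals&i
         %end;
      ;
   run;
%mend by_bucket;
%by_bucket(n=12)
```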
On another note with a related topic, I am very excited when it comes to big data and SAS with the new product SAS/ACCESS Interface to Hadoop!