Performance is fickle, so making broad generalizations is difficult. I deal with datasets of these sizes regularly (billions to hundreds of billions of records, mostly web event data), and I have run into situations that contradict all of the following, but these are usually the best starting points. I make these points assuming there is no massively parallel DBMS available (Oracle Exadata, Teradata, Netezza, SPDS, Aster nCluster, Greenplum, Vertica, etc.), since in the vast majority of cases such a system will beat anything a single-threaded SAS DATA step ever could.

An index will typically only benefit a merge if the small table makes up under 20-30% of the larger table.

Hash tables are very fast, but loading a file of these sizes into memory is highly unlikely unless the table contains very few variables and the machine has lots of RAM (my servers have 512GB and there are still many tables too large).

Formats are also a good approach for large table merges. As mentioned, you can use a format to go straight from an ID to a variable; a variation that has not been pointed out is creating a format from the merge key to the record number and then using SET with POINT=. Sometimes this is very quick. It behaves much like an index and breaks down as the percentage of records being read grows, and formats have memory limits as well, as has been pointed out.

An option that has not been mentioned yet is SPDE. If you have the I/O subsystem to support it, you can sometimes see large performance gains because it allows multi-threaded I/O into the DATA step, shifting more of the burden onto RAM and CPU. Index reads are also multi-threaded, which helps their performance too. SPDE additionally lets indexes and data be held in separate locations so each I/O system can be optimized for its type of usage: indexes on storage meant for intense IOPS and random access, data on disks geared for sequential read throughput. Also, for tables over 2 billion rows on a 32-bit operating system, you need SPDE anyway.

Finally, the best of these options is to split files of this size into logical groups of smaller files, perform the work on them in parallel, and then set the results back together at the end if necessary. Rough sketches of each of these approaches are at the end of this post.

On another, related note: I am very excited about big data and SAS with the new product SAS/ACCESS Interface to Hadoop!
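For the indexed-merge case, here is a minimal sketch of driving the read from the small table and doing keyed reads against the indexed big table. The table names (SMALL, BIG) and key (ID) are just illustrative placeholders:

/* Sketch: indexed lookup join, assuming BIG carries an index on ID
   and SMALL holds the (relatively small) set of keys of interest */
proc datasets lib=work nolist;
   modify big;
   index create id;
quit;

data joined;
   set small;                  /* drive with the small table */
   set big key=id / unique;    /* direct keyed read from the big table */
   if _iorc_ ne 0 then do;     /* key not found in BIG */
      _error_ = 0;             /* suppress the error flag/message */
      delete;
   end;
run;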
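If the lookup side does fit in memory, a hash-object merge looks roughly like the sketch below (names again illustrative; only the variables actually needed are loaded to keep the memory footprint down):

data joined;
   if _n_ = 1 then do;
      if 0 then set small(keep=id segment);      /* host variables for the hash */
      declare hash h(dataset: 'small(keep=id segment)');
      h.defineKey('id');
      h.defineData('segment');
      h.defineDone();
   end;
   set big;
   if h.find() = 0;   /* keep matches only; drop this IF for a left-join effect */
run;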
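The format-to-record-number trick would look something like this: build a numeric format that maps the merge key to the observation number in the lookup table, then use SET with POINT= for direct reads. LOOKUP, BIG, and a numeric key ID are assumptions for the sketch:

data cntlin;
   set lookup end=last;
   retain fmtname 'recno' type 'n';
   start = id;                 /* ID assumed numeric */
   label = put(_n_, 12.);      /* key -> observation number */
   output;
   if last then do;            /* keys not found in LOOKUP map to 0 */
      hlo = 'O';
      label = '0';
      output;
   end;
   keep fmtname type start label hlo;
run;

proc format cntlin=cntlin;
run;

data joined;
   set big;
   recno = input(put(id, recno.), 12.);
   if recno > 0 then do;
      set lookup point=recno;  /* direct-access read of the matching row */
      output;
   end;
   drop recno;
run;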
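For SPDE, a library definition might look like the sketch below; the paths are placeholders for however your metadata, data, and index storage is actually laid out:

libname bigdat spde '/spde/meta'
        datapath=('/spde/data1' '/spde/data2' '/spde/data3')
        indexpath=('/spde/idx1' '/spde/idx2');

/* Index creation and reads against BIGDAT members are multi-threaded;
   EVENTS and VISITOR_ID are example names */
proc datasets lib=bigdat nolist;
   modify events;
   index create visitor_id;
quit;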
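And for splitting, the idea is simply to break the big table into logical groups (here a 4-way split on a MOD of the key, which is only illustrative), run the heavy step against each piece, ideally in separate SAS sessions (e.g. MP CONNECT if it is licensed), and then set the results back together:

data part1 part2 part3 part4;
   set big;
   select (mod(id, 4));
      when (0) output part1;
      when (1) output part2;
      when (2) output part3;
      otherwise output part4;
   end;
run;

/* ...run the expensive merge/summary against each PARTn, ideally in
   parallel sessions, producing RESULT1-RESULT4... */

data combined;
   set result1 result2 result3 result4;
run;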