My sincere apologies for not replying in a more timely manner. The size of the data, and my desire to craft intelligent responses to all the great suggestions, delayed my response until now.

@mkeintz wrote:
I would not use firstobs/obs to divide the join into subgroup joins, because a given CLM_ID may be in more than one of those subgroup joins. Instead, examine each CLM_ID once by choosing a restricted range of CLM_ID in both datasets for each subgroup join. This can work because CLM_ID is the join variable. Let's say you divide your CLM_ID values into 5 ranges, each range with a lower limit (LLLLLLL) and upper limit (UUUUUUU), where LLLLLLL and UUUUUUU are quintile values. Of course, the lowest range doesn't need a specified LLLLLLL, and the highest range doesn't need a specified UUUUUUU. Then you could run five programs, such as the one below - just put in values in place of LLLLLLL and UUUUUUU.

@RichardAD wrote:
Let T = duration for a flat read of detail table D. Let K = number of header keys that _can_ fit in a hash table. Do 1,849,842,886 / K data step reads through D with hash lookup selection. Append the selections of each run-through.

Many, many thanks to @mkeintz and @RichardAD. Their technique was probably the least glamorous, but in the end it carried the day!

Steps I took:
1. Through trial and error, I found how many key values could comfortably fit in memory. That became my chunk size.
2. Dividing the dataset size by the chunk size gave the number of chunks: 7.
3. I created a dataset of the sorted, unique key values (only).
4. From this dataset, I created 7 ranges of low and high key values.
5. I read through the entire big dataset, sorting all obs into one of 7 line chunks. I saved space on this step by keeping only the variables of interest (29 of 59), even though all obs were selected. Each year ran approximately 4 hours and created 7 chunks.
6. I loaded the first header/key chunk into an in-memory hash and used it to select from the first line chunk.
7. This was repeated for each of the 7 chunks, and the 7 chunks were repeated for each of the 6 years, 2016 through 2021. On average each chunk took 75 minutes: 75 minutes * 7 chunks * 6 years ~ 53 hours.
8. The result: 64% of the obs were selected. Problem solved.
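For anyone who wants to try the same thing, step 6 for a single chunk looked roughly like the sketch below. This is only a minimal sketch, not my production code; the library, dataset, and variable names (work.hdr_keys_c1, work.line_chunk_c1, clm_id) are simplified stand-ins for the real TAF names.

/* Load one chunk of header keys into a hash and keep only the
   line observations whose CLM_ID appears in that chunk. */
data work.line_selected_c1;
   if _n_ = 1 then do;
      declare hash hk(dataset:'work.hdr_keys_c1', hashexp:20);
      hk.defineKey('clm_id');
      hk.defineDone();
   end;
   set work.line_chunk_c1;      /* the pre-split line chunk           */
   if hk.find() = 0;            /* subsetting IF: keep matches only   */
run;

The same step was then pointed at each of the remaining key/line chunk pairs, for each of the 6 years.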
Now that this unwieldy dataset has been cut down to size, I have come up with a number of enhancements inspired by many of these answers:
- Convert each unique 64-character claim header key into a sequence number from 1 to ~2.1B.
- Convert that sequence number into a base36 string (license-plate style). This lets a 64-character key be stored in 6 characters, since 36**6 is about 2.2 billion. A sketch of this conversion is at the end of this reply.
- Repeat the same process for the beneficiary (or patient) ID.

Now that the keys are cut down to size and the obs are cut down to size, the data can be indexed. Also, other storage formats such as SPDE and access methods such as FedSQL can be researched and perhaps employed. Thanks again @mkeintz and @RichardAD!

Responses to most who responded:

Follow-up question to @RichardAD: can you direct PROC DS2 to make use of THREADs if you are just looking up a huge number of sequential observations in a hash? I did read this page, but the answer wasn't clear to me. Also see @SAS_Jedi's comment.

@SASKiwi wrote:
Are either of the datasets compressed? Compressing the 1.2TB dataset would likely speed up joining as it will improve IO.

@Patrick wrote:
I also would use the SPDE engine for storing such a huge SAS table.
data spde_saslibrary.want(compress=yes);

Both the smaller (header) and large (line) datasets are compressed with the binary option. Also, the "point to observations" option (POINTOBS=YES) is set. I feel that this is a big part of what is slowing down processing. It would seem that trying to point into a compressed dataset leads to a lot of decompression, plus computation about where to seek the obs pointer on disk. I think this may actually _increase_ IO and will definitely increase CPU; I know it definitely increases run time. Is the compression of big datasets worth the overhead? Once I extracted the data into an uncompressed format, everything ran _much_ faster.

@FreelanceReinh wrote:
It should be possible to use a much smaller key item for the hash object, e.g. md5(clm_id), which takes only 16 bytes, instead of the 64-byte clm_id itself.

This is an innovative approach. However, I would be calling the MD5 function billions to trillions of times, and I'm not sure what overhead that might add. As @Patrick pointed out, there is also a risk -- even if slight -- of collision. I am skittish about this approach.

@Stu_SAS: Because FedSQL does not allow the SAS data set options obs= and firstobs= to limit observations, I was not able to use this solution. The data is just too d--n big. @SAS_Jedi also questioned whether the multi-threading would be useful here.

@KachiM: Thank you for your stand-alone code example. However, the proc sort data=tempbig; is just not possible in my environment.

@whymath: I downloaded Paul Dorfman's paper. It is very complex, and I do not profess to have understood it thoroughly. However, the good Dr. Dorfman does say on page 3 of the referenced paper that bitmapping is suitable for "no-matter-how-many-short-keys." On the bottom of page 2, he says I would need to allocate an array of 10**60/53 elements, which is 18,867,924,528,301,900,000,000,000,000,000,000,000,000,000,000,000,000,000,000. So bitmapping is not a practical solution here.

@quickbluefish wrote:
Why do you want to join these?

I get this a lot. This answer is specific to the topic area and not really of interest to SAS users per se. The TAF is a collection of Medicaid Management Information System (MMIS) data. Half of this file contains financial-only transactions that are not of current interest to the researchers I work with. Plowing through half of all this data only to delete it is a fantastic waste of both people and computer resources. I am trying to get rid of this financial-only half to make the data more usable. The dataset is static, and once this is done, I won't need to do it again.
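As promised above, here is what the base36 re-keying could look like. This is a minimal sketch under my assumptions: work.unique_hdr_keys is a placeholder dataset holding one row per distinct clm_id, and the resulting crosswalk would then be joined (or hashed) back onto both the header and line tables.

/* Assign each distinct 64-character key a sequence number and encode
   that number as a 6-character base-36 string (0-9, A-Z).
   36**6 = 2,176,782,336, enough for ~2.1B distinct keys. */
data work.key_xwalk(keep=clm_id clm_seq clm_key36);
   set work.unique_hdr_keys;          /* one row per distinct clm_id   */
   length clm_key36 $6 digits $36;
   retain digits '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ';
   clm_seq = _n_;                     /* sequence number 1 to ~2.1B    */
   n = clm_seq;
   do i = 6 to 1 by -1;               /* fill the string from the right */
      substr(clm_key36, i, 1) = char(digits, mod(n, 36) + 1);
      n = floor(n / 36);
   end;
run;

Storing a 6-byte surrogate key instead of the 64-byte original should also make the planned indexes far smaller.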