Problem:
I have two datasets: The first is detail records from a very large dataset (1.2 TB) and the second is row IDs from an only slightly smaller "header" dataset (110 GB). The relation between line and header is many-to-one. I am trying to select the obs in the line that have a match in the header. The header dataset only contains the key variable.
What I've done so far:
Result:
I started the script 8 days ago and my best guess from the looking at the size of the output lck file in Windows File Explorer is that it is only one tenth through.
The help I need:
What would I need to do to access this dataset in a reasonable amount of time -- a couple of days? Should I try to break the line input dataset into chunks, sort and interleave by clm_id and then try a data step merge? If I were to request a more memory and processors for this virtual machine, how much would I need?
SAS Versions:
Smaller Header Dataset:
Query:
proc sql stimer ;
    create table saslibrary.outputdataset as
    select t.bene_id, t.clm_id, <26 other variables>
    from
        saslibrary.lineinputdataset (firstobs=4859048953 obs=5128996116) as t
        inner join saslibrary.headerinputdataset as c on (t.clm_id = c.clm_id)
    ;
    quit;
OS:
For those of you familiar with Medicaid data this is the TAF data from CMS/MACBIS.
Thank you for reading.
I would not use firstobs/obs to divide the join into subgroup joins, because a given CLM_ID may be in more than one of those subgroup joins. Instead, examine each CLM_ID once, by choosing a restricted range of CLM_ID in both datasets, for each subgroup join. This can work because CLM_ID is the join variable.
Let's say you divide your CLM_ID values into 5 ranges, each range with a lower limit (LLLLLLL) and upper limit (UUUUUUU), where LLLLLLL and UUUUUUU are quintile values. Of course, the lowest range doesn't need a specified LLLLLLL, and the highest range doesn't need a specified UUUUUUU.
Then you could run five programs, such as the below - just put in values in place of LLLLLLL and RRRRRRR:
data want1;
  set bigdataset (keep= list of variables);
  where LLLLLLL <= clm_id < UUUUUUU;
  if _n_=1 then do;
    declare hash h (dataset:'header (where=(LLLLLLL <= clm_id < UUUUUUU))');
      h.definekey('clm_id');
      h.definedata('clm_id');
      h.definedone();
  end;
  if h.check()=0;
run;
Notes:
Of course, this requires generating the quintile CLM_ID values. You could do something like this to find the quintiles:
proc sort data=header out=header_sorted  nodup;
  by clm_id;
run;
data limits (drop=CLM_ID);
  set header_sorted nobs=nclm;
  retain quintile 1;
  if _n_=1 then LLLLLLL=clm_id;
  retain LLLLLLL;
  if _N_ = ceil(nclm*(quintile/5));
  UUUUUUU=clm_id;
  output;
  quintile+1;
  LLLLLLL=UUUUUUU;
run;
proc print;
run;
I'd recommend using a DATA step MERGE. You've already sorted and indexed the smaller dataset to conform to the larger one's sort order, so you're ready to roll. I'd expect that to process much faster than the SQL join.
You can also try using FedSQL. FedSQL is multi-threaded where possible.
proc fedsql;
    create table saslibrary.outputdataset as
        select t.bene_id, t.clm_id, <26 other variables>
        from saslibrary.lineinputdataset as t
        inner join
             saslibrary.headerinputdataset as c 
        on t.clm_id = c.clm_id
    ;
quit;
@Stu_SAS , Thank you for the response. 3 follow-up questions:
Are either of the datasets compressed? Compressing the 1.2TB dataset would likely speed up joining as it will improve IO. What proportion of the rows are you selecting out of the large dataset? Do you always select ALL rows from the small dataset for sub-setting out of the large one? If so an index may not help so I suggest you try without an index to see if that improves performance.
What type of SAS library are these stored in? V9? An SPDE library might improve performance (See table in the link).
FedSQL is computationally multi-threaded, but in base SAS, it uses a single read-write thread. In the situation described, the process is most likely I/O bound, not CPU bound. So I don't think FedSQL (or threaded DS2) would help in this situation.
@Ksharp wrote:
I would like to use Hash Table to merge these two tables.
@kenkaran wrote:
- The smaller "header" dataset is too small to fit in a hash dataset even if I increased the memsize to 115 GB
(...)
Smaller Header Dataset:
- Dataset size on disk: 110 GB
- Index size on disk: 126 GB
- Obs: 1,849,842,886
- Vars: 1
- Observation Length: 64
It should be possible to use a much smaller key item for the hash object, e.g. md5(clm_id), which takes only 16 bytes, instead of the 64-byte clm_id itself. Or maybe there are obvious redundancies in the structure of the clm_id values (such as long strings of zeros or blanks) which could be "compressed" without losing information. Then the 1.8E9 key values will have a chance to fit into memory.
I'm not sure, though, if the hash lookup plus the operations needed to obtain the smaller keys on both sides of the merge perform better than a DATA step using a second SET statement with KEY=clm_id option, which benefits from the index created already. You may want to compare test runs using small subsets of both datasets so that the run times are only a few minutes.
I would not use firstobs/obs to divide the join into subgroup joins, because a given CLM_ID may be in more than one of those subgroup joins. Instead, examine each CLM_ID once, by choosing a restricted range of CLM_ID in both datasets, for each subgroup join. This can work because CLM_ID is the join variable.
Let's say you divide your CLM_ID values into 5 ranges, each range with a lower limit (LLLLLLL) and upper limit (UUUUUUU), where LLLLLLL and UUUUUUU are quintile values. Of course, the lowest range doesn't need a specified LLLLLLL, and the highest range doesn't need a specified UUUUUUU.
Then you could run five programs, such as the below - just put in values in place of LLLLLLL and RRRRRRR:
data want1;
  set bigdataset (keep= list of variables);
  where LLLLLLL <= clm_id < UUUUUUU;
  if _n_=1 then do;
    declare hash h (dataset:'header (where=(LLLLLLL <= clm_id < UUUUUUU))');
      h.definekey('clm_id');
      h.definedata('clm_id');
      h.definedone();
  end;
  if h.check()=0;
run;
Notes:
Of course, this requires generating the quintile CLM_ID values. You could do something like this to find the quintiles:
proc sort data=header out=header_sorted  nodup;
  by clm_id;
run;
data limits (drop=CLM_ID);
  set header_sorted nobs=nclm;
  retain quintile 1;
  if _n_=1 then LLLLLLL=clm_id;
  retain LLLLLLL;
  if _N_ = ceil(nclm*(quintile/5));
  UUUUUUU=clm_id;
  output;
  quintile+1;
  LLLLLLL=UUUUUUU;
run;
proc print;
run;
My sincere apologies for not replying in a more timely manner. However, the size of the data and my desire to craft intelligent responses to all the great suggestions delayed my response until now.
@mkeintz wrote:I would not use firstobs/obs to divide the join into subgroup joins, because a given CLM_ID may be in more than one of those subgroup joins. Instead, examine each CLM_ID once, by choosing a restricted range of CLM_ID in both datasets, for each subgroup join. This can work because CLM_ID is the join variable.
Let's say you divide your CLM_ID values into 5 ranges, each range with a lower limit (LLLLLLL) and upper limit (UUUUUUU), where LLLLLLL and UUUUUUU are quintile values. Of course, the lowest range doesn't need a specified LLLLLLL, and the highest range doesn't need a specified UUUUUUU.
Then you could run five programs, such as the below - just put in values in place of LLLLLLL and RRRRRRR:
@RichardAD wrote:
Let T = duration for flat read of detail table D
Let K = number of header keys that _can_ fit in a hash table.
Do 1,849,842,886 / K data step reads through D with hash lookup selection.
Append selections of each run-through.
Many many thanks to @mkeintz and @RichardAD Their technique was probably the least glamorous, but in the end it carried the day!
Steps I took:
Now that this unwieldy dataset has been cut down to size, I have come up with a number of enhancements inspired by many of these answers:
Thanks again @mkeintz and @RichardAD !
Responses to most who responded:
Follow-up question to @RichardAD: can you direct Proc DS2 to make use of THREADs if you are just looking up a huge number of sequential observations in a hash? I did read but the answer wasn't clear from this page. Also see @SAS_Jedi 's comment.
@SASKiwi wrote:
Are either of the datasets compressed? Compressing the 1.2TB dataset would likely speed up joining as it will improve IO.
@Patrick wrote:
I also would use the SPDE engine for storing such a huge SAS table.
data spde_saslibrary.want(compress=yes);
Both the small-er (header) and large (line) datasets are compressed with the binary option. Also the "point to observations" option is set to "yes." I feel that this is a big part of what is slowing down processing. It would seem that trying to point to a dataset that is compressed leads to a lot of decompression and computation about where to seek the obs pointer to on the disk. I think this may actually _increase_ IO and will definitely increase CPU. I know it definitely increases run time. Is the compression of big datasets worth the overhead? Once I extracted data into an uncompressed format, everything ran _much_ faster.
@FreelanceReinh wrote:
It should be possible to use a much smaller key item for the hash object, e.g. md5(clm_id), which takes only 16 bytes, instead of the 64-byte clm_id itself.
This is an innovative approach. However, I would be calling the md5 function billions to trillions of times. I'm not sure what overhead that might add. As @Patrick pointed out, there is a risk -- even slight -- of collision. I am skittish about this approach.
Because FEDSQL does not allow SAS data set options to limit obs and firstobs, I was not able to use this solution. The data is just too d--n big. Also, @SAS_Jedi also questions whether the multi-threading would be useful here.
Thank you for your stand-alone code example. However, the proc sort data=tempbig; is just not possible in my environment.
I downloaded Paul Dorfman's paper. It is very complex, and I do not profess to have understood it thoroughly. However, the good Dr Dorfman does say on page 3 of the referenced paper that Bitmapping is suitable for "no-matter-how-many-short-keys." On the bottom of page 2, he says I would need to allocate an array of 10**[60]/53 which is 18,867,924,528,301,900,000,000,000,000,000,000,000,000,000,000,000,000,000,000 . So Bitmapping is not a practical solution.
@quickbluefish wrote:
Why do you want to join these?
I get this a lot. This answer is specific to the topic area and not really of interest to SAS Users per se. The TAF is a collection of Medicaid Mgmt Info Sys (MMISs) data. Half of this file contains financial-only transactions that are not of current interest the researchers I work with. Plowing through half of all this data only to delete it is a fantastic waste of both people and computer resources. I am trying to get rid of this financial-only half to make it more usable. The dataset is static and once this is done, I won't need to do it again.
You certainly want to avoid sorting your big dataset and though using a hash table lookup feels like a good option.
Given the length of your key variable is 64 I assume that's already a digest hex value created using sha-256.
You can't fit all the keys of your header table into memory and though like @FreelanceReinh I've been thinking how to reduce the size of your key values so they can fit. Converting your key values to an md5 binary string should reduce memory requirements to what's available to you. BUT using md5 instead of sha-256 will increase the collision risk which with your data volume isn't negligeable. IF the still small risk is acceptable that you select a key that's not in your list then using md5 like in below sample code should be an option.
I also would use the SPDE engine for storing such a huge SAS table.
data spde_saslibrary.want(compress=yes);
  if _n_=1 then
    do;
      length _key $16;
      dcl hash h1();
      h1.defineKey('_key');
      h1.defineDone();
      do until(_done);
        set saslibrary.headerinputdataset end=_done;
        _key=md5(clm_id);
        _rc=h1.ref();
      end;
    end;
  set saslibrary.lineinputdataset;
  _key=md5(clm_id);
  if h1.check()=0 then output;
  drop _:;
run;...and I believe to remember that at one point there was an issue with the hash that when one didn't define Data the key variables got used as Data doubling the required memory. Should I remember right and if that's still an issue with your SAS version then eventually load a placeholder Data variable for this not to happen.
data spde_saslibrary.want(compress=yes);
  if _n_=1 then
    do;
      length _key $16;
      retain _placeholder ' ';
      dcl hash h1();
      h1.defineKey('_key');
      h1.defineData('_placeholder');
      h1.defineDone();
      do until(_done);
        set saslibrary.headerinputdataset end=_done;
        _key=md5(clm_id);
        _rc=h1.ref();
      end;
    end;
  set saslibrary.lineinputdataset;
  _key=md5(clm_id);
  if h1.check()=0 then output;
  drop _:;
run;Now... If your key variable clm_id contains a 64 character hex string then that's a base16 value. Another way for shortening the string without increasing the collision risk could be to convert this base16 value to a base32 value. 
I'm not sure how much processing time such a conversion would add but it's certainly worth giving it a shot - if you can make it work. The approaches I've seen allways first convert the values to base10 and need to do summations. Problem with SAS is that a sha-256 doesn't fit as a full precision integer into a SAS numerical variable. One could do it using something like Python which supports such large integers or then find another approach which doesn't require an intermediary numerical variable.
It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.
