Problem:
I have two datasets: the first is the detail ("line") records of a very large dataset (1.2 TB), and the second is the row IDs from a smaller "header" dataset (110 GB). The relation between line and header is many-to-one. I am trying to select the obs in the line dataset that have a match in the header. The header dataset contains only the key variable.
What I've done so far:
Result:
I started the script 8 days ago, and my best guess from looking at the size of the output .lck file in Windows File Explorer is that it is only about one-tenth of the way through.
The help I need:
What would I need to do to process this dataset in a reasonable amount of time -- a couple of days? Should I try to break the line input dataset into chunks, sort and interleave by clm_id, and then try a DATA step merge? If I were to request more memory and processors for this virtual machine, how much would I need?
SAS Versions:
Smaller Header Dataset:
Query:
proc sql stimer ;
create table saslibrary.outputdataset as
select t.bene_id, t.clm_id, <26 other variables>
from
saslibrary.lineinputdataset (firstobs=4859048953 obs=5128996116) as t
inner join saslibrary.headerinputdataset as c on (t.clm_id = c.clm_id)
;
quit;
OS:
For those of you familiar with Medicaid data this is the TAF data from CMS/MACBIS.
Thank you for reading.
I'd recommend using a DATA step MERGE. You've already sorted and indexed the smaller dataset to conform to the larger one's sort order, so you're ready to roll. I'd expect that to process much faster than the SQL join.
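A minimal sketch of that MERGE, assuming both datasets are sorted by clm_id (names taken from your SQL; the IN= flags do the inner-join filtering):
data saslibrary.outputdataset;
   merge saslibrary.lineinputdataset   (in=in_line keep=bene_id clm_id <26 other variables>)
         saslibrary.headerinputdataset (in=in_head);
   by clm_id;
   if in_line and in_head;   /* inner join: keep only line rows whose clm_id has a header match */
run;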
You can also try using FedSQL. FedSQL is multi-threaded where possible.
proc fedsql;
create table saslibrary.outputdataset as
select t.bene_id, t.clm_id, <26 other variables>
from saslibrary.lineinputdataset as t
inner join
saslibrary.headerinputdataset as c
on t.clm_id = c.clm_id
;
quit;
@Stu_SAS, thank you for the response. Three follow-up questions:
- Are either of the datasets compressed? Compressing the 1.2 TB dataset would likely speed up the join because it reduces I/O (a sketch of the COMPRESS= option follows below).
- What proportion of the rows are you selecting out of the large dataset?
- Do you always select ALL rows from the small dataset when subsetting the large one? If so, an index may not help, so I suggest you try without an index to see whether that improves performance.
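If they aren't compressed, here is a minimal sketch of rewriting the detail dataset with the COMPRESS= dataset option (the _c name is just a placeholder; test on a small slice first, since rewriting 1.2 TB is itself expensive):
/* COMPRESS=BINARY (RDC) usually suits long, mostly numeric detail records;
   COMPRESS=YES (RLE) tends to suit character-heavy data. */
data saslibrary.lineinputdataset_c (compress=binary);
   set saslibrary.lineinputdataset;
run;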
What type of SAS library are these stored in? V9? An SPDE library might improve performance (See table in the link).
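For what it's worth, a minimal sketch of such an SPDE library definition -- the libref and all paths and sizes are placeholders for your environment:
/* Spreading DATAPATH= across separate physical drives lets SPDE use parallel I/O threads. */
libname spdelib spde 'D:\spde\meta'
        datapath=('E:\spde\data1' 'F:\spde\data2')
        indexpath=('G:\spde\index')
        partsize=50g;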
FedSQL is computationally multi-threaded, but in base SAS, it uses a single read-write thread. In the situation described, the process is most likely I/O bound, not CPU bound. So I don't think FedSQL (or threaded DS2) would help in this situation.
@Ksharp wrote:
I would use a hash table to merge these two tables.
@kenkaran wrote:
- The smaller "header" dataset is too large to fit into a hash object even if I increased MEMSIZE to 115 GB
(...)
Smaller Header Dataset:
- Dataset size on disk: 110 GB
- Index size on disk: 126 GB
- Obs: 1,849,842,886
- Vars: 1
- Observation Length: 64
It should be possible to use a much smaller key item for the hash object, e.g. md5(clm_id), which takes only 16 bytes, instead of the 64-byte clm_id itself. Or maybe there are obvious redundancies in the structure of the clm_id values (such as long strings of zeros or blanks) which could be "compressed" without losing information. Then the 1.8E9 key values will have a chance to fit into memory.
I'm not sure, though, if the hash lookup plus the operations needed to obtain the smaller keys on both sides of the merge perform better than a DATA step using a second SET statement with KEY=clm_id option, which benefits from the index created already. You may want to compare test runs using small subsets of both datasets so that the run times are only a few minutes.
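For reference, a minimal sketch of the indexed-lookup variant, assuming the index on clm_id you've already created on the header dataset (the /UNIQUE option matters because the line data has many rows per clm_id):
data saslibrary.outputdataset;
   set saslibrary.lineinputdataset (keep=bene_id clm_id <26 other variables>);
   set saslibrary.headerinputdataset key=clm_id / unique;
   if _iorc_ = 0 then output;   /* clm_id found in the indexed header dataset */
   else _error_ = 0;            /* reset the error flag so unmatched obs are not dumped to the log */
run;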
I would not use firstobs/obs to divide the join into subgroup joins, because a given CLM_ID may be in more than one of those subgroup joins. Instead, examine each CLM_ID once, by choosing a restricted range of CLM_ID in both datasets, for each subgroup join. This can work because CLM_ID is the join variable.
Let's say you divide your CLM_ID values into 5 ranges, each range with a lower limit (LLLLLLL) and upper limit (UUUUUUU), where LLLLLLL and UUUUUUU are quintile values. Of course, the lowest range doesn't need a specified LLLLLLL, and the highest range doesn't need a specified UUUUUUU.
Then you could run five programs, such as the one below - just put in values in place of LLLLLLL and UUUUUUU:
data want1;
   set bigdataset (keep=clm_id <other needed variables>);
   where LLLLLLL <= clm_id < UUUUUUU;
   if _n_=1 then do;
      /* Load only the header keys for this CLM_ID range into the hash.
         If clm_id is character, quote the LLLLLLL and UUUUUUU values. */
      declare hash h (dataset:'header (where=(LLLLLLL <= clm_id < UUUUUUU))');
      h.definekey('clm_id');
      h.definedata('clm_id');
      h.definedone();
   end;
   if h.check()=0;   /* keep only line obs whose clm_id is found in the header range */
run;
Notes:
Of course, this requires generating the quintile CLM_ID values. You could do something like this to find the quintiles:
proc sort data=header out=header_sorted nodup;
   by clm_id;
run;

data limits (drop=CLM_ID);
   set header_sorted nobs=nclm;
   retain quintile 1;
   retain LLLLLLL;
   if _n_=1 then LLLLLLL=clm_id;         /* lower limit of the first range */
   if _N_ = ceil(nclm*(quintile/5));     /* keep only the last key of each quintile */
   UUUUUUU=clm_id;                       /* upper limit of the current range */
   output;
   quintile+1;
   LLLLLLL=UUUUUUU;                      /* the next range starts where this one ends */
run;

proc print data=limits;
run;
You certainly want to avoid sorting your big dataset, so a hash table lookup feels like a good option.
Given that the length of your key variable is 64, I assume it's already a hex digest created with SHA-256.
You can't fit all the keys of your header table into memory, so like @FreelanceReinh I've been thinking about how to reduce the size of your key values so that they can fit. Converting your key values to a 16-byte MD5 binary string should reduce the memory requirement to what's available to you. BUT using MD5 instead of SHA-256 increases the collision risk, which with your data volume isn't negligible. IF that still-small risk of matching a key that's not actually in your list is acceptable, then using MD5 as in the sample code below should be an option.
I would also use the SPDE engine for storing such a huge SAS table.
data spde_saslibrary.want(compress=yes);

   /* Load the MD5-shortened header keys into a hash object on the first iteration */
   if _n_=1 then
      do;
         length _key $16;
         dcl hash h1();
         h1.defineKey('_key');
         h1.defineDone();
         do until(_done);
            set saslibrary.headerinputdataset end=_done;
            _key=md5(clm_id);    /* 16-byte binary MD5 digest of the 64-byte clm_id */
            _rc=h1.ref();        /* adds the key only if it is not already in the hash */
         end;
      end;

   /* Keep only line observations whose MD5 key is found in the hash */
   set saslibrary.lineinputdataset;
   _key=md5(clm_id);
   if h1.check()=0 then output;

   drop _:;
run;
...and I seem to remember that at one point there was an issue with the hash object: if you didn't define a data item, the key variables also got stored as data, doubling the required memory. If I remember correctly, and that's still an issue with your SAS version, then load a small placeholder data variable to prevent this, as in the variant below.
data spde_saslibrary.want(compress=yes);

   if _n_=1 then
      do;
         length _key $16;
         retain _placeholder ' ';
         dcl hash h1();
         h1.defineKey('_key');
         h1.defineData('_placeholder');   /* 1-byte dummy data item so the key is not also stored as data */
         h1.defineDone();
         do until(_done);
            set saslibrary.headerinputdataset end=_done;
            _key=md5(clm_id);
            _rc=h1.ref();
         end;
      end;

   set saslibrary.lineinputdataset;
   _key=md5(clm_id);
   if h1.check()=0 then output;

   drop _:;
run;
Now... if your key variable clm_id contains a 64-character hex string, then that's a base-16 value. Another way to shorten the string without increasing the collision risk could be to convert this base-16 value to a base-32 value.
I'm not sure how much processing time such a conversion would add, but it's certainly worth giving it a shot - if you can make it work. The approaches I've seen always first convert the values to base 10 and need to do summations. The problem with SAS is that a SHA-256 value doesn't fit as a full-precision integer into a SAS numeric variable. One could do it with something like Python, which supports such large integers, or else find another approach that doesn't require an intermediary numeric variable.
@Patrick wrote:
BUT using MD5 instead of SHA-256 increases the collision risk, which with your data volume isn't negligible.
According to simple approximation formulas (assuming the MD5 digests are uniformly distributed random strings, which might be too optimistic, but I'm not sure), the collision probability for @kenkaran's 1.8E9 keys should be approx. 5E-21, i.e., extremely small (see, e.g., https://towardsdatascience.com/collision-risk-in-hash-based-surrogate-keys-4c87b716cbcd/).
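For anyone who wants to reproduce that figure: it follows from the usual birthday-bound approximation p ≈ n**2/2**(b+1) for n keys and a b-bit digest (under the same uniformity assumption as above):
data _null_;
   n = 1849842886;           /* number of distinct clm_id values */
   b = 128;                  /* MD5 digest length in bits */
   p = n**2 / 2**(b+1);      /* birthday-bound collision approximation */
   put p= e10.;              /* roughly 5E-21 */
run;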
@Patrick wrote:
I seem to remember that at one point there was an issue with the hash object: if you didn't define a data item, the key variables also got stored as data, doubling the required memory.
Adding the 1-byte dummy data item won't hurt. Tests on my Windows SAS 9.4M5 suggest, however, that for keys with length 16 there is no decrease in item_size by doing so. (The benefit starts at length 17.)
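If you want to check this on your own release, here is a small sketch using the hash object's ITEM_SIZE attribute (variable names are arbitrary; an empty hash suffices because the item size depends only on the key/data definition):
data _null_;
   length _key $16 _placeholder $1;
   call missing(_key, _placeholder);

   dcl hash h1();                      /* key only */
   h1.definekey('_key');
   h1.definedone();

   dcl hash h2();                      /* key plus 1-byte dummy data item */
   h2.definekey('_key');
   h2.definedata('_placeholder');
   h2.definedone();

   size1 = h1.item_size;
   size2 = h2.item_size;
   put size1= size2=;
run;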
@Patrick wrote:
Now... if your key variable clm_id contains a 64-character hex string, then that's a base-16 value. Another way to shorten the string without increasing the collision risk could be to convert this base-16 value to a base-32 value.
I would rather favor base 256, as it is both simpler to obtain -- the $HEX64. informat does the conversion -- and more effective: The string length is halved to 32 bytes (as opposed to 52 bytes with base 32).
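A minimal sketch of that conversion (the dummy hex value and the _key name are just for illustration):
data _null_;
   length clm_id $64 _key $32;
   clm_id = repeat('0123456789ABCDEF', 3);   /* dummy 64-character hex string */
   _key   = input(clm_id, $hex64.);          /* 64 hex characters -> 32 raw bytes */
   put _key= $hex64.;                        /* show the packed value as hex again */
run;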