<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Re: Join vs Merge 1.2 TB with 110 GB Datasets in SAS Programming</title>
    <link>https://communities.sas.com/t5/SAS-Programming/Join-vs-Merge-1-2-TB-with-110-GB-Datasets/m-p/964192#M375532</link>
    <description>Maybe a bitmap can help; see Paul M. Dorfman’s classic article and give it a try: &lt;A href="https://support.sas.com/resources/papers/proceedings/proceedings/sugi26/p008-26.pdf" target="_blank"&gt;https://support.sas.com/resources/papers/proceedings/proceedings/sugi26/p008-26.pdf&lt;/A&gt;</description>
    <pubDate>Mon, 14 Apr 2025 02:39:39 GMT</pubDate>
    <dc:creator>whymath</dc:creator>
    <dc:date>2025-04-14T02:39:39Z</dc:date>
    <item>
      <title>Join vs Merge 1.2 TB with 110 GB Datasets</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Join-vs-Merge-1-2-TB-with-110-GB-Datasets/m-p/964079#M375495</link>
      <description>&lt;P&gt;&lt;STRONG&gt;Problem:&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;I have two datasets: the first is detail records from a very large dataset (1.2 TB) and the second is row IDs from an only slightly smaller "header" dataset (110 GB). The relation between line and header is many-to-one. I am trying to select the obs in the line dataset that have a match in the header. The header dataset only contains the key variable.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;What I've done so far:&lt;/STRONG&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;The smaller "header" dataset is too large to fit in a hash object even if I increased the memsize to 115 GB – almost all of the available memory on the box!&lt;/LI&gt;&lt;LI&gt;I sorted and indexed the smaller header dataset by the key variable.&lt;/LI&gt;&lt;LI&gt;I selected 1/20th of the large dataset using the firstobs and obs dataset options.&lt;/LI&gt;&lt;LI&gt;I used PROC SQL because I was advised that it is multi-threaded.&lt;/LI&gt;&lt;LI&gt;I read the post &lt;A href="https://communities.sas.com/t5/SAS-Programming/Efficient-way-of-merging-very-large-datasets/m-p/40080" target="_self"&gt;Efficient Way of Merging Very Large Datasets&lt;/A&gt;.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&lt;STRONG&gt;Result:&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;I started the script 8 days ago and my best guess, from looking at the size of the output .lck file in Windows File Explorer, is that it is only about one tenth of the way through.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;The help I need:&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;What would I need to do to access this dataset in a reasonable amount of time -- a couple of days? Should I try to break the line input dataset into chunks, sort and interleave by clm_id, and then try a DATA step merge? If I were to request more memory and processors for this virtual machine, how much would I need?&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;SAS Versions:&lt;/STRONG&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;The large dataset was created under SAS ver 9.0401M7 but the small dataset was created under 9.0401M5. They are being accessed under 9.0401M5.&lt;/LI&gt;&lt;LI&gt;Large Line Dataset: taf_other_services_line (16)&lt;/LI&gt;&lt;LI&gt;Size on disk: 1.22 TB&lt;/LI&gt;&lt;LI&gt;Obs: 5,398,943,292&lt;/LI&gt;&lt;LI&gt;Vars: 59&lt;/LI&gt;&lt;LI&gt;Observation Length: 525&lt;/LI&gt;&lt;LI&gt;Page Size: 65,536 / Pages: 19,749,411&lt;/LI&gt;&lt;LI&gt;Indexes: 0 / Sorted: NO / Point to Observations: YES&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&lt;STRONG&gt;Smaller Header Dataset:&lt;/STRONG&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Dataset size on disk: 110 GB&lt;/LI&gt;&lt;LI&gt;Index size on disk: 126 GB&lt;/LI&gt;&lt;LI&gt;Obs: 1,849,842,886&lt;/LI&gt;&lt;LI&gt;Vars: 1&lt;/LI&gt;&lt;LI&gt;Observation Length: 64&lt;/LI&gt;&lt;LI&gt;Page Size: 65,536 / Pages: 1,811,797&lt;/LI&gt;&lt;LI&gt;Indexes: 1 / Sorted: YES&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&lt;STRONG&gt;Query:&lt;/STRONG&gt;&lt;/P&gt;&lt;LI-CODE lang="sas"&gt;proc sql stimer ;
    create table saslibrary.outputdataset as
    select t.bene_id, t.clm_id, &amp;lt;26 other variables&amp;gt;
    from
        saslibrary.lineinputdataset (firstobs=4859048953 obs=5128996116) as t
        inner join saslibrary.headerinputdataset as c on (t.clm_id = c.clm_id)
    ;
quit;&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;OS:&lt;/STRONG&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;MS Windows Server 2016 Standard V 10.0.14393 Build 14393&lt;/LI&gt;&lt;LI&gt;Hardware according to Windows Task Manager:&lt;/LI&gt;&lt;LI&gt;Memory Installed: 128 GB&lt;/LI&gt;&lt;LI&gt;Virtual Memory: 46 GB&lt;/LI&gt;&lt;LI&gt;Page File Space: 18.0 GB&lt;/LI&gt;&lt;LI&gt;Maximum Speed: 2.90 GHz&lt;/LI&gt;&lt;LI&gt;Sockets: 6&lt;/LI&gt;&lt;LI&gt;Virtual processors: 12&lt;/LI&gt;&lt;LI&gt;L1 cache: n/a&lt;/LI&gt;&lt;LI&gt;Processor: Intel Xeon Gold 6542Y&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;For those of you familiar with Medicaid data, this is the TAF data from CMS/MACBIS.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Thank you for reading.&lt;/P&gt;</description>
      <pubDate>Thu, 10 Apr 2025 22:15:23 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Join-vs-Merge-1-2-TB-with-110-GB-Datasets/m-p/964079#M375495</guid>
      <dc:creator>kenkaran</dc:creator>
      <dc:date>2025-04-10T22:15:23Z</dc:date>
    </item>
    <item>
      <title>Re: Join vs Merge 1.2 TB with 110 GB Datasets</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Join-vs-Merge-1-2-TB-with-110-GB-Datasets/m-p/964135#M375504</link>
      <description>&lt;OL&gt;
&lt;LI&gt;PROC SQL is &lt;EM&gt;not&lt;/EM&gt; multi-threaded, so there is no threading advantage over the DATA step.&lt;/LI&gt;
&lt;LI&gt;SQL and the DATA step produce the same result set in a many-to-one situation.&lt;/LI&gt;
&lt;LI&gt;Because of the fundamental differences between SQL and DATA step processing, the SQL join will be significantly more resource-intensive.&lt;/LI&gt;
&lt;/OL&gt;
&lt;P&gt;I'd recommend using a DATA step MERGE. You've already sorted and indexed the smaller dataset to conform to the larger one's sort order, so you're ready to roll. I'd expect that to process much faster than the SQL join.&amp;nbsp;&lt;/P&gt;
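&lt;P&gt;A minimal sketch of such a MERGE, reusing the dataset names from the question (and assuming both inputs are in clm_id order), might look like this:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data saslibrary.outputdataset;
  /* both inputs must be in clm_id order for MERGE ... BY */
  merge saslibrary.lineinputdataset   (in=online)
        saslibrary.headerinputdataset (in=onhead);
  by clm_id;
  if online and onhead;  /* many-to-one: keep line rows that have a header match */
run;&lt;/CODE&gt;&lt;/PRE&gt;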
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Fri, 11 Apr 2025 19:11:26 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Join-vs-Merge-1-2-TB-with-110-GB-Datasets/m-p/964135#M375504</guid>
      <dc:creator>SASJedi</dc:creator>
      <dc:date>2025-04-11T19:11:26Z</dc:date>
    </item>
    <item>
      <title>Re: Join vs Merge 1.2 TB with 110 GB Datasets</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Join-vs-Merge-1-2-TB-with-110-GB-Datasets/m-p/964136#M375505</link>
      <description>&lt;P&gt;You can also try using FedSQL. FedSQL is multi-threaded where possible.&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;proc fedsql;
    create table saslibrary.outputdataset as
        select t.bene_id, t.clm_id, &amp;lt;26 other variables&amp;gt;
        from saslibrary.lineinputdataset as t
        inner join
             saslibrary.headerinputdataset as c 
        on t.clm_id = c.clm_id
    ;
quit;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Fri, 11 Apr 2025 19:49:03 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Join-vs-Merge-1-2-TB-with-110-GB-Datasets/m-p/964136#M375505</guid>
      <dc:creator>Stu_SAS</dc:creator>
      <dc:date>2025-04-11T19:49:03Z</dc:date>
    </item>
    <item>
      <title>Re: Join vs Merge 1.2 TB with 110 GB Datasets</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Join-vs-Merge-1-2-TB-with-110-GB-Datasets/m-p/964137#M375506</link>
      <description>&lt;P&gt;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/61362"&gt;@Stu_SAS&lt;/a&gt;, thank you for the response. Three follow-up questions:&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;Can I use the firstobs and obs data set options as I did in regular PROC SQL?&lt;/LI&gt;&lt;LI&gt;Is there a way to write what observation I'm on and the clock time to the log every n observations?&lt;/LI&gt;&lt;LI&gt;Does FedSQL have a hint feature as in Oracle? (Not that Oracle ever "takes the hint.")&lt;/LI&gt;&lt;/OL&gt;</description>
      <pubDate>Fri, 11 Apr 2025 19:55:58 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Join-vs-Merge-1-2-TB-with-110-GB-Datasets/m-p/964137#M375506</guid>
      <dc:creator>kenkaran</dc:creator>
      <dc:date>2025-04-11T19:55:58Z</dc:date>
    </item>
    <item>
      <title>Re: Join vs Merge 1.2 TB with 110 GB Datasets</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Join-vs-Merge-1-2-TB-with-110-GB-Datasets/m-p/964138#M375507</link>
      <description>&lt;OL&gt;
&lt;LI&gt;Unfortunately, you cannot use SAS input dataset options like in PROC SQL. FedSQL follows the ANSI SQL:1999 standard.&lt;/LI&gt;
&lt;LI&gt;FedSQL does support the stimer option, but you cannot have it print information as it runs, as you could in a DATA step.&lt;/LI&gt;
&lt;LI&gt;FedSQL does not have hints, but PROC SQL kind of does with the magic= option; however, FedSQL does have a &lt;A href="https://go.documentation.sas.com/doc/en/pgmsascdc/9.4_3.5/fedsqlref/n1qnwnxtq52mkvn1ky86xv6fms9m.htm" target="_self"&gt;ton of table options you can apply&lt;/A&gt;. Here's &lt;A href="https://go.documentation.sas.com/doc/en/pgmsascdc/9.4_3.5/proc/p0bfvd7cz4cgibn1fhdvo0055blr.htm" target="_self"&gt;how to apply them&lt;/A&gt;.&lt;/LI&gt;
&lt;/OL&gt;
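&lt;P&gt;For illustration only (check the linked doc for the exact options available): FedSQL table options are written in braces after the table name, so a compressed output table might look like this:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;proc fedsql;
    create table saslibrary.outputdataset {options compress=yes} as
        select t.bene_id, t.clm_id
        from saslibrary.lineinputdataset as t
        inner join saslibrary.headerinputdataset as c
        on t.clm_id = c.clm_id
    ;
quit;&lt;/CODE&gt;&lt;/PRE&gt;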
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Fri, 11 Apr 2025 20:25:57 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Join-vs-Merge-1-2-TB-with-110-GB-Datasets/m-p/964138#M375507</guid>
      <dc:creator>Stu_SAS</dc:creator>
      <dc:date>2025-04-11T20:25:57Z</dc:date>
    </item>
    <item>
      <title>Re: Join vs Merge 1.2 TB with 110 GB Datasets</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Join-vs-Merge-1-2-TB-with-110-GB-Datasets/m-p/964139#M375508</link>
      <description>&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/61362"&gt;@Stu_SAS&lt;/a&gt;, thank you for the FedSQL suggestion and the clarifications. I will try this.</description>
      <pubDate>Fri, 11 Apr 2025 20:31:04 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Join-vs-Merge-1-2-TB-with-110-GB-Datasets/m-p/964139#M375508</guid>
      <dc:creator>kenkaran</dc:creator>
      <dc:date>2025-04-11T20:31:04Z</dc:date>
    </item>
    <item>
      <title>Re: Join vs Merge 1.2 TB with 110 GB Datasets</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Join-vs-Merge-1-2-TB-with-110-GB-Datasets/m-p/964147#M375512</link>
      <description>&lt;P&gt;Is either of the datasets compressed? Compressing the 1.2 TB dataset would likely speed up the join as it will improve IO. What proportion of the rows are you selecting out of the large dataset? Do you always select ALL rows from the small dataset for sub-setting the large one? If so, an index may not help, so I suggest you try without an index to see if that improves performance.&lt;/P&gt;
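&lt;P&gt;For illustration, rebuilding a compressed copy is a one-step job (the _c suffix and the choice of COMPRESS=BINARY are just placeholders; test both settings on a slice first):&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;/* BINARY compression often suits long observations that mix numeric
   and character variables; COMPRESS=YES (RLE) favors character data */
data saslibrary.lineinputdataset_c (compress=binary);
  set saslibrary.lineinputdataset;
run;&lt;/CODE&gt;&lt;/PRE&gt;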
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;What type of SAS library are these stored in? V9? An &lt;A href="https://communities.sas.com/t5/SAS-Communities-Library/Unlocking-the-Power-of-Open-File-Formats-Freedom-Flexibility-and/ta-p/962381/jump-to/first-unread-message" target="_blank" rel="noopener"&gt;SPDE library&lt;/A&gt; might improve performance (See table in the link).&lt;/P&gt;</description>
      <pubDate>Fri, 11 Apr 2025 23:20:24 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Join-vs-Merge-1-2-TB-with-110-GB-Datasets/m-p/964147#M375512</guid>
      <dc:creator>SASKiwi</dc:creator>
      <dc:date>2025-04-11T23:20:24Z</dc:date>
    </item>
    <item>
      <title>Re: Join vs Merge 1.2 TB with 110 GB Datasets</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Join-vs-Merge-1-2-TB-with-110-GB-Datasets/m-p/964159#M375519</link>
      <description>Since your smaller dataset has only ONE variable and your memory is so big, I would use a hash table to merge these two tables.</description>
      <pubDate>Sat, 12 Apr 2025 07:18:30 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Join-vs-Merge-1-2-TB-with-110-GB-Datasets/m-p/964159#M375519</guid>
      <dc:creator>Ksharp</dc:creator>
      <dc:date>2025-04-12T07:18:30Z</dc:date>
    </item>
    <item>
      <title>Re: Join vs Merge 1.2 TB with 110 GB Datasets</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Join-vs-Merge-1-2-TB-with-110-GB-Datasets/m-p/964162#M375521</link>
      <description>&lt;BLOCKQUOTE&gt;&lt;HR /&gt;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/18408"&gt;@Ksharp&lt;/a&gt;&amp;nbsp;wrote:&lt;BR /&gt;I would like to use Hash Table to merge these two tables.&lt;/BLOCKQUOTE&gt;
&lt;BLOCKQUOTE&gt;&lt;HR /&gt;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/473735"&gt;@kenkaran&lt;/a&gt;&amp;nbsp;wrote:&lt;BR /&gt;
&lt;UL&gt;
&lt;LI&gt;The smaller "header" dataset is too small to fit in a hash dataset even if I increased the memsize to 115 GB&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;(...)&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Smaller Header Dataset:&lt;/STRONG&gt;&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Dataset size on disk: &amp;nbsp;110 GB&lt;/LI&gt;
&lt;LI&gt;Index size on disk:&amp;nbsp; 126 GB&lt;/LI&gt;
&lt;LI&gt;Obs: 1,849,842,886&lt;/LI&gt;
&lt;LI&gt;Vars: &amp;nbsp;1&lt;/LI&gt;
&lt;LI&gt;Observation Length:&amp;nbsp; 64&lt;/LI&gt;
&lt;/UL&gt;
&lt;HR /&gt;&lt;/BLOCKQUOTE&gt;
&lt;P&gt;It should be possible to use a much smaller key item for the hash object, e.g. &lt;FONT face="courier new,courier"&gt;&lt;A href="https://documentation.sas.com/doc/en/pgmsascdc/9.4_3.5/lefunctionsref/n0mu81ozw0013yn1kzwve5ola26e.htm" target="_blank" rel="noopener"&gt;md5&lt;/A&gt;(clm_id)&lt;/FONT&gt;, which takes only 16 bytes, instead of the 64-byte &lt;FONT face="courier new,courier"&gt;clm_id&lt;/FONT&gt; itself.&amp;nbsp;Or maybe there are obvious redundancies in the structure of the &lt;FONT face="courier new,courier"&gt;clm_id&lt;/FONT&gt; values (such as long strings of zeros or blanks) which could be "compressed" without losing information. Then the 1.8E9 key values will have a chance to fit into memory.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
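&lt;P&gt;For comparison, the indexed alternative discussed in the next paragraph might look roughly like this (a sketch with the question's dataset names; _IORC_ indicates whether the keyed read found a match):&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data match;
  set saslibrary.lineinputdataset;                       /* read each line row */
  set saslibrary.headerinputdataset key=clm_id / unique; /* indexed probe */
  if _iorc_ = 0 then output;  /* index hit: keep the row */
  else _error_ = 0;           /* clear the error flag on misses */
run;&lt;/CODE&gt;&lt;/PRE&gt;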
&lt;P&gt;I'm not sure, though, if the hash lookup &lt;EM&gt;plus&lt;/EM&gt; the operations needed to obtain the smaller keys on both sides of the merge perform better than a DATA step using a second SET statement with KEY=&lt;FONT face="courier new,courier"&gt;clm_id&lt;/FONT&gt; option, which benefits from the index created already. You may want to compare test runs using small subsets of both datasets so that the run times are only a few minutes.&lt;/P&gt;</description>
      <pubDate>Sat, 12 Apr 2025 08:33:02 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Join-vs-Merge-1-2-TB-with-110-GB-Datasets/m-p/964162#M375521</guid>
      <dc:creator>FreelanceReinh</dc:creator>
      <dc:date>2025-04-12T08:33:02Z</dc:date>
    </item>
    <item>
      <title>Re: Join vs Merge 1.2 TB with 110 GB Datasets</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Join-vs-Merge-1-2-TB-with-110-GB-Datasets/m-p/964169#M375527</link>
      <description>Out of curiosity, WHY do you need to actually join these?  I haven't worked with Medicaid but have worked a ton with similarly huge Medicare data. If you're actually performing an analysis, do you actually need all 60-ish variables to do this?  The complete date range?</description>
      <pubDate>Sat, 12 Apr 2025 16:31:51 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Join-vs-Merge-1-2-TB-with-110-GB-Datasets/m-p/964169#M375527</guid>
      <dc:creator>quickbluefish</dc:creator>
      <dc:date>2025-04-12T16:31:51Z</dc:date>
    </item>
    <item>
      <title>Re: Join vs Merge 1.2 TB with 110 GB Datasets</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Join-vs-Merge-1-2-TB-with-110-GB-Datasets/m-p/964171#M375529</link>
      <description>&lt;P&gt;I would not use firstobs/obs to divide the join into subgroup joins, because a given CLM_ID may be in more than one of those subgroup joins.&amp;nbsp; Instead, examine each CLM_ID once, by &lt;EM&gt;&lt;U&gt;&lt;STRONG&gt;choosing a restricted range of CLM_ID in both datasets&lt;/STRONG&gt;&lt;/U&gt;&lt;/EM&gt;, for each subgroup join.&amp;nbsp; This can work because CLM_ID is the join variable.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Let's say you divide your CLM_ID values into 5 ranges, each range with a lower limit (LLLLLLL) and upper limit (UUUUUUU), where LLLLLLL and UUUUUUU are quintile values.&amp;nbsp; Of course, the lowest range doesn't need a specified LLLLLLL, and the highest range doesn't need a specified UUUUUUU.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Then you could run five programs, such as the below - just put in values in place of LLLLLLL and UUUUUUU:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data want1;
  set bigdataset (keep= list of variables);
  where LLLLLLL &amp;lt;= clm_id &amp;lt; UUUUUUU;

  if _n_=1 then do;
    declare hash h (dataset:'header (where=(LLLLLLL &amp;lt;= clm_id &amp;lt; UUUUUUU))');
      h.definekey('clm_id');
      h.definedata('clm_id');
      h.definedone();
  end;

  if h.check()=0;
run;


&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;Notes:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;OL&gt;
&lt;LI&gt;Limiting the range in the hash object this way allows you to avoid requiring more memory than available.&lt;BR /&gt;&lt;BR /&gt;&lt;/LI&gt;
&lt;LI&gt;Using the "where" statement after the SET outsources the filtering of the big data set to the data set engine, saves a lot of resources.&lt;BR /&gt;&lt;BR /&gt;&lt;/LI&gt;
&lt;LI&gt;Then it's just a matter of seeing if the filtered CLM_ID from the big data set is also found in the hash object.&lt;BR /&gt;&lt;BR /&gt;&lt;/LI&gt;
&lt;LI&gt;I've coded the above ("&amp;lt;="&amp;nbsp;&lt;SPAN&gt;&amp;nbsp;for lower limit, and "&amp;lt;"&lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;for upper limit) to avoid double inclusion of the quintile values.&amp;nbsp; So either drop the upper limit for the highest range, or change "&amp;lt;" to "&amp;lt;=".&lt;BR /&gt;&lt;BR /&gt;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;/OL&gt;
&lt;P&gt;Of course, this requires generating the quintile CLM_ID values.&amp;nbsp; You could do something like this to find the quintiles:&amp;nbsp;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;proc sort data=header out=header_sorted  nodup;
  by clm_id;
run;

data limits (drop=CLM_ID);
  set header_sorted nobs=nclm;

  retain quintile 1;

  /* the very first key value is the lower limit of the first range */
  if _n_=1 then LLLLLLL=clm_id;
  retain LLLLLLL;

  /* keep only the observations at the quintile positions */
  if _N_ = ceil(nclm*(quintile/5));

  UUUUUUU=clm_id;
  output;
  quintile+1;
  LLLLLLL=UUUUUUU;  /* this range's upper limit becomes the next range's lower limit */
run;
proc print;
run;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Tue, 06 May 2025 03:42:54 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Join-vs-Merge-1-2-TB-with-110-GB-Datasets/m-p/964171#M375529</guid>
      <dc:creator>mkeintz</dc:creator>
      <dc:date>2025-05-06T03:42:54Z</dc:date>
    </item>
    <item>
      <title>Re: Join vs Merge 1.2 TB with 110 GB Datasets</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Join-vs-Merge-1-2-TB-with-110-GB-Datasets/m-p/964192#M375532</link>
      <description>Maybe a bitmap can help; see Paul M. Dorfman’s classic article and give it a try: &lt;A href="https://support.sas.com/resources/papers/proceedings/proceedings/sugi26/p008-26.pdf" target="_blank"&gt;https://support.sas.com/resources/papers/proceedings/proceedings/sugi26/p008-26.pdf&lt;/A&gt;</description>
      <pubDate>Mon, 14 Apr 2025 02:39:39 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Join-vs-Merge-1-2-TB-with-110-GB-Datasets/m-p/964192#M375532</guid>
      <dc:creator>whymath</dc:creator>
      <dc:date>2025-04-14T02:39:39Z</dc:date>
    </item>
    <item>
      <title>Re: Join vs Merge 1.2 TB with 110 GB Datasets</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Join-vs-Merge-1-2-TB-with-110-GB-Datasets/m-p/964277#M375557</link>
      <description>&lt;P&gt;FedSQL is computationally multi-threaded, but in base SAS, it uses a single read-write thread. In the situation described, the process is most likely I/O bound, not CPU bound. So I don't think FedSQL (or threaded DS2) would help in this situation.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Mon, 14 Apr 2025 23:59:50 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Join-vs-Merge-1-2-TB-with-110-GB-Datasets/m-p/964277#M375557</guid>
      <dc:creator>SASJedi</dc:creator>
      <dc:date>2025-04-14T23:59:50Z</dc:date>
    </item>
    <item>
      <title>Re: Join vs Merge 1.2 TB with 110 GB Datasets</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Join-vs-Merge-1-2-TB-with-110-GB-Datasets/m-p/964283#M375559</link>
      <description>&lt;P&gt;You certainly want to avoid sorting your big dataset, so a hash table lookup feels like a good option.&lt;/P&gt;
&lt;P&gt;Given that the length of your key variable is 64, I assume it's already a hex digest created using SHA-256.&lt;/P&gt;
&lt;P&gt;You can't fit all the keys of your header table into memory, so, like &lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/32733"&gt;@FreelanceReinh&lt;/a&gt;, I've been thinking about how to reduce the size of your key values so they can fit. Converting your key values to an MD5 binary string should reduce memory requirements to what's available to you. BUT using MD5 instead of SHA-256 will increase the collision risk, which with your data volume isn't negligible. IF that still-small risk of selecting a key that's not in your list is acceptable, then using MD5 as in the sample code below should be an option.&lt;/P&gt;
&lt;P&gt;I would also use the SPDE engine for storing such a huge SAS table.&lt;/P&gt;
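&lt;P&gt;For instance, a hypothetical SPDE library definition (paths made up) that the step below could write to:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;libname spde_saslibrary spde 'D:\spde\meta'
        datapath=('D:\spde\data1' 'E:\spde\data2')  /* spread I/O across disks */
        partsize=50g compress=yes;&lt;/CODE&gt;&lt;/PRE&gt;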
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data spde_saslibrary.want(compress=yes);
  if _n_=1 then
    do;
      length _key $16;
      dcl hash h1();
      h1.defineKey('_key');
      h1.defineDone();
      do until(_done);
        set saslibrary.headerinputdataset end=_done;
        _key=md5(clm_id);
        _rc=h1.ref();
      end;
    end;
  set saslibrary.lineinputdataset;
  _key=md5(clm_id);
  if h1.check()=0 then output;
  drop _:;
run;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;...and I seem to remember that at one point there was an issue with the hash object where, if you didn't define a data item, the key variables also got used as data, doubling the required memory. If I remember right and that's still an issue with your SAS version, then load a placeholder data variable so this doesn't happen.&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data spde_saslibrary.want(compress=yes);
  if _n_=1 then
    do;
      length _key $16;
      retain _placeholder ' ';
      dcl hash h1();
      h1.defineKey('_key');
      h1.defineData('_placeholder');
      h1.defineDone();
      do until(_done);
        set saslibrary.headerinputdataset end=_done;
        _key=md5(clm_id);
        _rc=h1.ref();
      end;
    end;
  set saslibrary.lineinputdataset;
  _key=md5(clm_id);
  if h1.check()=0 then output;
  drop _:;
run;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;Now... If your key variable clm_id contains a 64-character hex string, then that's a base16 value. Another way of shortening the string without increasing the collision risk could be to convert this base16 value to a base32 value.&lt;BR /&gt;I'm not sure how much processing time such a conversion would add, but it's certainly worth giving it a shot - if you can make it work. The approaches I've seen always first convert the values to base10 and need to do summations. The problem with SAS is that a SHA-256 value doesn't fit as a full-precision integer into a SAS numerical variable. One could do it using something like Python, which supports such large integers, or else find another approach that doesn't require an intermediary numerical variable.&lt;BR /&gt;&lt;BR /&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Tue, 15 Apr 2025 03:27:43 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Join-vs-Merge-1-2-TB-with-110-GB-Datasets/m-p/964283#M375559</guid>
      <dc:creator>Patrick</dc:creator>
      <dc:date>2025-04-15T03:27:43Z</dc:date>
    </item>
    <item>
      <title>Re: Join vs Merge 1.2 TB with 110 GB Datasets</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Join-vs-Merge-1-2-TB-with-110-GB-Datasets/m-p/964320#M375571</link>
      <description>&lt;BLOCKQUOTE&gt;&lt;HR /&gt;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/12447"&gt;@Patrick&lt;/a&gt;&amp;nbsp;wrote:&lt;BR /&gt;
&lt;P&gt;BUT using md5 instead of sha-256 will increase the collision risk which with your data volume isn't negligeable.&lt;/P&gt;
&lt;HR /&gt;&lt;/BLOCKQUOTE&gt;
&lt;P&gt;According to simple approximation formulas (assuming the MD5 digests are uniformly distributed random strings, which might be too optimistic, but I'm not sure), the collision probability for&amp;nbsp;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/473735"&gt;@kenkaran&lt;/a&gt;'s 1.8E9 keys should be approx. 5E-21, i.e., &lt;EM&gt;extremely&lt;/EM&gt; small (see, e.g., &lt;A href="https://towardsdatascience.com/collision-risk-in-hash-based-surrogate-keys-4c87b716cbcd/" target="_blank" rel="noopener"&gt;https://towardsdatascience.com/collision-risk-in-hash-based-surrogate-keys-4c87b716cbcd/&lt;/A&gt;).&lt;/P&gt;
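&lt;P&gt;(Spelled out, that's the standard birthday approximation for a 128-bit digest: p ≈ n^2 / 2^129 = (1.85E9)^2 / 2^129 ≈ 3.4E18 / 6.8E38 ≈ 5E-21.)&lt;/P&gt;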
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;BLOCKQUOTE&gt;&lt;HR /&gt;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/12447"&gt;@Patrick&lt;/a&gt;&amp;nbsp;wrote:&lt;BR /&gt;
&lt;P&gt;I believe to remember that at one point there was an issue with the hash that when one didn't define Data the key variables got used as Data doubling the required memory.&lt;/P&gt;
&lt;HR /&gt;&lt;/BLOCKQUOTE&gt;
&lt;P&gt;Adding the 1-byte dummy data item won't hurt. Tests on my Windows SAS 9.4M5 suggest, however, that for keys with length 16 there is no decrease in &lt;A href="https://documentation.sas.com/doc/en/pgmsascdc/9.4_3.5/lecompobjref/p195co8u1s7a91n1xv1get0544t3.htm" target="_blank" rel="noopener"&gt;item_size&lt;/A&gt; by doing so. (The benefit starts at length 17.)&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;BLOCKQUOTE&gt;&lt;HR /&gt;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/12447"&gt;@Patrick&lt;/a&gt;&amp;nbsp;wrote:&lt;BR /&gt;
&lt;P&gt;Now... If your key variable&amp;nbsp;clm_id contains a 64 character hex string then that's a base16 value. Another way for shortening the string without increasing the collision risk could be to convert this base16 value to a base32 value.&lt;/P&gt;
&lt;HR /&gt;&lt;/BLOCKQUOTE&gt;
&lt;P&gt;I would rather favor base 256, as it is both simpler to obtain -- the &lt;A href="https://documentation.sas.com/doc/en/pgmsascdc/9.4_3.5/leforinforref/p1bol4tujnx1w7n1g2jg6pimf0hk.htm" target="_blank" rel="noopener"&gt;$HEX64. informat&lt;/A&gt; does the conversion -- and more effective: The string length is halved to 32 bytes (as opposed to 52 bytes with base 32).&lt;/P&gt;</description>
      <pubDate>Tue, 15 Apr 2025 10:07:21 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Join-vs-Merge-1-2-TB-with-110-GB-Datasets/m-p/964320#M375571</guid>
      <dc:creator>FreelanceReinh</dc:creator>
      <dc:date>2025-04-15T10:07:21Z</dc:date>
    </item>
    <item>
      <title>Re: Join vs Merge 1.2 TB with 110 GB Datasets</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Join-vs-Merge-1-2-TB-with-110-GB-Datasets/m-p/964330#M375574</link>
      <description>&lt;P&gt;Will half of the header table fit into memory?&amp;nbsp; A third? I'd first try to chunk the header table and run a hash lookup for each chunk. Plus, it might give you a decent idea as to how much memory would be needed for the whole header table.&lt;/P&gt;</description>
      <pubDate>Tue, 15 Apr 2025 13:25:17 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Join-vs-Merge-1-2-TB-with-110-GB-Datasets/m-p/964330#M375574</guid>
      <dc:creator>DerylHollick</dc:creator>
      <dc:date>2025-04-15T13:25:17Z</dc:date>
    </item>
    <item>
      <title>Re: Join vs Merge 1.2 TB with 110 GB Datasets</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Join-vs-Merge-1-2-TB-with-110-GB-Datasets/m-p/964379#M375590</link>
      <description>&lt;BLOCKQUOTE&gt;&lt;HR /&gt;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/3174"&gt;@DerylHollick&lt;/a&gt;&amp;nbsp;wrote:&lt;BR /&gt;
&lt;P&gt;Will half of the header table fit into memory?&amp;nbsp; A third? I'd first try to chunk the header table and run a hash lookup for each chunk. Plus, it might give you a decent idea as to how much memory would be needed for the whole header table.&lt;/P&gt;
&lt;HR /&gt;&lt;/BLOCKQUOTE&gt;
&lt;P&gt;This is what my suggested code does.&amp;nbsp; But instead of just loading any old half of the header (or in my example a fifth), I proposed selecting a half (or a fifth) for a given range of join variable values - on both datasets.&amp;nbsp; This makes the number of needed join comparisons about one fourth (in the case of halves) or one 25th (for fifths) for each subgroup join - a very effective reduction especially when the process "outsources" the subgroup selection to the data engine.&amp;nbsp; &amp;nbsp;True, one has to process the big dataset for each subgroup join, but that burden is significantly reduced by outsourcing the subgroup filtering to the data engine, via the WHERE options.&lt;/P&gt;</description>
      <pubDate>Tue, 15 Apr 2025 22:46:17 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Join-vs-Merge-1-2-TB-with-110-GB-Datasets/m-p/964379#M375590</guid>
      <dc:creator>mkeintz</dc:creator>
      <dc:date>2025-04-15T22:46:17Z</dc:date>
    </item>
    <item>
      <title>Re: Join vs Merge 1.2 TB with 110 GB Datasets</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Join-vs-Merge-1-2-TB-with-110-GB-Datasets/m-p/964754#M375697</link>
      <description>&lt;P&gt;I am using two data sets (small and big) to illustrate one method of solving your problem.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data small;
input id $8.;
datalines;
Rama
Seetha
Sras
Gopal
John
;
run;



data big;
input id $8. amount;
datalines;
Seetha   100
Rama     200
Gopal    500
Krishna  300
John     400
Anbu     500
Kachi    500
Lakshi   600
asdfgh   700
ordsfg   600
pqwers   600
kasert   700
lasert   800
Anbu     100
Rama     100
Gopal    400
;
run;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;I am keeping only essential variables for sorting.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;STRONG class="fc6omth regular_f1jfh96c f18ev72d" aria-hidden="true"&gt;I am holding ID from&lt;/STRONG&gt;&lt;SPAN class="fc6omth regular_f1jfh96c f18ev72d"&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;the BIG data set&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;SPAN class="fc6omth regular_f1jfh96c f18ev72d"&gt;and add rowid(_N_)&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;This new data set is small enough to sort.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;proc sort data = small;
by id;
run;

data tempbig;
   set big(keep = id);
   RID = _N_;
run;

proc sort data = tempbig;
by id;
run;
&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;Next I merge the 2 data sets, SMALL and TEMPBIG.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data temp;
   merge small(in = a) tempbig(in = b);
   by id;
   if a and b;
run;
&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;Next, using the POINT= option of the SET statement, the matched records from BIG are output.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data want;
   do i = 1 by 1 until(eof);
      set temp(keep = rid)  end = eof;
      p = rid;
      set big point = p;
      output;
   end;
   stop;
   drop i rid;
run;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;The output data set is:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;Obs  id      amount
  1  Gopal      500
  2  Gopal      400
  3  John       400
  4  Rama       200
  5  Rama       100
  6  Seetha     100&lt;/PRE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Mon, 21 Apr 2025 15:16:17 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Join-vs-Merge-1-2-TB-with-110-GB-Datasets/m-p/964754#M375697</guid>
      <dc:creator>KachiM</dc:creator>
      <dc:date>2025-04-21T15:16:17Z</dc:date>
    </item>
    <item>
      <title>Re: Join vs Merge 1.2 TB with 110 GB Datasets</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Join-vs-Merge-1-2-TB-with-110-GB-Datasets/m-p/964764#M375700</link>
      <description>&lt;P&gt;Let T = duration for flat read of detail table D&lt;/P&gt;
&lt;P&gt;Let K = number of header keys that _can_ fit in a hash table.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Do ceil(1,849,842,886 / K) data step reads through D with hash lookup selection.&lt;/P&gt;
&lt;P&gt;Append the selections of each run-through.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Guessing conservatively, and presuming 70 GB set aside for a session, there should be enough memory for a hash table that holds 500M 64-byte keys. So maybe 6 reads through D. Make that a worst case of 10 read-throughs, so 10 * (T + Tout) (writing key-matched records).&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;If this needs to be done more than one time you might want to also code a solution in Proc DS2 that uses THREADs and compare resource and time consumptions to DATA step.&lt;/P&gt;</description>
      <pubDate>Mon, 21 Apr 2025 16:08:18 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Join-vs-Merge-1-2-TB-with-110-GB-Datasets/m-p/964764#M375700</guid>
      <dc:creator>RichardAD</dc:creator>
      <dc:date>2025-04-21T16:08:18Z</dc:date>
    </item>
    <item>
      <title>Re: Join vs Merge 1.2 TB with 110 GB Datasets</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Join-vs-Merge-1-2-TB-with-110-GB-Datasets/m-p/964797#M375706</link>
      <description>&lt;BLOCKQUOTE&gt;&lt;HR /&gt;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/300781"&gt;@RichardAD&lt;/a&gt;&amp;nbsp;wrote:&lt;BR /&gt;
&lt;P&gt;Let T = duration for flat read of detail table D&lt;/P&gt;
&lt;P&gt;Let K = number of header keys that _can_ fit in a hash table.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Do ceil(1,849,842,886 / K) data step reads through D with hash lookup selection.&lt;/P&gt;
&lt;P&gt;Append the selections of each run-through.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Guessing conservatively, and presuming 70 GB set aside for a session, there should be enough memory for a hash table that holds 500M 64-byte keys. So maybe 6 reads through D. Make that a worst case of 10 read-throughs, so 10 * (T + Tout) (writing key-matched records).&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;If this needs to be done more than one time you might want to also code a solution in Proc DS2 that uses THREADs and compare resource and time consumptions to DATA step.&lt;/P&gt;
&lt;HR /&gt;&lt;/BLOCKQUOTE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;You don't need that much input/output activity.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;If, after examining the header dataset in comparison to your available memory for the hash object, you find that you have to do 10 subgroup joins, you don't have to generate input/output totaling 10*(T + Tout). It can be reduced to 3*T + 1*Tout. Divide the header into 10 subgroups based on the value of the join variable. Then, given you have the disk space, also divide the detail dataset into 10 smaller datasets using the same join variable. That can be done in one DATA step totaling 2*T of input/output. Then each of the 10 subgroup joins will need only 0.1*(T + ~Tout). You can save even more by creating the detail subgroups containing only the variables of interest.&lt;/P&gt;</description>
      <pubDate>Tue, 22 Apr 2025 00:50:13 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Join-vs-Merge-1-2-TB-with-110-GB-Datasets/m-p/964797#M375706</guid>
      <dc:creator>mkeintz</dc:creator>
      <dc:date>2025-04-22T00:50:13Z</dc:date>
    </item>
  </channel>
</rss>

