Problem:
I have two datasets: the first is the detail ("line") records of a very large dataset (1.2 TB), and the second is the row IDs from a smaller "header" dataset (110 GB). The relation between line and header is many-to-one. I am trying to select the obs in the line dataset that have a match in the header. The header dataset contains only the key variable.
What I've done so far:
Result:
I started the script 8 days ago, and my best guess from looking at the size of the output .lck file in Windows File Explorer is that it is only about one-tenth of the way through.
The help I need:
What would I need to do to process this dataset in a reasonable amount of time -- a couple of days? Should I try to break the line input dataset into chunks, sort and interleave by clm_id, and then try a DATA step merge? If I were to request more memory and processors for this virtual machine, how much would I need?
SAS Versions:
Smaller Header Dataset:
Query:
proc sql stimer ;
create table saslibrary.outputdataset as
select t.bene_id, t.clm_id, <26 other variables>
from
saslibrary.lineinputdataset (firstobs=4859048953 obs=5128996116) as t
inner join saslibrary.headerinputdataset as c on (t.clm_id = c.clm_id)
;
quit;
OS:
For those of you familiar with Medicaid data this is the TAF data from CMS/MACBIS.
Thank you for reading.
I'd recommend using a DATA step MERGE. You've already sorted and indexed the smaller dataset to conform to the larger one's sort order, so you're ready to roll. I'd expect that to process much faster than the SQL join.
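A minimal sketch of that MERGE, assuming both datasets are sorted by clm_id (names taken from your SQL; the IN= flags do the inner-join filtering):
data saslibrary.outputdataset;
   merge saslibrary.lineinputdataset   (in=in_line keep=bene_id clm_id <26 other variables>)
         saslibrary.headerinputdataset (in=in_head);
   by clm_id;
   if in_line and in_head;   /* inner join: keep only line rows whose clm_id has a header match */
run;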
You can also try using FedSQL. FedSQL is multi-threaded where possible.
proc fedsql;
create table saslibrary.outputdataset as
select t.bene_id, t.clm_id, <26 other variables>
from saslibrary.lineinputdataset as t
inner join
saslibrary.headerinputdataset as c
on t.clm_id = c.clm_id
;
quit;
@Stu_SAS, thank you for the response. Three follow-up questions:
- Are either of the datasets compressed? Compressing the 1.2 TB dataset would likely speed up the join because it reduces I/O (a sketch of the COMPRESS= option follows below).
- What proportion of the rows are you selecting out of the large dataset?
- Do you always select ALL rows from the small dataset when subsetting the large one? If so, an index may not help, so I suggest you try without an index to see whether that improves performance.
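If they aren't compressed, here is a minimal sketch of rewriting the detail dataset with the COMPRESS= dataset option (the _c name is just a placeholder; test on a small slice first, since rewriting 1.2 TB is itself expensive):
/* COMPRESS=BINARY (RDC) usually suits long, mostly numeric detail records;
   COMPRESS=YES (RLE) tends to suit character-heavy data. */
data saslibrary.lineinputdataset_c (compress=binary);
   set saslibrary.lineinputdataset;
run;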
What type of SAS library are these stored in? V9? An SPDE library might improve performance (See table in the link).
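For what it's worth, a minimal sketch of such an SPDE library definition -- the libref and all paths and sizes are placeholders for your environment:
/* Spreading DATAPATH= across separate physical drives lets SPDE use parallel I/O threads. */
libname spdelib spde 'D:\spde\meta'
        datapath=('E:\spde\data1' 'F:\spde\data2')
        indexpath=('G:\spde\index')
        partsize=50g;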
FedSQL is computationally multi-threaded, but in base SAS, it uses a single read-write thread. In the situation described, the process is most likely I/O bound, not CPU bound. So I don't think FedSQL (or threaded DS2) would help in this situation.
@Ksharp wrote:
I would use a hash table to merge these two tables.
@kenkaran wrote:
- The smaller "header" dataset is too large to fit into a hash object even if I increased MEMSIZE to 115 GB
(...)
Smaller Header Dataset:
- Dataset size on disk: 110 GB
- Index size on disk: 126 GB
- Obs: 1,849,842,886
- Vars: 1
- Observation Length: 64
It should be possible to use a much smaller key item for the hash object, e.g. md5(clm_id), which takes only 16 bytes, instead of the 64-byte clm_id itself. Or maybe there are obvious redundancies in the structure of the clm_id values (such as long strings of zeros or blanks) which could be "compressed" without losing information. Then the 1.8E9 key values will have a chance to fit into memory.
I'm not sure, though, if the hash lookup plus the operations needed to obtain the smaller keys on both sides of the merge perform better than a DATA step using a second SET statement with KEY=clm_id option, which benefits from the index created already. You may want to compare test runs using small subsets of both datasets so that the run times are only a few minutes.
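For reference, a minimal sketch of the indexed-lookup variant, assuming the index on clm_id you've already created on the header dataset (the /UNIQUE option matters because the line data has many rows per clm_id):
data saslibrary.outputdataset;
   set saslibrary.lineinputdataset (keep=bene_id clm_id <26 other variables>);
   set saslibrary.headerinputdataset key=clm_id / unique;
   if _iorc_ = 0 then output;   /* clm_id found in the indexed header dataset */
   else _error_ = 0;            /* reset the error flag so unmatched obs are not dumped to the log */
run;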
I would not use firstobs/obs to divide the join into subgroup joins, because a given CLM_ID may be in more than one of those subgroup joins. Instead, examine each CLM_ID once, by choosing a restricted range of CLM_ID in both datasets, for each subgroup join. This can work because CLM_ID is the join variable.
Let's say you divide your CLM_ID values into 5 ranges, each range with a lower limit (LLLLLLL) and upper limit (UUUUUUU), where LLLLLLL and UUUUUUU are quintile values. Of course, the lowest range doesn't need a specified LLLLLLL, and the highest range doesn't need a specified UUUUUUU.
Then you could run five programs, such as the one below - just put in values in place of LLLLLLL and UUUUUUU:
data want1;
   set bigdataset (keep=clm_id <other needed variables>);
   where LLLLLLL <= clm_id < UUUUUUU;
   if _n_=1 then do;
      /* Load only the header keys for this CLM_ID range into the hash.
         If clm_id is character, quote the LLLLLLL and UUUUUUU values. */
      declare hash h (dataset:'header (where=(LLLLLLL <= clm_id < UUUUUUU))');
      h.definekey('clm_id');
      h.definedata('clm_id');
      h.definedone();
   end;
   if h.check()=0;   /* keep only line obs whose clm_id is found in the header range */
run;
Notes:
Of course, this requires generating the quintile CLM_ID values. You could do something like this to find the quintiles:
proc sort data=header out=header_sorted nodup;
   by clm_id;
run;

data limits (drop=CLM_ID);
   set header_sorted nobs=nclm;
   retain quintile 1;
   retain LLLLLLL;
   if _n_=1 then LLLLLLL=clm_id;         /* lower limit of the first range */
   if _N_ = ceil(nclm*(quintile/5));     /* keep only the last key of each quintile */
   UUUUUUU=clm_id;                       /* upper limit of the current range */
   output;
   quintile+1;
   LLLLLLL=UUUUUUU;                      /* the next range starts where this one ends */
run;

proc print data=limits;
run;
You certainly want to avoid sorting your big dataset, so a hash table lookup feels like a good option.
Given that the length of your key variable is 64, I assume it's already a hex digest created with SHA-256.
You can't fit all the keys of your header table into memory, so like @FreelanceReinh I've been thinking about how to reduce the size of your key values so that they can fit. Converting your key values to a 16-byte MD5 binary string should reduce the memory requirement to what's available to you. BUT using MD5 instead of SHA-256 increases the collision risk, which with your data volume isn't negligible. IF that still-small risk of matching a key that's not actually in your list is acceptable, then using MD5 as in the sample code below should be an option.
I would also use the SPDE engine for storing such a huge SAS table.
data spde_saslibrary.want(compress=yes);

   /* Load the MD5-shortened header keys into a hash object on the first iteration */
   if _n_=1 then
      do;
         length _key $16;
         dcl hash h1();
         h1.defineKey('_key');
         h1.defineDone();
         do until(_done);
            set saslibrary.headerinputdataset end=_done;
            _key=md5(clm_id);    /* 16-byte binary MD5 digest of the 64-byte clm_id */
            _rc=h1.ref();        /* adds the key only if it is not already in the hash */
         end;
      end;

   /* Keep only line observations whose MD5 key is found in the hash */
   set saslibrary.lineinputdataset;
   _key=md5(clm_id);
   if h1.check()=0 then output;

   drop _:;
run;
...and I seem to remember that at one point there was an issue with the hash object: if you didn't define a data item, the key variables also got stored as data, doubling the required memory. If I remember correctly, and that's still an issue with your SAS version, then load a small placeholder data variable to prevent this, as in the variant below.
data spde_saslibrary.want(compress=yes);

   if _n_=1 then
      do;
         length _key $16;
         retain _placeholder ' ';
         dcl hash h1();
         h1.defineKey('_key');
         h1.defineData('_placeholder');   /* 1-byte dummy data item so the key is not also stored as data */
         h1.defineDone();
         do until(_done);
            set saslibrary.headerinputdataset end=_done;
            _key=md5(clm_id);
            _rc=h1.ref();
         end;
      end;

   set saslibrary.lineinputdataset;
   _key=md5(clm_id);
   if h1.check()=0 then output;

   drop _:;
run;
Now... if your key variable clm_id contains a 64-character hex string, then that's a base-16 value. Another way to shorten the string without increasing the collision risk could be to convert this base-16 value to a base-32 value.
I'm not sure how much processing time such a conversion would add, but it's certainly worth giving it a shot - if you can make it work. The approaches I've seen always first convert the values to base 10 and need to do summations. The problem with SAS is that a SHA-256 value doesn't fit as a full-precision integer into a SAS numeric variable. One could do it with something like Python, which supports such large integers, or else find another approach that doesn't require an intermediary numeric variable.
@Patrick wrote:
BUT using MD5 instead of SHA-256 increases the collision risk, which with your data volume isn't negligible.
According to simple approximation formulas (assuming the MD5 digests are uniformly distributed random strings, which might be too optimistic, but I'm not sure), the collision probability for @kenkaran's 1.8E9 keys should be approx. 5E-21, i.e., extremely small (see, e.g., https://towardsdatascience.com/collision-risk-in-hash-based-surrogate-keys-4c87b716cbcd/).
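For anyone who wants to reproduce that figure: it follows from the usual birthday-bound approximation p ≈ n**2/2**(b+1) for n keys and a b-bit digest (under the same uniformity assumption as above):
data _null_;
   n = 1849842886;           /* number of distinct clm_id values */
   b = 128;                  /* MD5 digest length in bits */
   p = n**2 / 2**(b+1);      /* birthday-bound collision approximation */
   put p= e10.;              /* roughly 5E-21 */
run;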
@Patrick wrote:
I seem to remember that at one point there was an issue with the hash object: if you didn't define a data item, the key variables also got stored as data, doubling the required memory.
Adding the 1-byte dummy data item won't hurt. Tests on my Windows SAS 9.4M5 suggest, however, that for keys with length 16 there is no decrease in item_size by doing so. (The benefit starts at length 17.)
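If you want to check this on your own release, here is a small sketch using the hash object's ITEM_SIZE attribute (variable names are arbitrary; an empty hash suffices because the item size depends only on the key/data definition):
data _null_;
   length _key $16 _placeholder $1;
   call missing(_key, _placeholder);

   dcl hash h1();                      /* key only */
   h1.definekey('_key');
   h1.definedone();

   dcl hash h2();                      /* key plus 1-byte dummy data item */
   h2.definekey('_key');
   h2.definedata('_placeholder');
   h2.definedone();

   size1 = h1.item_size;
   size2 = h2.item_size;
   put size1= size2=;
run;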
@Patrick wrote:
Now... if your key variable clm_id contains a 64-character hex string, then that's a base-16 value. Another way to shorten the string without increasing the collision risk could be to convert this base-16 value to a base-32 value.
I would rather favor base 256, as it is both simpler to obtain -- the $HEX64. informat does the conversion -- and more effective: The string length is halved to 32 bytes (as opposed to 52 bytes with base 32).
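A minimal sketch of that conversion (the dummy hex value and the _key name are just for illustration):
data _null_;
   length clm_id $64 _key $32;
   clm_id = repeat('0123456789ABCDEF', 3);   /* dummy 64-character hex string */
   _key   = input(clm_id, $hex64.);          /* 64 hex characters -> 32 raw bytes */
   put _key= $hex64.;                        /* show the packed value as hex again */
run;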