topic Re: hash joins with multiple fact tables in SAS Programming

hash joins with multiple fact tables

RyanMcGovern — Thu, 22 Jun 2017 14:41:21 GMT

Hi everyone!

I am looking to improve some processing and am looking to see if there is a way to either:

Keep the hash tables in memory after the data step is closed (so that the hash tables do not have to be reloaded into memory for the next iteration)
Match multiple tables back to the same lookup tables in one data step while avoiding butchering the variables in the output datasets.

Here is my situation. I am currently matching dozens of large files (100m+ rows >800 columns) against the same 2 lookup tables. Lookup table 1 has 20m+rows 5 columns and lookup table 2 has 2m+rows 4 columns.

Option 1 I think is promising… if there is a way to restrict sas from purging the hash objects it should save time loading them when I go to lookup values for the next dataset.

I made an example of both options, if you run it you will see what happens to the variables for option 2. Option one I don’t know how to restrict SAS from purging the hash objects.

I appreciate your comments!

-Ryan

Re: hash joins with multiple fact tables

RW9 — Thu, 22 Jun 2017 16:42:35 GMT

Sorry, maybe I am missing something here, but in the example you give, why not merge on the two additional variables or do a dataset update?

proc sort data=sashelp.class out=class (keep=name age height);
  by name;
run;
proc sort data=sashelp.classfit out=classfit (keep=name lowermean uppermean);
  by name;
run;
data want; 
  merge class (in=a) classfit (in=b);
  by name;
  if a;
run;

/* Or update - not sure what would be quiicker and not tested this code */
data want;
  update class classfit;
  by name;
run;

Re: hash joins with multiple fact tables

RyanMcGovern — Thu, 22 Jun 2017 17:21:48 GMT

Thanks for the response, in real life the lookup tables are independent, they each share a common key with the fact table but do not share one with each other. Because of this I cannot combine both crosswalks and then read this all into one hash object.

Thanks RW9!

here is what it looks like

again i have dozens of different fact tables that i need to hit against the same lookup tables.

Re: hash joins with multiple fact tables

snoopy369 — Fri, 23 Jun 2017 14:10:15 GMT

The answer to this depends on the exact details.

In particular, there are probably two or three reasonable options here - depending on your system, the datasets, etc.

If the many datasets with 800+ variables are either:

a) Identical in metadata [variable names, types, lengths]

b) Have a combination of identical variable names/types/lengths and non-overlapping variable names

Then there is a simple solution, where you have one data step and run each dataset through a loop.

---

If your datasets are nonidentical and have overlapping variable names that are not identically typed, then you can't easily do that. You might be able to if you use some interesting renaming; I've done that, where I renamed every variable (through a macro) to DS_<var>, where DS was a prefix for that dataset or even the name of the dataset if sufficiently short. It's just ... messy, and error-prone.

You can use DOSUBL to generate data steps inside your main data step to do the matching work, and allow persistent hash tables, but that's quite slow and probably not faster than just reloading the data sets into hash tables.

---

Third, you could consider formats instead of hash tables. Formats are persistent and quite fast to lookup, and if you're only writing a couple of them per dataset, not slow to load either. They do sometimes have performance issues with high numbers of formats, but in a server environment you might be okay there.

---

Fourth, another probably-slow-but-maybe-worth-checking option is PROC DS2. That would let you have persistent hash tables, I believe, and process each of your datasets. DS2 tends to be slow, though, on simple data reads.

If you can give us a bit more information as to your problems' full scope, we can probably help you find the specific solution that's most appropriate.

Re: hash joins with multiple fact tables

RyanMcGovern — Fri, 23 Jun 2017 16:02:10 GMT

The datasets are not identical in metadata and have a few overlapping named variables. I worry about stacking the datasets because of this. I wouldn’t want to accidentally truncate values in instances where the lengths and things disagreed but the variable names were the same.

Because of this I like your approach to rename the variables to something new then rename them back post merge, this could work but I would also have to drop and variables that were in one dataset but not in the appended dataset. This is shown in the second example of the code that I had originally attached. This would also be easier if I used your renaming method and just made each variable have a prefix of the variable name. That would simplify my keep= statements.

I might attempt using multiple formats to see how that works out, I’m not a fan of needing to concatenate the multiple primary keys to make one for the index and then subsequently doing the same on the base dataset. That seems like a lot of reads to me.

At this point I think I am going to have to just sort/sort merge my lookup tables and split up the larger files prior to merging them. That or revisit my settings on my UNIX server.

Right now my config file has

Memsize 8g/Sortsize 1g/BUFNO =20/BUFSIZE=512k

I will need to see if I can up some of these to get the tables to load into hash faster. It is a virtual machine so potentially I could have the server admin allocate more memory for this task, or move the job to a higher powered server.

Thanks!

Re: hash joins with multiple fact tables

Peter_C — Fri, 23 Jun 2017 16:59:24 GMT

Try overloading the formats to make a single lookup return the multple columns by parsing the returned value on a suitable delimiter.
Then you need just one format lookup for each dimension.
However the problem of loading the format into memory is still needed in each step (I expect).