06-22-2017 10:41 AM
I am looking to improve some processing and want to see whether either of the two options below is feasible.
Here is my situation. I am currently matching dozens of large files (100M+ rows, 800+ columns) against the same two lookup tables. Lookup table 1 has 20M+ rows and 5 columns; lookup table 2 has 2M+ rows and 4 columns.
Option 1 seems promising: if there is a way to prevent SAS from purging the hash objects at the end of the data step, it would save the time spent reloading them when I go to look up values for the next dataset.
I made an example of both options; if you run it you will see what happens to the variables in option 2. For option 1, I don't know how to prevent SAS from purging the hash objects.
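For context, a typical matching step looks like the sketch below (fact1, lkp1, lkp2, and the key/value variable names are placeholders, not my real names). The hash objects are destroyed when the data step ends, so they get reloaded from disk for every fact table; that reload is the cost I am trying to avoid.

```sas
data want1;
    if _n_ = 1 then do;
        /* both hashes are rebuilt from disk in every data step */
        declare hash h1(dataset: "lkp1");
        h1.defineKey("key1");
        h1.defineData("val1");
        h1.defineDone();
        declare hash h2(dataset: "lkp2");
        h2.defineKey("key2");
        h2.defineData("val2");
        h2.defineDone();
    end;
    set fact1;
    /* clear the retained data variables when a key is not found */
    if h1.find() ne 0 then call missing(val1);
    if h2.find() ne 0 then call missing(val2);
run;
```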
I appreciate your comments!
06-22-2017 12:42 PM
Sorry, maybe I am missing something here, but in the example you give, why not merge on the two additional variables or do a dataset update?
proc sort data=sashelp.class out=class (keep=name age height);
    by name;
run;

proc sort data=sashelp.classfit out=classfit (keep=name lowermean uppermean);
    by name;
run;

data want;
    merge class (in=a) classfit (in=b);
    by name;
    if a;
run;

/* Or update - not sure which would be quicker, and I have not tested this code */
data want;
    update class classfit;
    by name;
run;
06-22-2017 01:21 PM
Thanks for the response. In real life the lookup tables are independent: each shares a common key with the fact table, but they do not share one with each other. Because of this I cannot combine both crosswalks and read them into a single hash object.
Here is what it looks like. Again, I have dozens of different fact tables that I need to run against the same lookup tables.
06-23-2017 10:10 AM
The answer to this depends on the exact details.
In particular, there are probably two or three reasonable options here - depending on your system, the datasets, etc.
If the many datasets with 800+ variables are either:
a) Identical in metadata [variable names, types, lengths]
b) Have a combination of identical variable names/types/lengths and non-overlapping variable names
Then there is a simple solution, where you have one data step and run each dataset through a loop.
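A minimal sketch of that idea, assuming hypothetical table names (fact1-fact3 with compatible metadata, lkp1 as a lookup): because all the fact tables flow through a single data step, the hashes are loaded exactly once.

```sas
data want;
    if _n_ = 1 then do;  /* hashes load once, not once per fact table */
        declare hash h1(dataset: "lkp1");
        h1.defineKey("key1");
        h1.defineData("val1");
        h1.defineDone();
    end;
    /* INDSNAME= records which input dataset the current row came from */
    set fact1 fact2 fact3 indsname=src;
    source = src;
    if h1.find() ne 0 then call missing(val1);
run;
```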
If your datasets are nonidentical and have overlapping variable names that are not identically typed, then you can't easily do that. You might be able to if you use some interesting renaming; I've done that, renaming every variable (through a macro) to DS_<var>, where DS was a prefix for that dataset, or even the dataset's name if sufficiently short. It's just messy and error-prone.
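One hedged sketch of such a renaming macro, driven by DICTIONARY.COLUMNS (the macro name and parameters are my own invention, and it assumes no resulting name exceeds the 32-character limit):

```sas
/* Prefix every variable in WORK.&ds with &pfx._ */
%macro prefix_vars(ds, pfx);
    %local renames;
    proc sql noprint;
        select catx('=', name, cats("&pfx._", name))
            into :renames separated by ' '
            from dictionary.columns
            where libname = 'WORK' and memname = upcase("&ds");
    quit;
    proc datasets lib=work nolist;
        modify &ds;
        rename &renames;
    quit;
%mend prefix_vars;

%prefix_vars(fact1, F1)
```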
You can use DOSUBL to generate data steps inside your main data step to do the matching work and allow persistent hash tables, but that's quite slow and probably not faster than simply reloading the datasets into hash tables.
Third, you could consider formats instead of hash tables. Formats are persistent and quite fast to look up, and if you're only writing a couple of them per dataset, not slow to load either. They do sometimes have performance issues with high numbers of formats, but in a server environment you might be okay there.
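A rough sketch of the format route, with assumed names (lkp1, key1, val1) and assuming key1 is character and unique: build the format once from the lookup table via CNTLIN=, then every subsequent step is just a PUT() call, since formats persist in the catalog across data steps.

```sas
/* CNTLIN= input needs FMTNAME, START, LABEL (and TYPE for character) */
data fmt1;
    set lkp1(rename=(key1=start val1=label));
    retain fmtname 'lkp1f' type 'C';
run;

proc format cntlin=fmt1;
run;

/* The format survives between steps, so each fact table only pays for PUT() */
data want;
    set fact1;
    val1 = put(key1, $lkp1f.);
run;
```

Note that a format returns a single label per key, so pulling several columns from one lookup table means building one format per column, which matches the "couple of them per dataset" caveat above.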
Fourth, another probably-slow-but-maybe-worth-checking option is PROC DS2. That would let you have persistent hash tables, I believe, and process each of your datasets. DS2 tends to be slow, though, on simple data reads.
If you can give us a bit more information on your problem's full scope, we can probably help you find the specific solution that's most appropriate.
06-23-2017 12:02 PM
The datasets are not identical in metadata and have a few overlapping variable names. Because of this I worry about stacking the datasets: I wouldn't want to accidentally truncate values in cases where the lengths disagreed but the variable names were the same.
Because of this I like your approach of renaming the variables to something new and then renaming them back after the merge. This could work, but I would also have to drop any variables that were in one dataset but not in the appended dataset (this is shown in the second example of the code I originally attached). It would also be easier if I used your renaming method and gave each variable a dataset prefix; that would simplify my KEEP= statements.
I might attempt using multiple formats to see how that works out. I'm not a fan of needing to concatenate the multiple primary keys into a single key for the format, and then doing the same on the base dataset; that seems like a lot of reads to me.
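For reference, the composite key could be built in a single pass with CATX (all names here are hypothetical, and the length is a guess to be sized against the real key widths); the same expression would be used when building the CNTLIN dataset so both sides agree.

```sas
data want;
    set fact1;
    length comp_key $ 64;                /* size to the real combined key width */
    comp_key = catx('|', key_a, key_b);  /* same delimiter on the format side */
    val = put(comp_key, $lkpf.);
run;
```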
At this point I think I am just going to sort and sort-merge my lookup tables, and split up the larger files prior to merging them. That, or revisit the settings on my UNIX server.
Right now my config file has
MEMSIZE=8G / SORTSIZE=1G / BUFNO=20 / BUFSIZE=512K
I will need to see if I can increase some of these to get the tables to load into the hashes faster. It is a virtual machine, so potentially I could have the server admin allocate more memory for this task, or move the job to a higher-powered server.
06-23-2017 12:59 PM