04-20-2018 03:27 PM
I'm combining a series of 28 SAS datasets, all of which are sorted by a common numeric key. I want the final output to be a single SAS dataset sorted by that key. Right now I'm doing an "interleave", that is, a DATA step with a SET statement listing all the datasets, immediately followed by a BY statement specifying the common numeric key. The result is a single SAS dataset, the entirety of which is sorted by that key. The combined total record count is around 230 million rows with about 480 columns.
This is on a 20-core (80 logical core) AIX machine with 252 GB of memory running SAS 9.4 M4. The raw input files are ASCII CSV files.
QUESTION: Is there a faster way to do the interleave? The performance isn't bad. Originally it was running 9 hours before I split the largest file (about 180 million rows) into 25 smaller files, but since the interleave is 50% of the total, it would be nice to cut it down to less than or equal to the time it takes to read in the data.
All of the files, after splitting the largest file into 25 files, are small enough that they can be loaded into hash tables. I'm not sure that all 28 files could be loaded into hash tables in a single DATA step, but no single file is so large that it can't be loaded into a hash table. I've loaded up to three files into hash tables in a single DATA step in the testing I've done so far. However, I am not aware of any method of using hash tables to combine all 230 million rows that offers the possibility of being faster than the interleave I'm already doing. Are there any hash techniques out there that I'm just not aware of? I need to keep all rows from all 28 SAS datasets.
Any ideas on how I might combine these 28 datasets while maintaining sort order faster than a SAS interleave?
I have not tried using SPDE for the 28 datasets. SPDE is not something I've used before. Would that have any potential at all to make things significantly faster? SPD Server is not available in our environment.
Maybe I'm just whining since I've already got the job down from 9 hours to 3, but since we frequently deal with datasets of 250+ million rows, I'd love to know about it if there were faster techniques to combine large SAS datasets while maintaining sort order.
04-20-2018 04:19 PM - edited 04-20-2018 04:45 PM
1. Load all your datasets into hash object 1.
2. Load all your unique keys into hash object 2.
3. Instantiate a hash iterator (hiter) and iterate through the keys of hash object 2.
4. While iterating, look up hash object 1 for each unique key in the PDV, extract all matching observations, and write them out to the output dataset.
5. Point 1 will require one full pass (using the dataset names in the DECLARE statement instantiation), and then the hiter can take over.
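If it helps, those five steps might be sketched in a single DATA step like this. This is a hedged sketch only: ds1-ds28, key, and want are placeholder names, and holding all 230 million rows of 480 columns in a hash object would require enormous memory.

```sas
data want;
   if 0 then set ds1-ds28;           /* define the host variables in the PDV */

   /* hash object 1: every observation, duplicate keys allowed */
   declare hash rows(multidata:'y');
   rows.definekey('key');
   rows.definedata(all:'y');
   rows.definedone();

   /* hash object 2: unique keys only, kept in ascending order */
   declare hash keys(ordered:'a');
   keys.definekey('key');
   keys.definedone();
   declare hiter hi('keys');

   /* one full pass to load both hash objects (point 5) */
   do until (done);
      set ds1-ds28 end=done;
      rows.add();
      rc = keys.ref();               /* adds the key only if not already present */
   end;

   /* iterate the ordered keys; write out every row for each key (points 3-4) */
   do while (hi.next() = 0);
      rc = rows.find();
      do while (rc = 0);
         output;
         rc = rows.find_next();
      end;
   end;
   stop;
   drop rc done;
run;
```

Whether this beats the interleave would have to be tested; the interleave only ever holds one row per input dataset in memory, while this holds everything.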
04-20-2018 04:50 PM
Interesting. I think the second hash table might not be necessary if all the data can be loaded into the first hash table, although maybe I'm not understanding your idea.
If I specify ORDERED:'a' on the first hash table, all records will automatically be in order. I then need to do hash_table1.OUTPUT(dataset:'sasout.output_data'). I believe this would require two passes through the data: 1. load the hash table; 2. write the hash table. Still, it would work. I wonder if it would be faster. I also wonder if I can really load 230 million records of 480 columns each into a hash table. I have a big machine available, but that is a lot of data.
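For what it's worth, the single-table variant described above might be sketched like this (placeholder names again; MULTIDATA:'y' is needed to keep duplicate keys, and memory is the big unknown):

```sas
data _null_;
   if 0 then set ds1-ds28;                  /* define host variables in the PDV */
   declare hash h(ordered:'a', multidata:'y', hashexp:20);
   h.definekey('key');
   h.definedata(all:'y');
   h.definedone();
   do until (done);
      set ds1-ds28 end=done;                /* pass 1: load the hash table */
      h.add();
   end;
   h.output(dataset:'sasout.output_data');  /* pass 2: write items in key order */
   stop;
run;
```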
04-20-2018 04:55 PM - edited 04-20-2018 04:57 PM
Hmm, you are right. I am not sure this approach is any faster. I guess that is something you can test in a dev environment and compare with the traditional interleave.
As for "I also wonder if I can really load 230 million records of 480 columns each into a hash table. I have a big machine available, but that is a lot of data":
HASHEXP:20 should hold, I think.
Btw, my approach with the second table was to zip through an unsorted hash table, as I was overthinking whether the ORDERED option is perhaps expensive.
04-20-2018 06:01 PM
I don't know for sure, but my thought is that records are added to a hash table in sort order as they are read. I think the hashing algorithm is such that the hash generated automatically places the record in sort order -- at least that's my understanding. I'll have to check some of Paul Dorfman's papers and see if I can find out exactly how that works.
I've done some hash sort tests (where I load a file into an ordered hash table and then write the file out in sort order from the hash table). I've compared the time it takes to do a hash sort vs. one of the provided SAS sorts (traditional sort, multi-threaded sort, SQL sort, and tag sort), and hash sort is never as fast as the SAS procedures except maybe on small files. Still, it might be interesting to try.
04-20-2018 06:32 PM
A good question in that SYNCSORT is known to be pretty fast. Alas, I don't have SYNCSORT available to me to run some tests.
04-20-2018 08:18 PM
If your datasets contain a lot of character columns, then I would test the performance of the COMPRESS=BINARY dataset option. If the compression ratio is high, say over 50%, I would expect your processing to speed up due to reduced I/O.
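A minimal way to test this (dataset names are placeholders; the log reports the compression ratio for each output dataset):

```sas
/* per-dataset */
data work.ds1_bin (compress=binary);
   set work.ds1;
run;

/* or session-wide, so every dataset created afterward is compressed */
options compress=binary;
```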
04-20-2018 09:23 PM
I've tried both binary and character compression. Binary definitely creates a smaller file. I'm not sure if I'm understanding you correctly -- would switching to character compression make things go faster? That seems counter-intuitive, but I suppose I could try it.
04-20-2018 11:18 PM
I believe binary compression improves performance since it has the smallest "footprint", but I haven't run any tests. I suppose I should run three tests: COMPRESS=NO, COMPRESS=CHAR, and COMPRESS=BINARY.
I'll try it when I get a chance although I suspect I'm already using the fastest option, binary.
04-20-2018 07:35 PM - edited 04-20-2018 07:39 PM
I am not sure I understand exactly what you are saying is your process. Sounds like you are just trying to describe code like this:
data want;
   set ds1-ds28;
   by key;
run;
Why are you starting from text files instead of datasets? Can't you just create the datasets once instead of spending 45 minutes each time to convert them from text into datasets? Most SAS jobs are I/O bound so if the text files are compressed that might save time.
data ds1;
   infile 'file1.csv.gz' zip gzip dsd truncover firstobs=2;
   ....
run;
Do you have SAS/Connect? Can you spin up N sessions to each read 28/N of the text files independently?
waitfor _all_;
data want;
   set server1.ds1 server1.ds2 server2.ds1 server2.ds2 ...... ;
   by key;
run;
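Fleshed out slightly, the SAS/CONNECT idea might look like the following. This is a sketch only: session names, filenames, and the INPUT statement are placeholders, and INHERITLIB= assumes the parent session's SASOUT libref should be visible to the child sessions.

```sas
options autosignon sascmd='!sascmd';      /* spawn child sessions on demand */

rsubmit server1 wait=no inheritlib=(sasout);
   data sasout.ds1;
      infile 'file1.csv.gz' zip gzip dsd truncover firstobs=2;
      /* input ... ; */
   run;
endrsubmit;

rsubmit server2 wait=no inheritlib=(sasout);
   data sasout.ds2;
      infile 'file2.csv.gz' zip gzip dsd truncover firstobs=2;
      /* input ... ; */
   run;
endrsubmit;

waitfor _all_ server1 server2;            /* block until both reads finish */

data want;
   set sasout.ds1 sasout.ds2 /* ... */;
   by key;
run;
```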
04-20-2018 08:15 PM - edited 04-20-2018 08:16 PM
Your illustration (set ds1 - ds28; by key; ) is exactly what I'm doing -- after I've read in all the text files. The files are csv but have been gzipped first.
Why do I keep reading them in? Well, these are from an external source. We get a new set every month. We get four files totaling about 220 to 250 million records, depending on the month. One file is about a million, one about 30 to 35 million, one about 10 to 15 million, and the last one, by far the largest, is about 180 to 220 million records. The job was originally single-threaded and took about 9 hours, including the analytical steps that follow the input and interleave steps. Eek.
I took the largest file and, using multiple threads, read it into 25 separate SAS datasets. The multi-threaded read of the largest file now takes 30 to 45 minutes instead of 4+ hours, depending on how many subordinate threads I use. However, I then have to do the interleave, as you described (set ds1-ds28; by key;). The interleave takes 90 minutes, which is 2 to 3 times as long as it takes to read in the data. Maybe that's the best that can be had, but I thought I'd ask if there were a faster way. Taking a job from a 9-hour run time to 3 hours is hardly disappointing, but, you know, I'd like to see if there are any other performance gains to be had.
P.S. We don't have MP CONNECT, but multi-threading can still be done through UNIX.
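One way to do that from within a SAS session, without MP CONNECT, is SYSTASK. The program names below are placeholders; the assumption is that each child program reads one chunk of the raw data.

```sas
/* launch child SAS jobs asynchronously (SYSTASK COMMAND defaults to NOWAIT) */
systask command "sas -sysin read_part01.sas" taskname=t1;
systask command "sas -sysin read_part02.sas" taskname=t2;
/* ... one SYSTASK per chunk ... */

waitfor _all_ t1 t2;    /* block until every child job finishes */
```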
04-20-2018 10:53 PM
Get a lot of fast disks? I/O is usually the bottleneck with SAS.
Check if your system has some large fast disks reserved for temporary files and try to use those. I have had some success using OS commands to copy large gzipped files from regular disks to the SAS WORK disk and then reading them there.
SPDE should help with every step. You can spread the SAS datasets across multiple disks. Make sure they are using different I/O controllers if you can. You can use COMPRESS=BINARY with the SPDE engine and get much better compression than with the BASE engine. If nothing else, that should save a lot of I/O.
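A hedged sketch of such an SPDE libref (all paths and sizes here are placeholders; ideally each DATAPATH= entry sits on a different disk and, where possible, a different I/O controller):

```sas
libname spdeout spde '/spde/meta'               /* metadata path            */
   datapath=('/disk1/spde' '/disk2/spde'
             '/disk3/spde' '/disk4/spde')       /* data partitions, striped */
   indexpath=('/disk5/spde')                    /* index files              */
   partsize=4g                                  /* size of each partition   */
   compress=binary;                             /* binary compression       */
```

Datasets written through spdeout are then transparently spread across the data paths.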
I don't think you can multi-thread reading from a single large GZIP file. So you might speed up the first step by getting the large file delivered as multiple small files.
04-20-2018 11:28 PM - edited 04-20-2018 11:31 PM
Interesting. I'll have to try SPDE, although I haven't used it before. I've read that optimal setup is not trivial. I wasn't sure that SPDE would improve things for sequential access.
The idea of fast disks is interesting although taking the time to copy the raw data from our "landing zone" to another disk might negate the time savings obtained from using a faster disk. Still, perhaps the SAS datasets themselves could be written to faster disks even if the raw data is on slower disks. Perhaps conducting the interleave, the bottleneck, on the fast disks is the answer.
As for multiple threads simultaneously reading a single gzipped file, it seems to be working fine. In fact it's working fabulously. Single-threaded was 4 hours; multi-threaded is 45 minutes (or less if I go crazy on allocating threads), roughly a five-fold improvement in run time. I'm doing a PIPE "gunzip -c exciting_filename.csv" in my FILENAME statement.
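For anyone curious, the pattern described above might look roughly like this (the filename and row range are placeholders; note that each parallel job still decompresses the whole stream, it just keeps only its own slice of rows):

```sas
/* each parallel job runs the same pipe but keeps a different row range */
filename gz pipe "gunzip -c bigfile.csv.gz";

data part01;
   infile gz dsd truncover firstobs=2 obs=10000001;  /* this job's slice */
   /* input key ... ; */
run;
```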