<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Combining multiple SAS datasets while maintaining sort order? in SAS Programming</title>
    <link>https://communities.sas.com/t5/SAS-Programming/Combining-multiple-SAS-datasets-while-maintaining-sort-order/m-p/456134#M115508</link>
    <description>&lt;P&gt;Get a lot of fast disks?&amp;nbsp; I/O is usually the bottleneck with SAS.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Check if your system has some large fast disks reserved for temporary files and try to use those.&amp;nbsp; I have had some success with using OS commands to copy large gzipped files from regular disks to the SAS work disk and then reading them.&amp;nbsp;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;SPDE should help with every step.&amp;nbsp; You can spread the SAS datasets across multiple disks. Make sure they are using different I/O controllers if you can.&amp;nbsp; You can use COMPRESS=BINARY with the SPDE engine and get much better compression than with the BASE engine.&amp;nbsp; If nothing else, that should save a lot of I/O.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I don't think you can multi-thread reading from a single large GZIP file. So you might speed up the first step by getting the large file delivered as multiple small files.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
    <pubDate>Sat, 21 Apr 2018 02:53:16 GMT</pubDate>
    <dc:creator>Tom</dc:creator>
    <dc:date>2018-04-21T02:53:16Z</dc:date>
    <item>
      <title>Combining multiple SAS datasets while maintaining sort order?</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Combining-multiple-SAS-datasets-while-maintaining-sort-order/m-p/456052#M115477</link>
      <description>&lt;P&gt;I'm combining a series of 28&amp;nbsp;SAS datasets, all of which are sorted by a common numeric key.&amp;nbsp; I want the final output to be a single SAS data set sorted by the common numeric key.&amp;nbsp; Right now, I'm doing an "interleave", that is, a DATA step with a SET statement listing all 28 datasets, immediately followed by a BY statement specifying the common numeric key.&amp;nbsp; The result is a single SAS dataset, the entirety of which is sorted by the aforementioned common numeric key.&amp;nbsp;The combined total record count is around 230 million rows with about 480 columns.&amp;nbsp;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Timings:&lt;/STRONG&gt;&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Reading the raw files to create the 28 SAS datasets takes &lt;STRONG&gt;45 minutes&lt;/STRONG&gt;.&lt;/LI&gt;
&lt;LI&gt;The interleave takes &lt;STRONG&gt;90 minutes&lt;/STRONG&gt; (double the time to read).&lt;/LI&gt;
&lt;LI&gt;The follow-on analytical steps take &lt;STRONG&gt;45 minutes&lt;/STRONG&gt;.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;TOTAL&lt;/STRONG&gt;: about &lt;STRONG&gt;3 hours&lt;/STRONG&gt;, half of which is the interleave.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;This is on a 20-core (80 logical core) AIX machine with 252 GB of memory running SAS 9.4 M4.&amp;nbsp; The raw input files are ASCII CSV files.&amp;nbsp;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;QUESTION:&amp;nbsp; Is there a faster way to do the interleave?&amp;nbsp; The performance isn't bad.&amp;nbsp; Originally it was running 9 hours before I split the largest file (about 180 million rows) into 25 smaller files, but since the interleave is 50% of the total, it would be nice to cut it down to less than or equal to the time it takes to read in the data.&amp;nbsp;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;All of the files, after splitting the largest file into 25 files, are small enough that they can&amp;nbsp;be loaded into hash tables.&amp;nbsp; I'm not sure that all 28 files could be loaded into hash tables in a single DATA step, but no single file is so large that it can't be loaded into a hash table.&amp;nbsp; I've loaded up to three files into hash tables in a single DATA step in the testing I've done so far.&amp;nbsp; However, I&amp;nbsp;am not aware of any method of using hash tables to combine all 230 million rows that offers the possibility of being faster than the interleave that I'm already doing.&amp;nbsp; Are there any hash techniques out there that I'm just not aware of?&amp;nbsp; I need to keep all rows from all 28 SAS datasets.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Any ideas on how I might combine these 28 datasets while maintaining sort order faster than a SAS interleave?&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I have not tried using SPDE for the 28 datasets.&amp;nbsp; SPDE is not something I've used before.&amp;nbsp; Would that have any potential at all to make things significantly faster?&amp;nbsp; SPDS is not available in our environment.&amp;nbsp;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Maybe I'm just whining since I've already got the job down from 9 hours to 3, but since we frequently deal with datasets of 250+ million rows, I'd love&amp;nbsp;to know if there are faster techniques to combine large SAS datasets while maintaining sort order.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Thanks,&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Jim&lt;/P&gt;</description>
      <pubDate>Fri, 20 Apr 2018 19:27:11 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Combining-multiple-SAS-datasets-while-maintaining-sort-order/m-p/456052#M115477</guid>
      <dc:creator>jimbarbour</dc:creator>
      <dc:date>2018-04-20T19:27:11Z</dc:date>
    </item>
    <item>
      <title>Re: Combining multiple SAS datasets while maintaining sort order?</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Combining-multiple-SAS-datasets-while-maintaining-sort-order/m-p/456068#M115483</link>
      <description>&lt;P&gt;How about:&amp;nbsp;&lt;/P&gt;&lt;P&gt;1. Load all your datasets into hash object 1.&lt;/P&gt;&lt;P&gt;2. Load all your unique keys into hash object 2.&lt;/P&gt;&lt;P&gt;3. Instantiate a hiter and iterate through the keys of hash object 2.&lt;/P&gt;&lt;P&gt;4. While iterating, look up hash object 1 for each unique key from the PDV, extract all observations, and write them out to the output dataset.&lt;/P&gt;&lt;P&gt;5. Step 1 will require one full pass, using the dataset names in the declare statement instantiation, but then the hiter can take over.&lt;/P&gt;</description>
      <pubDate>Fri, 20 Apr 2018 20:45:28 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Combining-multiple-SAS-datasets-while-maintaining-sort-order/m-p/456068#M115483</guid>
      <dc:creator>novinosrin</dc:creator>
      <dc:date>2018-04-20T20:45:28Z</dc:date>
    </item>
    <item>
      <title>Re: Combining multiple SAS datasets while maintaining sort order?</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Combining-multiple-SAS-datasets-while-maintaining-sort-order/m-p/456086#M115487</link>
      <description>&lt;P&gt;Interesting.&amp;nbsp; I think the second hash table might not be necessary if all the data can be loaded into the first hash table, although maybe I'm not understanding your idea.&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;If I specify "ordered:a" on the first hash table, all records will automatically be in order.&amp;nbsp; I then need to do a hash_table1.OUTPUT(dataset:'sasout.output_data').&amp;nbsp; I believe this would require two passes through the data:&amp;nbsp; 1.&amp;nbsp;Load the hash table; 2.&amp;nbsp;Write the hash table.&amp;nbsp;&amp;nbsp;Still, it would work.&amp;nbsp; I wonder if it would be faster.&amp;nbsp; I also wonder if I can really load 230 million records, each with 480 columns, into a hash table.&amp;nbsp; I have a big machine available, but that is a lot of data.&lt;/P&gt;
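&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Something like this minimal sketch of the idea (the dataset, library, and key names are placeholders):&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data _null_;
  /* Define the PDV variables without reading any rows */
  if 0 then set ds1-ds28;
  /* 'a' = ascending key order; multidata keeps duplicate keys */
  declare hash h(ordered:'a', multidata:'y');
  h.definekey('key');
  h.definedata(all:'y');
  h.definedone();
  /* Pass 1: load every row from all 28 datasets */
  do until (done);
    set ds1-ds28 end=done;
    h.add();
  end;
  /* Pass 2: write the hash back out in ascending key order */
  h.output(dataset:'sasout.output_data');
run;&lt;/CODE&gt;&lt;/PRE&gt;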
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Jim&lt;/P&gt;</description>
      <pubDate>Fri, 20 Apr 2018 20:50:10 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Combining-multiple-SAS-datasets-while-maintaining-sort-order/m-p/456086#M115487</guid>
      <dc:creator>jimbarbour</dc:creator>
      <dc:date>2018-04-20T20:50:10Z</dc:date>
    </item>
    <item>
      <title>Re: Combining multiple SAS datasets while maintaining sort order?</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Combining-multiple-SAS-datasets-while-maintaining-sort-order/m-p/456089#M115488</link>
      <description>&lt;P&gt;Hmm, you are right. I am not sure if this approach is any faster. I guess that is something you can test in a dev environment and compare with the traditional interleave.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;For&amp;nbsp;&lt;EM&gt;"I also wonder if I can really load 230 million records with each 480 columns into a hash table.&amp;nbsp; I have a big machine available, but that is a lot of data"&lt;/EM&gt;&lt;/SPAN&gt;:&amp;nbsp;&lt;/P&gt;&lt;P&gt;Hashexp:20 should hold, I think?&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Btw, my approach with the second table was to zip through an unsorted hash table, as I was overthinking whether the ordered option is perhaps expensive.&lt;/P&gt;</description>
      <pubDate>Fri, 20 Apr 2018 20:57:08 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Combining-multiple-SAS-datasets-while-maintaining-sort-order/m-p/456089#M115488</guid>
      <dc:creator>novinosrin</dc:creator>
      <dc:date>2018-04-20T20:57:08Z</dc:date>
    </item>
    <item>
      <title>Re: Combining multiple SAS datasets while maintaining sort order?</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Combining-multiple-SAS-datasets-while-maintaining-sort-order/m-p/456103#M115497</link>
      <description>&lt;P&gt;I'm wondering, is syncsort any good at multi-threading?&lt;/P&gt;</description>
      <pubDate>Fri, 20 Apr 2018 21:52:45 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Combining-multiple-SAS-datasets-while-maintaining-sort-order/m-p/456103#M115497</guid>
      <dc:creator>PGStats</dc:creator>
      <dc:date>2018-04-20T21:52:45Z</dc:date>
    </item>
    <item>
      <title>Re: Combining multiple SAS datasets while maintaining sort order?</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Combining-multiple-SAS-datasets-while-maintaining-sort-order/m-p/456104#M115498</link>
      <description>&lt;P&gt;I don't know for sure, but my thought is that records are added to a hash table in sort order as they are read.&amp;nbsp; I think the hashing algorithm is such that the hash generated automatically places the record in sort order -- at least that's my&amp;nbsp;understanding.&amp;nbsp; I'll have to check some of Paul Dorfman's papers and see if I can find out exactly how that works.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I've done some hash sort tests (where I load a file into an ordered hash table and then write the file out in sort order from the hash table).&amp;nbsp; I've compared the time it takes to do a hash sort vs. one of the provided SAS sorts (traditional sort, multi-threaded sort, SQL sort, and tag sort), and hash sort is never as fast as the SAS procedures except maybe on small files.&amp;nbsp; Still, it might be interesting to try.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Jim&lt;/P&gt;</description>
      <pubDate>Fri, 20 Apr 2018 22:01:46 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Combining-multiple-SAS-datasets-while-maintaining-sort-order/m-p/456104#M115498</guid>
      <dc:creator>jimbarbour</dc:creator>
      <dc:date>2018-04-20T22:01:46Z</dc:date>
    </item>
    <item>
      <title>Re: Combining multiple SAS datasets while maintaining sort order?</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Combining-multiple-SAS-datasets-while-maintaining-sort-order/m-p/456108#M115500</link>
      <description>&lt;P&gt;A good question in that SYNCSORT is known to be pretty fast.&amp;nbsp; Alas, I don't have SYNCSORT available to me to run some tests.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Jim&lt;/P&gt;</description>
      <pubDate>Fri, 20 Apr 2018 22:32:18 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Combining-multiple-SAS-datasets-while-maintaining-sort-order/m-p/456108#M115500</guid>
      <dc:creator>jimbarbour</dc:creator>
      <dc:date>2018-04-20T22:32:18Z</dc:date>
    </item>
    <item>
      <title>Re: Combining multiple SAS datasets while maintaining sort order?</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Combining-multiple-SAS-datasets-while-maintaining-sort-order/m-p/456117#M115503</link>
      <description>&lt;P&gt;I am not sure I understand exactly what your process is.&amp;nbsp; Sounds like you are just trying to describe code like this:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data want;
  set ds1-ds28 ;
  by key;
run;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;Why are you starting from text files instead of datasets?&amp;nbsp; Can't you just create the datasets once instead of spending 45 minutes each time to convert them from text into datasets?&amp;nbsp; Most SAS jobs are I/O bound, so if the text files are compressed, that might save time.&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data ds1 ;
  infile 'file1.csv.gz' zip gzip dsd truncover firstobs=2;
  ....
run;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;Do you have SAS/Connect?&amp;nbsp; Can you spin up N sessions to each read 28/N of the text files independently?&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;waitfor _all_ ;
data want ;
  set server1.ds1 server1.ds2 server2.ds1 server2.ds2 ...... ;
  by key;
run;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Fri, 20 Apr 2018 23:39:39 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Combining-multiple-SAS-datasets-while-maintaining-sort-order/m-p/456117#M115503</guid>
      <dc:creator>Tom</dc:creator>
      <dc:date>2018-04-20T23:39:39Z</dc:date>
    </item>
    <item>
      <title>Re: Combining multiple SAS datasets while maintaining sort order?</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Combining-multiple-SAS-datasets-while-maintaining-sort-order/m-p/456121#M115504</link>
      <description>&lt;P&gt;Hi, Tom,&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Your illustration (set ds1 - ds28; by key; ) is &lt;EM&gt;exactly&lt;/EM&gt; what I'm doing -- after I've read in all the text files.&amp;nbsp; The files are CSV but have been gzipped first.&lt;BR /&gt;&lt;BR /&gt;Why do I keep reading them in?&amp;nbsp; Well, these are from an external source.&amp;nbsp; We get a new set every month.&amp;nbsp; We get four files totaling about 220 to 250 million records, depending on the month.&amp;nbsp; One file is about a million, one about 30 to 35 million, one about 10 to 15 million, and the last one, by far the largest, is about 180 to 220 million records.&amp;nbsp; The job was originally single-threaded and took about 9 hours, including the analytical steps that follow the input and interleave steps.&amp;nbsp; Eek.&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I took the largest file and, using multiple threads, read it into 25 separate SAS datasets.&amp;nbsp; The multi-threaded read of the largest file now takes 30 - 45 minutes instead of 4+ hours, depending on how many subordinate threads I use.&amp;nbsp; However, I then have to do the interleave, as you described (&lt;SPAN&gt;set ds1 - ds28; by key; ).&amp;nbsp; The interleave takes 90 minutes, which is 2 to 3 times as long as it takes to read in the data.&amp;nbsp; Maybe that's the best that can be had, but I thought I'd ask if there were a faster way.&amp;nbsp; Taking a job from 9 hours run time to 3 hours is hardly disappointing, but, you know, I'd like to see if there are any other performance gains to be had if I can.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;Jim&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;P.S.&amp;nbsp; We don't have MP Connect, but multi-threading can still be done through UNIX.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Sat, 21 Apr 2018 00:16:55 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Combining-multiple-SAS-datasets-while-maintaining-sort-order/m-p/456121#M115504</guid>
      <dc:creator>jimbarbour</dc:creator>
      <dc:date>2018-04-21T00:16:55Z</dc:date>
    </item>
    <item>
      <title>Re: Combining multiple SAS datasets while maintaining sort order?</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Combining-multiple-SAS-datasets-while-maintaining-sort-order/m-p/456122#M115505</link>
      <description>&lt;P&gt;If your datasets contain a lot of character columns then I would test the performance of the COMPRESS=BINARY dataset option. If the compression ratio is high, say over 50%, I would expect your processing to speed up due to reduced I/O.&lt;/P&gt;</description>
      <pubDate>Sat, 21 Apr 2018 00:18:52 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Combining-multiple-SAS-datasets-while-maintaining-sort-order/m-p/456122#M115505</guid>
      <dc:creator>SASKiwi</dc:creator>
      <dc:date>2018-04-21T00:18:52Z</dc:date>
    </item>
    <item>
      <title>Re: Combining multiple SAS datasets while maintaining sort order?</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Combining-multiple-SAS-datasets-while-maintaining-sort-order/m-p/456126#M115506</link>
      <description>&lt;P&gt;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/13976"&gt;@SASKiwi&lt;/a&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I've tried both binary and character compression.&amp;nbsp; Binary definitely creates a smaller file.&amp;nbsp; I'm not sure if I'm understanding you correctly -- would switching to character compression make things go faster?&amp;nbsp; That seems counter-intuitive, but I suppose I could try it.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Jim&lt;/P&gt;</description>
      <pubDate>Sat, 21 Apr 2018 01:23:57 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Combining-multiple-SAS-datasets-while-maintaining-sort-order/m-p/456126#M115506</guid>
      <dc:creator>jimbarbour</dc:creator>
      <dc:date>2018-04-21T01:23:57Z</dc:date>
    </item>
    <item>
      <title>Re: Combining multiple SAS datasets while maintaining sort order?</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Combining-multiple-SAS-datasets-while-maintaining-sort-order/m-p/456132#M115507</link>
      <description>&lt;P&gt;If BINARY compresses your datasets the most I would stick with it - does it improve performance?&lt;/P&gt;</description>
      <pubDate>Sat, 21 Apr 2018 02:40:21 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Combining-multiple-SAS-datasets-while-maintaining-sort-order/m-p/456132#M115507</guid>
      <dc:creator>SASKiwi</dc:creator>
      <dc:date>2018-04-21T02:40:21Z</dc:date>
    </item>
    <item>
      <title>Re: Combining multiple SAS datasets while maintaining sort order?</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Combining-multiple-SAS-datasets-while-maintaining-sort-order/m-p/456134#M115508</link>
      <description>&lt;P&gt;Get a lot of fast disks?&amp;nbsp; I/O is usually the bottleneck with SAS.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Check if your system has some large fast disks reserved for temporary files and try to use those.&amp;nbsp; I have had some success with using OS commands to copy large gzipped files from regular disks to the SAS work disk and then reading them.&amp;nbsp;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;SPDE should help with every step.&amp;nbsp; You can spread the SAS datasets across multiple disks. Make sure they are using different I/O controllers if you can.&amp;nbsp; You can use COMPRESS=BINARY with the SPDE engine and get much better compression than with the BASE engine.&amp;nbsp; If nothing else, that should save a lot of I/O.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I don't think you can multi-thread reading from a single large GZIP file. So you might speed up the first step by getting the large file delivered as multiple small files.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Sat, 21 Apr 2018 02:53:16 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Combining-multiple-SAS-datasets-while-maintaining-sort-order/m-p/456134#M115508</guid>
      <dc:creator>Tom</dc:creator>
      <dc:date>2018-04-21T02:53:16Z</dc:date>
    </item>
    <item>
      <title>Re: Combining multiple SAS datasets while maintaining sort order?</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Combining-multiple-SAS-datasets-while-maintaining-sort-order/m-p/456136#M115509</link>
      <description>&lt;P&gt;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/13976"&gt;@SASKiwi&lt;/a&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I believe binary compression improves performance since it's the smallest "footprint", but I haven't run any tests.&amp;nbsp; I suppose I should run three tests:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;COMPRESS=NO&lt;/LI&gt;
&lt;LI&gt;COMPRESS=CHAR&lt;/LI&gt;
&lt;LI&gt;COMPRESS=BINARY&lt;/LI&gt;
&lt;/UL&gt;
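&lt;P&gt;A simple way to time all three, assuming a source dataset named WORK.BIG (FULLSTIMER writes detailed timings to the log):&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;options fullstimer;

data test_none(compress=no);     set big; run;
data test_char(compress=char);   set big; run;
data test_bin(compress=binary);  set big; run;&lt;/CODE&gt;&lt;/PRE&gt;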
&lt;P&gt;I'll try it when I get a chance, although I suspect I'm already using the fastest option, binary.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Jim&lt;/P&gt;</description>
      <pubDate>Sat, 21 Apr 2018 03:18:57 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Combining-multiple-SAS-datasets-while-maintaining-sort-order/m-p/456136#M115509</guid>
      <dc:creator>jimbarbour</dc:creator>
      <dc:date>2018-04-21T03:18:57Z</dc:date>
    </item>
    <item>
      <title>Re: Combining multiple SAS datasets while maintaining sort order?</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Combining-multiple-SAS-datasets-while-maintaining-sort-order/m-p/456138#M115510</link>
      <description>&lt;P&gt;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/159"&gt;@Tom&lt;/a&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Interesting.&amp;nbsp; I'll have to try SPDE, although I haven't ever used it. I've read that optimal setup is not trivial.&amp;nbsp; I wasn't sure that SPDE would improve things for sequential access.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;The idea of fast disks is interesting although taking the time to copy the raw data from our "landing zone" to another disk might negate the time savings obtained from using a faster disk.&amp;nbsp; Still, perhaps the SAS datasets themselves could be written to faster disks even if the raw data is on slower disks.&amp;nbsp; Perhaps conducting the interleave, the bottleneck, on the fast disks is the answer.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;As for multiple threads simultaneously reading a single gzipped file, it seems to be working fine.&amp;nbsp; In fact it's working fabulously.&amp;nbsp; Single-threaded was 4 hours.&amp;nbsp; Multi-threaded is 45 minutes (or less if I go crazy on allocating threads), about a 5x improvement in run time.&amp;nbsp; I'm doing a PIPE "gunzip -c&amp;nbsp;&lt;EM&gt;exciting_filename.csv"&amp;nbsp;&lt;/EM&gt;in my FILENAME statement.&lt;/P&gt;
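&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;For reference, that FILENAME PIPE pattern looks something like this (the fileref, path, and input variables are hypothetical):&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;/* The pipe decompresses on the fly; SAS reads the gunzip output as a plain text stream */
filename rawdata pipe 'gunzip -c /landing/zone/exciting_filename.csv.gz';

data ds1;
  infile rawdata dsd truncover firstobs=2;
  input key var1 var2;  /* full INPUT statement elided */
run;&lt;/CODE&gt;&lt;/PRE&gt;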
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Jim&lt;/P&gt;</description>
      <pubDate>Sat, 21 Apr 2018 03:31:48 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Combining-multiple-SAS-datasets-while-maintaining-sort-order/m-p/456138#M115510</guid>
      <dc:creator>jimbarbour</dc:creator>
      <dc:date>2018-04-21T03:31:48Z</dc:date>
    </item>
    <item>
      <title>Re: Combining multiple SAS datasets while maintaining sort order?</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Combining-multiple-SAS-datasets-while-maintaining-sort-order/m-p/456144#M115511</link>
      <description>&lt;P&gt;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/37107"&gt;@jimbarbour&lt;/a&gt;&lt;/P&gt;
&lt;P&gt;Do you have to pre-sort your source data before the merge, or is the data already sorted at source? If it's already sorted, then it would be worth trying to read and merge all the data in a single data step, as this would save a full read/write cycle.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Another option: Even if you don't have SAS/Connect, you could still read all the gzipped sources in parallel, using one batch job per source. If you don't have a scheduler then &lt;EM&gt;systask&lt;/EM&gt; together with the &lt;EM&gt;waitfor&lt;/EM&gt; statement will allow you to control the whole process and its dependencies from a single master job.&lt;/P&gt;
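&lt;P&gt;A minimal sketch of that systask/waitfor pattern (the program and task names are hypothetical):&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;/* Launch one batch SAS session per source file, without waiting */
systask command "sas read_file1.sas" nowait taskname=read1 status=rc1;
systask command "sas read_file2.sas" nowait taskname=read2 status=rc2;

/* Block the master job until every subordinate session has finished */
waitfor _all_ read1 read2;&lt;/CODE&gt;&lt;/PRE&gt;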
&lt;P&gt;&lt;A href="http://support.sas.com/documentation/cdl/en/hostunx/63053/HTML/default/viewer.htm#p0w8zwo1dyssdfn1mjm11dt2v7e2.htm&amp;nbsp;" target="_blank"&gt;http://support.sas.com/documentation/cdl/en/hostunx/63053/HTML/default/viewer.htm#p0w8zwo1dyssdfn1mjm11dt2v7e2.htm&amp;nbsp;&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;And last but not least: Using the SPDE engine remains worth trying.&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Sat, 21 Apr 2018 05:47:26 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Combining-multiple-SAS-datasets-while-maintaining-sort-order/m-p/456144#M115511</guid>
      <dc:creator>Patrick</dc:creator>
      <dc:date>2018-04-21T05:47:26Z</dc:date>
    </item>
    <item>
      <title>Re: Combining multiple SAS datasets while maintaining sort order?</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Combining-multiple-SAS-datasets-while-maintaining-sort-order/m-p/456147#M115513</link>
      <description>&lt;P&gt;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/12447"&gt;@Patrick&lt;/a&gt;.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I've been doing just what you suggest:&amp;nbsp; Using a SYSTASK with a NOWAIT parameter to launch multiple subordinate threads followed by a WAITFOR _ALL_.&amp;nbsp; Generally, it works very well, but be warned that you do &lt;EM&gt;not&lt;/EM&gt; want NOWORKINIT in your cfg file.&amp;nbsp; Ask me how I know that.&amp;nbsp; &lt;span class="lia-unicode-emoji" title=":slightly_smiling_face:"&gt;🙂&lt;/span&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;The data is sorted when we get it, but I don't think combining the 28 datasets and reading them all in one DATA step will save time.&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Here's why:&lt;/P&gt;
&lt;P&gt;Single threaded, the job takes 4 hours to read the largest data set, then 1.5 hours to interleave with the three other datasets.&amp;nbsp; Doing the combining in the same DATA step as the read would eliminate the 1.5 hours, but it would still take at least 4 hours to read in the largest data set.&amp;nbsp; The largest data set has to be read in single-threaded in order for the merge to work correctly.&amp;nbsp;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;With my multi-threaded approach, I can get the largest data set read in 45 minutes (sometimes 30 if I allocate more threads).&amp;nbsp; The interleave still takes 1.5 hours.&amp;nbsp; So, we're talking about 2 to 2.5 hours with the multi-threaded approach with separate read and merge phases vs. a minimum of 4 hours with the combined read/merge in a single DATA step.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;SPDE remains intriguing.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Jim&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Sat, 21 Apr 2018 06:03:49 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Combining-multiple-SAS-datasets-while-maintaining-sort-order/m-p/456147#M115513</guid>
      <dc:creator>jimbarbour</dc:creator>
      <dc:date>2018-04-21T06:03:49Z</dc:date>
    </item>
    <item>
      <title>Re: Combining multiple SAS datasets while maintaining sort order?</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Combining-multiple-SAS-datasets-while-maintaining-sort-order/m-p/456433#M115609</link>
      <description>&lt;P&gt;One way to speed up things may be to do the reading and the interleaving in one step. In other words, instead of reading the files first, create the input steps as data step views, and then set the views:&lt;/P&gt;&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data v1/view=v1;
  infile file1 dsd delimiter=';';
  input...
/* etc. */

data v2/view=v2;
  infile file2 dsd delimiter=';';
  input...
/* etc */

data interleaved;
  set v1 v2 v3 v4 v5 v6...;
  by num_key;
run;
 &lt;/CODE&gt;&lt;/PRE&gt;</description>
      <pubDate>Mon, 23 Apr 2018 09:29:24 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Combining-multiple-SAS-datasets-while-maintaining-sort-order/m-p/456433#M115609</guid>
      <dc:creator>s_lassen</dc:creator>
      <dc:date>2018-04-23T09:29:24Z</dc:date>
    </item>
    <item>
      <title>Re: Combining multiple SAS datasets while maintaining sort order?</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Combining-multiple-SAS-datasets-while-maintaining-sort-order/m-p/456455#M115619</link>
      <description>&lt;P&gt;If you have some kind of relational database available, consider using it. I was confronted with exactly the same issue: the CSV read-ins were taking an enormous amount of time (2 hours). We imported the data into a Teradata warehouse and then imported it from Teradata into SAS, which was very quick. A significant improvement was made, and the new import time from Teradata was between 5 and 10 minutes.&lt;/P&gt;</description>
      <pubDate>Mon, 23 Apr 2018 10:25:28 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Combining-multiple-SAS-datasets-while-maintaining-sort-order/m-p/456455#M115619</guid>
      <dc:creator>emrancaan</dc:creator>
      <dc:date>2018-04-23T10:25:28Z</dc:date>
    </item>
    <item>
      <title>Re: Combining multiple SAS datasets while maintaining sort order?</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Combining-multiple-SAS-datasets-while-maintaining-sort-order/m-p/456749#M115731</link>
      <description>&lt;P&gt;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/37107"&gt;@jimbarbour&lt;/a&gt;&lt;/P&gt;
&lt;P&gt;It looks to me like you've already done a lot of the good stuff that's available to you. One of the things you could still try is to test your I/O-relevant settings and see if you can optimize them for your use case. I'm not sure how much you've done already and whether it's even worth your time, but it could be interesting.&lt;/P&gt;
&lt;P&gt;&lt;A href="http://support.sas.com/kb/51/660.html&amp;nbsp;" target="_blank"&gt;http://support.sas.com/kb/51/660.html&amp;nbsp;&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;&lt;A href="http://support.sas.com/kb/46/954.html&amp;nbsp;" target="_blank"&gt;http://support.sas.com/kb/46/954.html&amp;nbsp;&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Tue, 24 Apr 2018 02:22:49 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Combining-multiple-SAS-datasets-while-maintaining-sort-order/m-p/456749#M115731</guid>
      <dc:creator>Patrick</dc:creator>
      <dc:date>2018-04-24T02:22:49Z</dc:date>
    </item>
  </channel>
</rss>

