jimbarbour
Meteorite | Level 14

I'm combining a series of 28 SAS datasets, all of which are sorted by a common numeric key.  I want the final output to be a single SAS dataset sorted by that common numeric key.  Right now, I'm doing an "interleave": a DATA step with a single SET statement listing all 28 datasets, immediately followed by a BY statement specifying the common numeric key.  The result is a single SAS dataset, the entirety of which is sorted by the aforementioned key.  The combined total record count is around 230 million rows with about 480 columns.

 

Timings:

  • Reading the raw files to create the 28 SAS datasets takes 45 minutes.
  • The interleave takes 90 minutes (double the time to read).
  • The follow-on analytical steps take 45 minutes.
  • TOTAL: about 3 hours, half of which is the interleave.

 

This is on a 20-core (80 logical cores) AIX machine with 252 GB of memory running SAS 9.4 M4.  The raw input files are ASCII CSV files.

 

QUESTION:  Is there a faster way to do the interleave?  The performance isn't bad.  Originally the job was running 9 hours before I split the largest file (about 180 million rows) into 25 smaller files, but since the interleave is 50% of the total run time, it would be nice to cut it down to no more than the time it takes to read in the data.

 

All of the files, after splitting the largest file into 25 files, are small enough that each can be loaded into a hash table.  I'm not sure that all 28 files could be loaded into hash tables in a single DATA step, but no single file is so large that it can't be loaded into a hash table.  I've loaded up to three files into hash tables in a single DATA step in the testing I've done so far.  However, I am not aware of any method of using hash tables to combine all 230 million rows that has any prospect of being faster than the interleave I'm already doing.  Are there any hash techniques out there that I'm just not aware of?  I need to keep all rows from all 28 SAS datasets.

 

Any ideas on how I might combine these 28 datasets while maintaining sort order faster than a SAS interleave?

 

I have not tried using SPDE for the 28 datasets.  SPDE is not something I've used before.  Would it have any potential at all to make things significantly faster?  SPDS is not available in our environment.

 

Maybe I'm just whining since I've already got the job down from 9 hours to 3, but since we frequently deal with datasets of 250+ million rows, I'd love to know if there are faster techniques for combining large SAS datasets while maintaining sort order.

 

Thanks,

 

Jim

19 REPLIES
novinosrin
Tourmaline | Level 20

How about:

1. Load all of your datasets into hash object 1.

2. Load all of your unique keys into hash object 2.

3. Instantiate a hash iterator (hiter) and iterate through the keys of hash object 2.

4. While iterating, look up hash object 1 for each unique key from the PDV, extract all matching observations, and write them out to the output dataset.

5. Point 1 will require one full pass to load the data, using the dataset names in the DECLARE statement instantiation; after that the hiter can take over.

?
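Roughly this kind of thing, as an untested sketch -- dataset, key, and variable names are placeholders, and whether 230 million wide rows fit in memory is a separate question:

data want;
  if 0 then set ds01-ds28;             /* define all PDV variables for the hash objects */

  /* Hash object 1: every row from every dataset; MULTIDATA keeps duplicate keys */
  declare hash h1 (multidata:'y');
  h1.definekey('common_key');
  h1.definedata(all:'y');
  h1.definedone();

  /* Hash object 2: just the distinct key values, kept in ascending order */
  declare hash h2 (ordered:'a');
  h2.definekey('common_key');
  h2.definedata('common_key');         /* so the iterator returns the key into the PDV */
  h2.definedone();
  declare hiter hi ('h2');

  /* One full pass to load both hash objects */
  do until (done);
    set ds01-ds28 end=done;
    h1.add();
    h2.ref();                          /* adds the key only if it is not already present */
  end;

  /* Iterate the ordered keys; for each key, write out every matching row */
  do while (hi.next() = 0);
    rc = h1.find();
    do while (rc = 0);
      output;
      rc = h1.find_next();
    end;
  end;
  stop;
  drop rc;
run;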

jimbarbour
Meteorite | Level 14

Interesting.  I think the second hash table might not be necessary if all the data can be loaded into the first hash table, although maybe I'm not understanding your idea.

 

If I specify "ordered:a" on the first hash table, all records will be automatically in order.  I then need to do a hash_table1.OUTPUT(dataset:'sasout.output_data') .  I believe this would require two passes through the data:  1.  Load the hash table  2.  Write the hash table.  Still, it would work.  I wonder if it would be faster.  I also wonder if I can really load 230 million records with each 480 columns into a hash table.  I have a big machine available, but that is a lot of data.
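A bare-bones sketch of that idea (names are placeholders; whether it all fits in memory is the open question):

data _null_;
  if 0 then set ds01-ds28;                       /* define the PDV variables for the hash */

  /* ORDERED:'a' keeps entries in ascending key order; MULTIDATA keeps duplicate keys */
  declare hash h1 (multidata:'y', ordered:'a');
  h1.definekey('common_key');
  h1.definedata(all:'y');
  h1.definedone();

  /* Pass 1: load every row from every dataset into the hash */
  do until (done);
    set ds01-ds28 end=done;
    h1.add();
  end;

  /* Pass 2: write the hash contents, already in key order, to the output dataset */
  h1.output(dataset:'sasout.output_data');
  stop;
run;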

 

Jim

novinosrin
Tourmaline | Level 20

Hmm, you are right.  I am not sure if this approach is any faster.  I guess that is something you can test in a dev environment and compare with the traditional interleave.

 

for --"I also wonder if I can really load 230 million records with each 480 columns into a hash table.  I have a big machine available, but that is a lot of data" 

HASHEXP:20 should hold, I think???

 

Btw, my approach with the second table is to zip through an unsorted hash table, as I was overthinking whether the ORDERED option is perhaps expensive.

jimbarbour
Meteorite | Level 14

I don't know for sure, but my thought is that records are added to a hash table in sort order as they are read.  I think the hashing algorithm is such that the hash generated automatically places the record in sort order -- at least that's my understanding.  I'll have to check some of Paul Dorfman's papers and see if I can find out exactly how that works.

 

I've done some hash sort tests (where I load a file into an ordered hash table and then write the file out in sort order from the hash table).  I've compared the time it takes to do a hash sort vs. one of the provided SAS sorts (traditional sort, multi-threaded sort, SQL sort, and tag sort), and hash sort is never as fast as the SAS procedures except maybe on small files.  Still, it might be interesting to try.
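For reference, the SAS-side variants in those comparisons look roughly like this -- dataset and key names are placeholders, not the actual test code:

proc sort data=work.big out=work.srt_threads threads;   /* multi-threaded sort */
  by common_key;
run;

proc sort data=work.big out=work.srt_tag tagsort;       /* tag sort: sorts the keys, then retrieves the rows */
  by common_key;
run;

proc sql;                                                /* "SQL sort" via ORDER BY */
  create table work.srt_sql as
  select *
  from work.big
  order by common_key;
quit;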

 

Jim

PGStats
Opal | Level 21

I'm wondering, is syncsort any good at multi-threading?

PG
jimbarbour
Meteorite | Level 14

A good question in that SYNCSORT is known to be pretty fast.  Alas, I don't have SYNCSORT available to me to run some tests.

 

Jim

SASKiwi
PROC Star

If your datasets contain a lot of character columns, then I would test the performance of the COMPRESS=BINARY dataset option.  If the compression ratio is high, like over 50%, I would expect your processing to speed up due to reduced I/O.
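For instance, applied to the interleave step (dataset and key names are placeholders):

data want (compress=binary);   /* BINARY compression on the output dataset */
  set ds01-ds28;
  by common_key;
run;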

jimbarbour
Meteorite | Level 14

@SASKiwi

 

I've tried both binary and character compression.  Binary definitely creates a smaller file.  I'm not sure if I'm understanding you correctly -- would switching to character compression make things go faster?  That seems counter-intuitive, but I suppose I could try it.

 

Jim

SASKiwi
PROC Star

If BINARY compresses your datasets the most I would stick with it - does it improve performance?

jimbarbour
Meteorite | Level 14

@SASKiwi

 

I believe binary compression improves performance since it's the smallest "footprint", but I haven't run any tests.  I suppose I should run three tests:

  • COMPRESS=NO
  • COMPRESS=CHAR
  • COMPRESS=BINARY

I'll try it when I get a chance although I suspect I'm already using the fastest option, binary.
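Something like this for the three-way test, with FULLSTIMER on so the log captures the timings (dataset and key names are placeholders):

options fullstimer;

%macro compress_test(setting);
  data work.test_&setting (compress=&setting);
    set ds01-ds28;
    by common_key;
  run;
%mend compress_test;

%compress_test(NO)
%compress_test(CHAR)
%compress_test(BINARY)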

 

Jim

Tom
Super User Tom
Super User

I am not sure I understand exactly what your process is.  It sounds like you are just trying to describe code like this:

data want;
  set ds1-ds28 ;
  by key;
run;

Why are you starting from text files instead of datasets?  Can't you just create the datasets once instead of spending 45 minutes each time to convert them from text into datasets?  Most SAS jobs are I/O bound, so if the text files are compressed, that might save time.

data ds1 ;
  infile 'file1.csv.gz' zip gzip dsd truncover firstobs=2;
  ....
run;

Do you have SAS/Connect?  Can you spin up N sessions to each read 28/N of the text files independently?

waitfor _all_ ;
data want ;
  set server1.ds1 server1.ds2 server2.ds1 server2.ds2 ...... ;
  by key;
run;

 

jimbarbour
Meteorite | Level 14

Hi, Tom,

 

Your illustration (set ds1 - ds28; by key; ) is exactly what I'm doing -- after I've read in all the text files.  The files are csv but have been gzipped first.

Why do I keep reading them in?  Well, these are from an external source.  We get a new set every month: four files totaling about 220 to 250 million records, depending on the month.  One file is about a million records, one about 30 to 35 million, one about 10 to 15 million, and the last one, by far the largest, is about 180 to 220 million records.  The job was originally single-threaded and took about 9 hours, including the analytical steps that follow the input and interleave steps.  Eek.

 

I took the largest file and, using multiple threads, read it into 25 separate SAS datasets.  The multi-threaded read of the largest file now takes 30 to 45 minutes instead of 4+ hours, depending on how many subordinate threads I use.  However, I then have to do the interleave, as you described (set ds1 - ds28; by key;).  The interleave takes 90 minutes, which is 2 to 3 times as long as it takes to read in the data.  Maybe that's the best that can be had, but I thought I'd ask if there were a faster way.  Taking a job from a 9-hour run time down to 3 hours is hardly disappointing, but, you know, I'd like to see if there are any other performance gains to be had.

 

 

Jim

 

P.S.  We don't have MP Connect, but multi-threading can still be done through UNIX.
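(Something along these lines, as a simplified sketch with made-up program and file names; the real job differs in the details:)

/* Requires XCMD to be allowed.  Launch several batch SAS sessions in the background,
   each reading one slice of the big file. */
systask command "sas -sysin read_slice.sas -sysparm 1 -log slice1.log" nowait taskname=t1;
systask command "sas -sysin read_slice.sas -sysparm 2 -log slice2.log" nowait taskname=t2;
systask command "sas -sysin read_slice.sas -sysparm 3 -log slice3.log" nowait taskname=t3;

/* Wait for all of the child sessions to finish before doing the interleave */
waitfor _all_ t1 t2 t3;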

 

Tom
Super User Tom
Super User

Get a lot of fast disks?  I/O is usually the bottleneck with SAS.

 

Check whether your system has some large, fast disks reserved for temporary files and try to use those.  I have had some success with using OS commands to copy large gzipped files from regular disks to the SAS WORK disk and then reading them from there.

 

SPDE should help with every step.  You can spread the SAS datasets across multiple disks; make sure they are using different I/O controllers if you can.  You can use COMPRESS=BINARY with the SPDE engine and get much better compression than with the BASE engine.  If nothing else, that should save a lot of I/O.
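Something like this for the library definition -- paths here are placeholders, and ideally each DATAPATH= location sits on a different physical disk/controller:

libname fast spde '/sasdata/spde_meta'
        datapath=('/disk1/spde_data' '/disk2/spde_data' '/disk3/spde_data')
        indexpath=('/disk4/spde_index')
        compress=binary;

/* Then point the interleave (and/or the 28 input datasets) at that library */
data fast.want;
  set fast.ds01 fast.ds02 /* ...through... */ fast.ds28;
  by common_key;
run;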

 

I don't think you can multi-thread reading from a single large GZIP file. So you might speed up the first step by getting the large file delivered as multiple small files.

 

jimbarbour
Meteorite | Level 14

@Tom

 

Interesting.  I'll have to try SPDE, although I haven't used it as yet.  I've read that optimal setup is not trivial.  I wasn't sure that SPDE would improve things for sequential access.

 

The idea of fast disks is interesting, although taking the time to copy the raw data from our "landing zone" to another disk might negate the time savings obtained from using a faster disk.  Still, perhaps the SAS datasets themselves could be written to faster disks even if the raw data is on slower disks.  Perhaps conducting the interleave, the bottleneck, on the fast disks is the answer.

 

As for multiple threads simultaneously reading a single gzipped file, it seems to be working fine.  In fact, it's working fabulously.  Single-threaded was 4 hours; multi-threaded is 45 minutes (or less if I go crazy allocating threads), better than a 5x improvement in run time.  I'm doing a PIPE "gunzip -c exciting_filename.csv" in my FILENAME statement.
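That is, essentially (fileref, file, and variable names are placeholders):

filename bigfile pipe "gunzip -c exciting_filename.csv.gz";

data work.slice01;
  infile bigfile dsd truncover firstobs=2;
  input common_key /* ...the rest of the 480 columns... */;
run;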

 

Jim

