BookmarkSubscribeRSS Feed
Patrick
Opal | Level 21

@jimbarbour

Do you have to pre-sort your source data before the merge or is the data already sorted in source. IF it's already sorted then it would be worth to try and read and merge all the data in a single data step as this would save a full read/write cycle.

 

Another option: Even if you don't have SAS Connect you could  still read all the gzipped sources in parallel using a job run in batch per source. If you don't have a scheduler then systask() together with the waitfor statement will allow you to control the whole process and dependencies from a single master job.

http://support.sas.com/documentation/cdl/en/hostunx/63053/HTML/default/viewer.htm#p0w8zwo1dyssdfn1mj...

 

And last but not least: Using the SPDE engine remains worth trying. 

jimbarbour
Meteorite | Level 14

@Patrick.

 

I've been doing just what you suggest:  Using a SYSTASK with a NOWAIT parameter to launch multiple subordinate threads followed by a WAITFOR _ALL_.  Generally, it works very well, but be warned that you do not want NOWORKINIT in your cfg file.  Ask me how I know that.  🙂

 

The data is sorted when we get it, but I don't think combining the 28 datasets and reading them all in one DATA step will save time. 

 

Here's why:

Single threaded, the job takes 4 hours to read the largest data set, then 1.5 hours to interleave with the three other datasets.  Doing the combining in the same DATA step as the read would eliminate the 1.5 hours, but it would still take at least 4 hours to read in the largest data set.  The largest data set has to be read in single threaded in order for the merge to work correctly.  

 

With my multi-threaded approach, I can get the largest data set read in 45 minutes (some times 30 if I allocate more threads).  The interleave still takes 1.5 hours.  So, we're talking about 2 to 2.5 hours with the multi-threaded approach with separate read and merge phases vs. a minimum of 4 hours with the combined read/merge in a single DATA step.

 

SPDE remains intriguing.

 

Jim

 

 

Patrick
Opal | Level 21

@jimbarbour

It looks to me that you've done already a lot of the good stuff that's available to you. One of the things you could still try is to test your I/O relevant settings and see if you can optimize them for your use case. Not sure how much you've done already and if it's even worth for you to spend the time for it. It could be interesting though.

http://support.sas.com/kb/51/660.html 

http://support.sas.com/kb/46/954.html 

 

s_lassen
Meteorite | Level 14

One way to speed up things may be to do the reading and the interleaving in one step. In other words, instead of reading the files first, create the input steps as data step views, and then set the views:

data v1/view=v1;
  infile file1 dsd delimiter=';';
  input...
/* etc. */

data v2/view=v2;
  infile file2 dsd delimiter=';';
  input...
/* etc */

data interleaved;
  set v1 v2 v3 v4 v5 v6...;
  by num_key;
run;
 
emrancaan
Obsidian | Level 7

If you have some kind of Relational Database. I was confronted with exactly same issue. CSVs  ReadIns  was taking enormous amount of time 2 hours. we imported the data to Teradata warehouse and then imported data to SAS that was very quick . Significant improving was made and new import time from Teradata was between 5 to 10 min.

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 19 replies
  • 1446 views
  • 7 likes
  • 8 in conversation