DATA Step, Macro, Functions and more

how to run faster if it reads several billions rows

Reply
Super Contributor
Posts: 345

how to run faster if it reads several billions rows

[ Edited ]
proc sql;
create table  lib1.want as
         select *
            from have
			where va1=2
           ;
quit;

will use datastep faster than proc sql?

It is estimated that it will run two hours to get the new dataset I want based on the code above. So it is hard to try different code. I would like to get advice from you on how to run it efficiently and faster. Thank you 

Super User
Posts: 7,762

Re: how to run faster if it reads several billions rows

[ Edited ]

Since that is a sequential read through the dataset without much processing, you won't gain anything by using a data step. You're simply bound by the sheer amount of data that has to be read.

Depending on the structure of have, you might gain something (or even a lot) if you use the compress=yes option when you store it, and do similar for dataset want.

---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
Regular Contributor
Posts: 216

Re: how to run faster if it reads several billions rows

Hi,

 

If you the following can be applied to you, then your query should run faster:

- Your under lying storage system, allows for 50-75 MB/sec sustained I/O throughput

- Use SPDE/SPDS for your Output Library (lib1) to speed the writing of your output table

http://support.sas.com/rnd/scalability/spde/index.html

- Use Index for your Va1 variable to speed the reading of your subset/filtered data

 

Hope this helps,

Ahmed

Super User
Posts: 7,762

Re: how to run faster if it reads several billions rows

As an addendum to what @AhmedAl_Attar said, if you can throw hardware at the problem, consider to set up an array of disks and span a SPDE library over those. For SPDE to really improve your throughput, you need to remove the bottleneck caused by a single disk having to carry all your storage traffic.

---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
Trusted Advisor
Posts: 1,554

Re: how to run faster if it reads several billions rows

Posted in reply to KurtBremser

Assuming you are going to read the file several times, each time with different VA1 value,

and the file is mostly static, then do the next preparing steps once:

 

- use option compress=yes 

- sort the file by VA1 or by variables used for WHERE selection and

   create the output with indexes of those variables.

   Pay attention - indexes needs storage

- If applicable use SPDE/SPDS for the sorted output

 

Thus, using advices of @KurtBremser and of @AhmedAl_Attar

 

The code should look like:

libname input '.....';
libname output SPDE '.......';

proc sort data=input.have
                 out=output.want (index=VA1 ....);  /* add expected variables used with WHERE */
         by  VA1 ..... ;  /* same variables defined for INDEX */
run;

 

 

 

 

Regular Contributor
Posts: 216

Re: how to run faster if it reads several billions rows

One more potential way to speed your query, would be

- if you have SAS 9.4, you may want to try the Threaded Read feature of Proc DS2, that's available in BASE SAS.

 

Here is an implementation example

 

options threads; /* Tells SAS to use threaded processing */
%let have_ds = have;
%let filter= va1=2;

proc ds2;
	thread newton/overwrite=yes;
		/* Some debugging variables */
		dcl double count thisThread;
		drop count thisThread;

		method run();
			SET {select * from &have_ds where &filter}; /* Ability to use FedSql query within DS2 */

			/* Assign/Initialize debugging vars */
			thisThread=_threadid_;
			count+1;
		end;

		method term();
			put '**Thread' _threadid_ 'processed'  count 'rows:';
		end;
	endthread;

	data lib1.want/overwrite=yes;
		dcl thread newton frac; /* Declare an Instance of the newton thread */

		method run();
			set from frac threads=4; /* <--- You Change the number of threads to control the threaded reads */
		end;
	enddata;
run;
quit;

Please note, according to the Proc DS2 documentation, "If one computation thread can keep up with the I/O thread, then that single thread is used for all computation."

 

Ahmed

 

 

Ask a Question
Discussion stats
  • 5 replies
  • 163 views
  • 0 likes
  • 4 in conversation