Re: Sql Insert vs Data Step Efficiency Question

rileyd · Posted 12-12-2014 11:47 AM

Hey,

I have a Data step like the following:

Data Table3;
  Set Table1
      Table2;
Run;

I replaced it using a Sql Insert step. I decided not to use Proc Append because it would require me to use the force option. The Sql Insert option lets me default some fields to 0 that are not in Table2 but are in Table1. So my insert looks something like:

Proc Sql;
  Insert Into Table1
  Select Field1,
         Field2,
         0 as Field3 /* Defaulted field. */
  From Table2
  ;
Quit;

I would have expected the Sql Insert to be more efficient because I'm inserting only the records from Table2 into Table1, whereas the Data step reads all the records in Table1 and Table2 to create Table3. But when I checked the log, the CPU time for the Data step was 38.61 seconds and the CPU time for the Sql Insert was 2:00.35.

Just curious to know why the Sql Insert appears to be so much less efficient.

Thanks!

-Andrew

blom0344 · Posted 12-13-2014 11:54 AM

What is the CPU time of:

proc sql;
  create table3 as
  select t.* from table1 t
  union
  select v.* from table2 v;
quit;

since this closely resembles building a table from scratch, as the data step does.

PGStats · Posted 12-13-2014 02:33 PM

Good suggestion, but that should be:

proc sql;
  create table table3 as
  select * from table1
  union all corresponding
  select *, 0 as Field3 from table2;
quit;

PG

blom0344 · Posted 12-13-2014 03:55 PM

Yes, you are right. My idea was to point to the use of set operators. I wasn't sure SAS supports union all, so I stuck with union.

Cynthia_sas · Posted 12-13-2014 12:31 PM

Hi:

In the Programming 3 class, we discuss these kinds of efficiencies and illustrate the proper way to do benchmarking. Here's the description of the Programming 3 class: SAS Training in the U.S. -- SAS Programming 3: Advanced Techniques and Efficiencies

cynthia

rileyd · Posted 12-15-2014 09:10 AM

Thanks Cynthia. I have the Programming 3 book so I'll look into doing some benchmarking as outlined in it. I guess I was a little surprised in general that the Sql Insert wasn't less CPU intensive. In the past when I've replaced similar Data steps with Proc Append the CPU difference (improvement) is typically noticeable and significant (2-3 times faster). I assumed with the Sql Insert I would see similar results so when the CPU actually increased I was surprised.

Kurt_Bremser · Posted 12-15-2014 10:05 AM

What were the real times of the data step and sql solutions?

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

rileyd · Posted 12-15-2014 10:45 AM

The real time increased as well, although I generally don't pay much attention to real time. It's not a reliable measure when it comes to determining the efficiency of a program.

Kurt_Bremser · Posted 12-15-2014 10:56 AM

For me, efficiency means the time spent to achieve a result. 100 CPU seconds in 100 seconds real time is therefore more efficient than 5 seconds CPU time in 10 minutes real time.

Granted, real time depends very much on what else is going on in a certain system at a certain time, so you need more than one run to come to a valid conclusion.

SAS SQL is often a notorious I/O resource hog (which reflects mainly in real time), that's why I asked the question.

Watch which files (work directory, utility location) are created during the steps and to what size they grow.
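
One way to collect these numbers is the FULLSTIMER system option, which makes the log report real time, CPU time, and memory for each step. What follows is only a minimal sketch, assuming the table and field names from the original post; Table1_copy is a hypothetical scratch table, used so that Table1 itself is unchanged between runs.

```sas
options fullstimer;   /* log real time, CPU time, and memory for every step */

/* Test 1: rebuild the combined table with a data step. */
data Table3;
  set Table1 Table2;
run;

/* Test 2: insert into a scratch copy (hypothetical name) of Table1. */
data Table1_copy;
  set Table1;
run;

proc sql;
  /* Assumes Table1 has exactly these three columns in this order. */
  insert into Table1_copy
  select Field1, Field2, 0 as Field3
  from Table2;
quit;

options nofullstimer; /* restore the shorter log statistics */
```

Repeating both tests several times, and in both orders, gives a better picture than a single run.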


LinusH · Posted 12-16-2014 10:56 AM

PROC APPEND should be the most efficient technique.

Perhaps you could assign 0 to field3 when creating Table2?
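
If Table2 itself cannot be changed, a data step view might be one way to supply the default at read time instead. This is a rough sketch only, assuming Field3 is numeric in Table1; Table2_v is a hypothetical view name.

```sas
/* Hypothetical view: reads Table2 and adds the defaulted field on the fly. */
data Table2_v / view=Table2_v;
  set Table2;
  Field3 = 0;   /* assumed numeric, to match Field3 in Table1 */
run;

/* Append the view's rows to the base table. */
proc append base=Table1 data=Table2_v;
run;
```

The view does not store a second copy of Table2; the rows are generated when PROC APPEND reads them.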

Data never sleeps

rileyd · Posted 01-16-2015 04:10 PM

That's not an option in this situation. Thanks for the suggestion though Linus.

Howles · Posted 12-17-2014 04:32 PM

What I don't know is how the SQL statement is optimized by PROC SQL.

Maybe the SELECT is evaluated and the result set is spooled somewhere. Then the rows are batch-inserted into the target table. That seems to be the original poster's theory.

Maybe the SELECT is evaluated and as each row is generated it triggers a single-row insertion operation. That's my hunch.

The results will of course be the same either way, but the incurrence of overhead could differ.

ballardw · Posted 12-17-2014 04:50 PM

A possibly obnoxious confounder in some of this is that SAS will attempt to keep some of the data in memory. So sometimes a second test involving the same data set(s) may run quicker just because the data isn't being read from disk.

Kurt_Bremser · Posted 12-18-2014 02:30 AM

Any operating system that deserves to be called one will do caching on its own, so the second operation on the same data will usually be faster. A thorough test needs to clean the system cache out first and run the compared steps in reverse order to allow an educated guess about efficiency.


PGStats · Posted 12-17-2014 06:02 PM

I suspect that the default UNDO_POLICY=REQUIRED option forces SAS to do the inserts one at a time. Setting UNDO_POLICY=NONE may improve performance significantly.
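
A minimal sketch of that change, assuming the table and field names from the original post (whether it actually helps in this case has not been tested here):

```sas
/* UNDO_POLICY=NONE tells PROC SQL not to undo changes if the statement fails. */
proc sql undo_policy=none;
  insert into Table1
  select Field1,
         Field2,
         0 as Field3   /* defaulted field */
  from Table2;
quit;
```

The trade-off is that if the insert fails partway through, the rows already added stay in Table1.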

PG
