
A join in proc sql takes 5 hours and consumes a lot of disk space

SergioSanchez — Tue, 03 Jun 2014 16:07:46 GMT

Hi all

I have to do a join with datasets that have over 50 million rows each, and it's a nightmare. It takes 5 hours or more to perform the task.

I don't know if there is a better solution. Testing with a merge statement takes too much time too, because the datasets need to be sorted and
these tasks take time, and I have read that proc sort doesn't use an index to perform the task.

Plus, the free disk space drops drastically while the join runs.

Could anybody help me please?

Is there a better way to achieve the result without having to wait 5 hours?

Thanks in advance

Re: A join in proc sql takes 5 hours and consumes a lot of disk space

PGStats — Tue, 03 Jun 2014 16:34:55 GMT

Where is the data? What kind of join are you doing? What is the join on, what indexes are available? Is the result of the join much smaller than the joined tables? Are you getting a message from SAS about a join that can't be optimized? We need answers to these questions to optimize the operation.

Re: A join in proc sql takes 5 hours and consumes a lot of disk space

jakarman — Tue, 03 Jun 2014 17:18:40 GMT

When you have an advanced join, PROC SQL can cause a lot of overhead, the "Cartesian product" being famous for that.
Show your source / describe your data and what you want to achieve. There are a lot more technical solutions than just SQL.

Re: A join in proc sql takes 5 hours and consumes a lot of disk space

skillman — Tue, 03 Jun 2014 19:41:56 GMT

Add indexes to the join conditions and rerun the query on a subset of the data to test performance. When you are happy with the performance increases on the subset of data, run the join on all of the data (with indexes).
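A minimal sketch of that test, assuming the two tables are WORK.DATASET1 and WORK.DATASET2 and the join key is VAR1 (placeholder names, not taken from the real data). MSGLEVEL=I asks SAS to write a note to the log when an index is used, so you can check whether the optimizer actually picks it up; it may well decide not to.

```sas
/* Create a simple index on the join key of the lookup table */
proc datasets library=work nolist;
  modify dataset2;
  index create var1;
quit;

/* FULLSTIMER reports detailed timings; MSGLEVEL=I reports index usage */
options fullstimer msglevel=i;

proc sql;
  /* Test on a subset: only the first 1,000,000 rows of dataset1 */
  create table x_test as
  select a.*, b.var2, b.var3
  from dataset1(obs=1000000) as a
       left join dataset2 as b
       on a.var1 = b.var1;
quit;
```

If the timings on the subset look good, remove the OBS= option and rerun on the full table.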

Re: A join in proc sql takes 5 hours and consumes a lot of disk space

Doc_Duke — Tue, 03 Jun 2014 19:42:02 GMT

Some generic things to look at

SAS(R) Data Integration Studio 4.2: User's Guide

(The link refers to DI, but the recommendations are largely for Base SAS.)

If your data are in a remote database (Oracle, etc.), SAS will try to run the join on the database. However, some things that are allowed in SAS SQL don't exist in one (or more) of the remote system types and will force SAS to bring all of the data to the SAS workspace. Pass-thru SQL gives you more control with remote databases, but requires more knowledge on your part.

Re: A join in proc sql takes 5 hours and consumes a lot of disk space

stat_sas — Tue, 03 Jun 2014 21:59:02 GMT

If the join involves remerging summary statistics with the original data, that will also slow down the joining process.

Re: A join in proc sql takes 5 hours and consumes a lot of disk space

jakarman — Wed, 04 Jun 2014 05:58:05 GMT

With SQL, I/O processing is designed to be random, as suits OLTP. That is why an OLTP DBMS is not the best approach for analytics, and why a lot of other approaches have been developed.
The SAS dataset is a classic format, but it was designed for sequential processing in an ordered way.
If you access only a small portion of the dataset, and do so many times with slightly different subsets, indexing will be a great help.

With an external DBMS, the communication line will often be the bottleneck. Keep the processing as close to the data source as possible (federation).
Implicit SQL pass-through happens without you seeing it. The SASTRACE option can show you what is actually going on.

You can code explicit pass-through when you need SQL language features specific to the DBMS; SQL has many dialects, not all of them ANSI SQL.

It is all about knowing your data and how it will be processed technically. Your machine will not choose the optimal approach for performance.
That is your responsibility as the human. What are all the details you have to deal with?
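As an illustration of the tracing mentioned above, here is a small sketch. It assumes an ODBC libname called ORA and two Oracle tables EPI and BLAS joined on VAR1; the DSN, user, password and column names are placeholders. With SASTRACE=',,,d' the SQL statements that SAS sends to the DBMS are written to the log, so you can see whether the join was passed down or whether SAS pulled the tables across and joined them locally.

```sas
/* Write the SQL sent to the DBMS into the SAS log */
options sastrace=',,,d' sastraceloc=saslog nostsuffix;

/* Hypothetical connection details */
libname ora odbc dsn=mydsn user=myuser password=mypw;

proc sql;
  /* Both tables live in the same DBMS library, so SAS can try */
  /* to pass the whole join down (implicit pass-through)       */
  create table work.x as
  select a.*, b.var2, b.var3
  from ora.epi as a
       left join ora.blas as b
       on a.var1 = b.var1;
quit;

/* Switch tracing off again */
options sastrace=off;
```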

Re: A join in proc sql takes 5 hours and consumes a lot of disk space

Kurt_Bremser — Wed, 04 Jun 2014 07:06:19 GMT

Once you get to REAL data sets (50 million rows is in this range), you need to take care of your storage infrastructure.

a) make it FAST, using high-rpm disks or SSDs for the work area. If you are concerned about failsafes, use RAID1 (simple mirrors). If being failsafe is not a big thing, use striping

b) separate your UTILLOC physically from the work/data location, and make sure these disks do nothing else. UTILLOC is where the intermediate file is stored during PROC SORT

Then look at this:

a) use a combination of proc sort and data steps to do the merge. PROC SQL is a resource hog of the nth order when it comes to large joins. Real life experience here has shown that SQL gets progressively slower when several processes are running, much more than the sort/merge steps. Up to a point where the server becomes unresponsive, which is very rare with an AIX system(!).

b) indexes usually don't help (much), because in addition to the data, SAS needs to read the index, causing even more I/O. Indexes are very good if you need to access a small subset of data.

c) identify which sort criteria will be needed most, and have your data sets already sorted correctly when you store them. That way users (including yourself) do not need to sort and can read the big datasets sequentially.

d) when you do a data step merge, you need space for (just) the source files and the target files. With SQL, you also need space for the utility file, which will grow to a size equal to that of all the source files together. During the sorts preceding the merge, you only need extra space for the file being sorted; the temp file will be in UTILLOC.
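To see where WORK and UTILLOC currently point on your installation, something like the following should do. Note that UTILLOC can only be set when SAS starts (in the configuration file or on the command line), so changing it is a job for whoever maintains the SAS installation.

```sas
/* Print the current WORK and UTILLOC locations to the log */
proc options option=(work utilloc);
run;
```

If both point to the same disk, the sort utility files and the WORK datasets compete for the same I/O.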

Re: A join in proc sql takes 5 hours and consumes a lot of disk space

SergioSanchez — Wed, 04 Jun 2014 07:34:14 GMT

Morning all

First, thanks for the help; it's much appreciated.

Well, I'll try to answer your questions.

The data are on an Oracle server, and I make a copy of the datasets by connecting to the server through a libname statement and then running:

data a;

set b (where = (var1<= date and var2>date and var3>date)); /* "b" is the dataset on the Oracle server */

run;

I do a left join; the goal in most cases is to obtain the surrogate key of the second table and one or two more variables. Something like this:

proc sql;

create table x as

select a.*, b.var1, b.var2, b.var3

from dataset1 as a left join dataset2 as b

on (a.var1 = b.var1);

quit;

There are no indexes on any of the datasets, and no message about a performance issue appears in the log.

Regards

Re: A join in proc sql takes 5 hours and consumes a lot of disk space

Kurt_Bremser — Wed, 04 Jun 2014 08:01:09 GMT

Try this:

proc sort

data=b /* this is your original oracle data set */

(where = (var1<= date and var2>date and var3>date))

out=dataset1

;

by var1;

run;

proc sort

data=dataset2 (keep=var1 var2 var3)

out=data2x

;

by var1;

run;

data x;

merge

dataset1 (in=a)

data2x

;

by var1;

if a;

run;

Compare this method and the SQL method using options fullstimer;

Also watch the disks while the jobs are running; you may be surprised by the disk usage(s).

I remember when I first came across a piece of code done by a SAS consultant that had > 100 lines. I quickly saw that I could do the same in one create table with ~ 10 lines in PROC SQL, so why bother with all that code? Then I had to wait 5 hours for my SQL to finish, while his code took about 20 minutes to produce the same result. With less than half the disk space.

Re: A join in proc sql takes 5 hours and consumes a lot of disk space

RW9 — Wed, 04 Jun 2014 08:27:36 GMT

As a previous post mentioned, why not perform your tasks on the SQL server where your data resides, store the results in a temporary table and then extract the data into SAS? SQL databases should be fully optimized for running code on very large datasets, although once you reach a certain point it becomes less a database and more a data warehouse, with different storage requirements and processes.

Re: A join in proc sql takes 5 hours and consumes a lot of disk space

jakarman — Wed, 04 Jun 2014 11:38:32 GMT

I hope you are allowed to define tables in Oracle.
The DBA often forbids this for users, as DDL (Data Definition Language) is his area. If you extended this to Excel usage, nobody would be allowed to define a spreadsheet.

The common usage with an RDBMS is DML (Data Manipulation Language), the parts of SQL that give you access to tables.

data a;

set b (where = (var1<= date and var2>date and var3>date));

b is the dataset on Oracle. The selection runs there and the result is stored as a SAS table.

proc sql;

create table x as

select a.*, b.var1, b.var2, b.var3

from dataset1 as a left join dataset2 as b

on (a.var1 = b.var1);

quit;

Here b is dataset2 on Oracle and a is dataset1 in SAS. The join runs on tables with different storage types.

This logic is only possible with SAS SQL, but the disadvantage is a lot of overhead, as the only way to resolve the join is to make a copy of the Oracle table in SAS. That happens automatically behind the scenes.

What can you do for performance?

1. Having DDL open to you in Oracle

Define the table a within Oracle, not SAS. You could use explicit pass-through when implicit pass-through still causes the data to be copied.

Use the SASTRACE option to analyze what is happening.
Define the table x as you wish, but decide on which side, Oracle or SAS, it should be.

2. NOT having DDL open to you in Oracle

   Copy the table into the SAS environment and make the selection/join as smart as you can.
   As this looks like a table lookup, you could also think about using hashing or SAS formats.
   Your SQL sample is really simple and looks like it could be done in one pass, without a join.

SAS(R) 9.3 SQL Procedure User's Guide sqlconstdatetime

SAS/ACCESS(R) 9.3 for Relational Databases: Reference, Second Edition sastrace

SAS/ACCESS(R) 9.3 for Relational Databases: Reference, Second Edition Sql pass through specifics Oracle

SAS/ACCESS(R) 9.3 for Relational Databases: Reference, Second Edition bulk loading Oracle (do not forget Oracle performance)

Re: A join in proc sql takes 5 hours and consumes a lot of disk space

LinusH — Wed, 04 Jun 2014 12:01:58 GMT

You are not telling us where your "dataset2" is stored. If that one is in Oracle too, consider moving all processing to Oracle (implicit/explicit SQL pass-thru).

If you look up surrogate keys from dataset2, it sounds like that is a permanent table, which means you could consider applying an index to it.

When doing similar work in DI Studio, the Lookup transformation uses data step hash tables. This technique has the benefit that it does not require the "master" table to be sorted; it just performs a clean table scan.
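A hand-written version of such a hash lookup could look roughly like this. It is only a sketch under a few assumptions: the names DATASET1, DATASET2, VAR1, VAR2 and VAR3 are the placeholders used earlier in the thread, VAR1 is unique in DATASET2 (as a surrogate key lookup suggests), DATASET1 does not already contain VAR2 or VAR3, and the lookup table fits in memory.

```sas
data x;
  /* Define var2 and var3 in the PDV without reading any rows */
  if 0 then set dataset2(keep=var1 var2 var3);

  /* Load the lookup table into memory once, on the first iteration */
  if _n_ = 1 then do;
    declare hash h(dataset: 'dataset2(keep=var1 var2 var3)');
    h.defineKey('var1');
    h.defineData('var2', 'var3');
    h.defineDone();
  end;

  /* Read the big table sequentially; no sort is needed */
  set dataset1;

  /* Left-join behaviour: keep every row, blank the lookup */
  /* values when the key is not found                      */
  if h.find() ne 0 then call missing(var2, var3);
run;
```

With tens of millions of rows in the lookup table, memory is the limiting factor, so it is worth keeping only the key and the one or two variables that are really needed.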

Perhaps your whole process could be re-configured, but there's too little information at this point to give any suggestion in that direction.

Re: A join in proc sql takes 5 hours and consumes a lot of disk space

SergioSanchez — Wed, 04 Jun 2014 16:37:35 GMT

Sorry guys, my PC has been half dead all day and I couldn't test your advice. I'm trying to send the query to the Oracle server, but I receive an error like this:

"ERROR: PROC SQL requires any created table to have at least 1 column."

96 options fullstimer;

97 options sastrace=',,,s ,,d, ,,t,' sastraceloc=saslog nostsuffix;

99 proc sql;

100 connect to odbc as oracle (USER=aaaaaa PW=XXXXXXXX DSN='pppp');

101 create table result as select * from connection to oracle

102 (select a.var1, a.var2, a.var3, a.var4, a.var5,

102 a.var6, a.var7, a.var8, a.var9, a.var10, b.var1, b.var2

104 from epi as a left join blas b on (a.var1 = b.var1)

105 where a.var1 <= date and a.var2 >= date and 105! a.var3>date);

ERROR: PROC SQL requires any created table to have at least 1 column

LinusH, both datasets are in the work library; I copied them from a DWH on the Oracle server.

Thanks for the link Jaap, It's very helpful for me.

One more thing: I don't have permission to create, modify or alter tables on the server, so all the tables that I create have to be on the local drive.

Regards

Re: A join in proc sql takes 5 hours and consumes a lot of disk space

jwillis — Wed, 04 Jun 2014 16:59:12 GMT

Dear Sergio,

Having used pass-thru in the past, I see two things that could be the cause of your error message. The "as a" might be an issue. Try "epi a". The other issue may be that no rows are selected by your where statement. Try testing with just one of the where conditions. Are the values in "var1" and "date" formatted exactly the same?

104 from epi as a left join blas b on (a.var1 = b.var1)

105 where a.var1 <= date and a.var2 >= date and 105! a.var3>date);

ERROR: PROC SQL requires any created table to have at least 1 column

This is code that worked for me in the past. "Doris" is the name of the Oracle database that lived in a UNIX environment.

proc sql feedback inobs=max outobs=max;

drop table dlib.hstclmshdr;

connect to oracle (user="&userd." password="&passd." path='mypath'

schema=doris preserve_comments buffsize=8000);

create table work.claims as

select G.*

from connection to oracle

(select d.*

from doris.inst_claim_header d

where (d.line_of_business = 'HST')

and (d.claim_thru_date between TO_DATE('01/01/2012','MM/DD/YYYY') and

TO_DATE('12/31/2012','MM/DD/YYYY'))

) G;

disconnect from oracle;

quit;

Re: A join in proc sql takes 5 hours and consumes a lot of disk space

SergioSanchez — Thu, 05 Jun 2014 07:14:37 GMT

Morning all

jwillis, your code doesn't work for me; it's neither an alias issue nor the where conditions.

I'll see if I am able to find a solution.

Thanks

Re: A join in proc sql takes 5 hours and consumes a lot of disk space

jakarman — Thu, 05 Jun 2014 07:34:33 GMT

Sergio,
Jwillis used an installation with an Oracle client (schema=). That one offers more advanced options. The Oracle client is free from Oracle, but it requires a SAS license for SAS/ACCESS to Oracle. You are evidently using the ODBC method with the Windows client (DSN=).

Within a DBMS (explicit pass-through) there is a common separation by schemas. These look like the libnames in SAS.
You cannot intermix libnames (SAS environment) and schemas (DBMS environment).

The important difference imo is the "as select G.*" between the create and the from connection.

Re: A join in proc sql takes 5 hours and consumes a lot of disk space

SergioSanchez — Thu, 05 Jun 2014 08:40:21 GMT

Jaap Karman wrote:

Within a DBMS (explicit pass-through) there is a common separation by schemas. These look like the libnames in SAS.
You cannot intermix libnames (SAS environment) and schemas (DBMS environment).

The important difference imo is the "as select G.*" between the create and the from connection.

Sorry Jaap, but I don't understand. I ran the following code and it works perfectly:

proc sql;

connect to odbc as aaaaaa(dsn=xxxx USER=xxxxx PW=xxxxxx);

create table test as select * from connection to aaaaaa

(select * from schemaname.tablename);

quit;

After the code runs I can see the dataset in my library. What is the difference between this code and the join I am trying to execute?

Regards

Re: A join in proc sql takes 5 hours and consumes a lot of disk space

jakarman — Thu, 05 Jun 2014 09:21:17 GMT

proc sql;

connect to odbc as aaaaaa(dsn=xxxx USER=xxxxx PW=xxxxxx);

create table test as select * from connection to aaaaaa

(select * from schemaname.tablename);

quit;

Re: A join in proc sql takes 5 hours and consumes a lot of disk space

SergioSanchez — Thu, 05 Jun 2014 10:36:51 GMT

Well, another option: I am downloading the table ordered by the key variable using pass-through. After this I'll download the second table, ordered too.

This way I suppose I don't need the proc sort and I can perform a merge to find which rows are in common.

What do you think about this?

Thanks
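For reference, the idea in this last post could be sketched roughly as follows. The connection details reuse the placeholders from the earlier working example, and SCHEMANAME.TABLE1, SCHEMANAME.TABLE2 and the column names are hypothetical. Two caveats apply: Oracle puts NULL keys last in an ascending sort by default while SAS expects missing values first (hence NULLS FIRST), and for character keys Oracle's collation may differ from the SAS sort order, in which case the MERGE stops with a "BY variables are not properly sorted" error.

```sas
proc sql;
  connect to odbc as ora (dsn=xxxx user=xxxxx pw=xxxxxx);

  /* Let Oracle do the sorting while the data is extracted */
  create table work.dataset1 as
  select * from connection to ora
    (select * from schemaname.table1
     order by var1 nulls first);

  /* Only bring across the key and the variables that are needed */
  create table work.dataset2 as
  select * from connection to ora
    (select var1, var2, var3 from schemaname.table2
     order by var1 nulls first);

  disconnect from ora;
quit;

data x;
  merge dataset1 (in=a)
        dataset2 (in=b);
  by var1;
  if a;   /* left join; use "if a and b;" for common rows only */
run;
```

The sketch assumes TABLE1 does not itself carry columns named VAR2 or VAR3; otherwise the values from the second table would overwrite them in the merge.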