Re: A join in proc sql takes 5 hours and consumes a lot of disk space

SergioSanchez · Posted 06-03-2014 12:07 PM

Hi all

I have to do a join with datasets that have over 50 millions of rows each one and it´s a nightmare. It takes 5 hours or more in perform the task.

I dont know if there is a better solution, testing with a merge statement takes too much time too because the datasets need to be sorted and
these tasks takes time and I have read that proc sort don´t use an index to perfom the task.

Plus the disk space is reduced drastically.

Could anybody help me please?

Is there a better solution to achieve the result without have to wait for 5 hours?

thanks in advanced

PGStats · Posted 06-03-2014 12:34 PM

Where is the data? What kind of join are you doing? What is the join on, what indexes are available? Is the result of the join much smaller than the joined tables? Are you getting a message from SAS about a join that can't be optimized? We need answers to these questions to optimize the operation.

PG

SergioSanchez · Posted 06-04-2014 03:34 AM

Morning all

First, thanks for the help it´s much appreciated for me.

Well, I´ll try to answer your questions.

The data are in an Oracle Server and I make a copy of the datasets connecting to the server through a libname statement and after

data a;

set b (where = (var1<= date and var2>date and var3>date); where "b" is the dataset on the Oracle server

I do a left join, the target in most of the cases is to obtain the surrogate key of the second table and one or two variables more.Something like this

proc sql;

create table x as

select a.*, b.var1, b.var2, b.var3

from dataset1 as a left join dataset2 as b

on (a.var1 = b.var1);

quit;

There is no index at all in any of the datasets and no message appears in the log about an issue in the performance

Regards

Kurt_Bremser · Posted 06-04-2014 04:01 AM

Try this:

proc sort

data=b /* this is your original oracle data set */

(where = (var1<= date and var2>date and var3>date)

out=dataset1

;

by var1;

run;

proc sort

data=dataset2 (keep=var1 var2 var3)

out=data2x

;

by var1;

run;

data x;

merge

dataset1 (in=a)

data2x

;

by var1;

if a;

run;

Compare this method and the SQL method using options fullstimer;

Also watch the disks while the jobs are running; you may be surprised by the disk usage(s).

I remember when I first came across a piece of code done by a SAS consultant that had > 100 lines. I quickly saw that I could do the same in one create table with ~ 10 lines in PROC SQL, so why bother with all that code? Then I had to wait 5 hours for my SQL to finish, while his code took about 20 minutes to produce the same result. With less than half the disk space.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

RW9 · Posted 06-04-2014 04:27 AM

As a previous post had mentioned why not perform your tasks on the SQL server where your data resides, store the results into a temporary table and then extract the data into SAS? SQL databases should be fully optimized and geared up to performing code on very large datasets, although once you reach a certain point then it becomes less a database and more of a data warehouse with different storage requirements and processes.

LinusH · Posted 06-04-2014 08:01 AM

You are not telling where your "dataset2" is stored. If that one is i Oracle two, consider to move all processing to Oracle (SQL implicit/explicit pass-thru).

If you lookup surrogate keys from dataset2, it sounds like that is a permanent table, which mean you could consider applying an index to to it.

When doing similar work in DI Studio, there Look-up transformation is using data step hash tables. This technique has the benefit that it does not require that the "master" table to be sorted, it just performs a clean table scan.

Perhaps your whole process could be re-configured, but there's too little information at this point to give any suggestion in that direction.

Data never sleeps

SergioSanchez · Posted 06-04-2014 12:37 PM

Sorry guys, my PC have been half to death all day and I couldn´t test your adviced. I'm trying to sent a the query to Oracle server but I recive an error like this

"ERROR: PROC SQL requires any created table to have at least 1 column."

96 options fullstimer;

97 options sastrace=',,,s ,,d, ,,t,' sastraceloc=saslog nostsuffix;

98

99 proc sql;

100 connect to odbc as oracle (USER=aaaaaa PW=XXXXXXXX DSN='pppp');

101 create table result as select * from connection to oracle

102 (select a.var1, a.var2, a.var3, a.var4, a.var5,

102 a.var6, a.var7, a.var8, a.var9, a.var10, b.var1, b.var2

104 from epi as a left join blas b on (a.var1 = b.var1)

105 where a.var1 <= date and a.var2 >= date and 105! a.var3>date);

ERROR: PROC SQL requires any created table to have at least 1 column

LinusH, both datasets are in the work library, I copy it from a DWH in Oracle server.

Thanks for the link Jaap, It's very helpful for me.

One more thing, I haven`t permissions for create, modify or alter tables in the Server so all the tables that I create should be in the local drive

Regards

jwillis · Posted 06-04-2014 12:59 PM

Dear Sergio,

When using pass thru in the past, I see two things that could be the cause of your error message. The "as a" might be an issue. Try " epi a". The other issue may be that there are no rows selected by your where statement. Try testing with just one of the were conditions. Are the values in "var1" and "date" formatted exactly the same?

104 from epi as a left join blas b on (a.var1 = b.var1)

105 where a.var1 <= date and a.var2 >= date and 105! a.var3>date);

ERROR: PROC SQL requires any created table to have at least 1 column

This is code that worked for me in the past. "Doris" is the name of the Oracle database that lived in a UNIX environment.

proc sql feedback inobs=max outobs=max;

drop table dlib.hstclmshdr;

connect to oracle (user="&userd." password="&passd." path='mypath'

schema=doris preserve_comments buffsize=8000);

create table work.claims as

select G.*

from connection to oracle

(select d.*

from doris.inst_claim_header d

where (d.line_of_business = 'HST')

and (d.claim_thru_date between TO_DATE('01/01/2012','MM/DD/YYYY') and

TO_DATE('12/31/2012','MM/DD/YYYY'))

) G

SergioSanchez · Posted 06-05-2014 03:14 AM

Morning all

jwillis your code doesn't work for me, it`s neither an alias issue or were conditions.:smileycry::smileycry:

I'll see if I am able to find a solution.

Thanks

ffurquim · Posted 06-10-2014 04:54 PM

what date are you specifying where you use a.var3>date? If date is current day, do you expect to have future-dated records? do you need to use the date functions/formats to get this to work?

Tom · Posted 06-06-2014 05:54 PM

Do both datasets have multiple records for the same value of var1?

If at least one of them VAR1 uniquely identifies rows then just use a normal data step merge .

So instead of SQL code like your example:

create table x as

select a.*, b.var2, b.var3

from dataset1 as a

left join dataset2 as b

on (a.var1 = b.var1)

;

You would use normal SAS code and write this as:

data x ;

merge dataset1 (in=in1) dataset2 (in=in2 keep=var1 var2 var3);

by var1 ;

if in1 ;

run;

rachel · Posted 06-06-2014 07:21 PM

Sergio,

You can do a hash sort on both data sets, this will reduce processing time, then try the data step merge.

But the hash sort must be done after you pull the data into sas, you can't hash sort in a pass through query.

SergioSanchez · Posted 06-08-2014 11:51 AM

I know it Jaap but it's my working computer and there is nothing I can do for change this machine or add more memory RAM

which probably resolved the memory's problem.

thanks bentleyj1 for the links I'll take a look.

Tom, KurtBremser told me something like you are telling me now, I test it when I have a little time .

rachel, I don't kwow what you mean with you talk about hash sort. Could you explain to me please?.

Regards.

rachel · Posted 06-09-2014 01:41 PM

Sergio,

I was saying you should using a hash method, not proc sort...that will shorten the run time.

SergioSanchez wrote:

I know it Jaap but it's my working computer and there is nothing I can do for change this machine or add more memory RAM

which probably resolved the memory's problem.

thanks bentleyj1 for the links I'll take a look.

Tom, KurtBremser told me something like you are telling me now, I test it when I have a little time .

rachel, I don't kwow what you mean with you talk about hash sort. Could you explain to me please?.

Regards.

rachel · Posted 06-09-2014 01:52 PM

I meant to say sort your two datasets using hash sort, because the hash sort is much faster then the proc sort. instead of proc sort, then try merging with a data step.

data _null_ if _N_ =1 then do;

declare hash h(dataset:"your_data", ordered:'d'); /* 'd ' means descending or you can put 'a' for ascending*/

h.defineKey('your_key');

h.defineData('your_key');

call missing(your_key); /*no quotes'*/

end; /* Create output dataset of sorted data */

rc=h.output(dataset:'out_your_data');

run;

SAS Innovate 2025: Call for Content

Classroom Training Available!