SergioSanchez
Calcite | Level 5

rachel wrote:

I meant to say: sort your two datasets using a hash sort, because the hash sort is much faster than PROC SORT. Instead of PROC SORT, then try merging with a data step.

data _null_;
   /* add a LENGTH statement for your_key first if it is a character variable */
   if _N_ = 1 then do;
      declare hash h(dataset:"your_data", ordered:'d'); /* 'd' means descending, or 'a' for ascending */
      h.defineKey('your_key');
      h.defineData('your_key');
      h.defineDone();
      call missing(your_key); /* no quotes */
   end;
   /* create output dataset of sorted data */
   rc = h.output(dataset:'out_your_data');
run;

Thanks rachel, that looks very nice, but what about the error in the previous hash?

ERROR: Hash object added 25165808 items when memory failure occurred.

FATAL: Insufficient memory to execute DATA step program. Aborted during the EXECUTION phase.

ERROR: The SAS System stopped processing this step because of insufficient memory.

Is there any option to avoid this error without having to change the PC?

rachel
Calcite | Level 5

Sergio,

Yes, that's a problem of not having enough disk space.

You can try assigning a libref that's not the WORK directory, but some other location where you have more disk space. Other than that I don't know what you could do. I will try to think of workarounds.
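For example, a minimal sketch of that idea (the path and libref names are made up):

libname bigdisk "D:\sas_spill";   /* hypothetical path with more free space than the WORK disk */

data bigdisk.out_your_data;       /* write large results here instead of into WORK */
   set work.your_data;
run;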

Sorry about that.

SergioSanchez
Calcite | Level 5

rachel wrote:

Sergio,

Yes, that's a problem of not having enough disk space.

You can try assigning a libref that's not the WORK directory, but some other location where you have more disk space. Other than that I don't know what you could do. I will try to think of workarounds.

Sorry about that.

OK rachel, it doesn't matter (thanks again for the help), but I think you are wrong: in this case it is a memory issue, not a problem with the disk space.

SergioSanchez
Calcite | Level 5

 

184 data _null_ ;

185 length Ckey $10;

186 if _N_=1 then do ;

187 declare hash test(dataset:"origin", ordered:'d');

188 test.definekey ('CKey');

189 test.definedata ('CKey');

190 test.definedone();

191 call missing ('CKey');

192 end;

193 rc = test.output (dataset:'ouput');

194 run;

ERROR: Hash object added 18874352 items when memory failure occurred.

FATAL: Insufficient memory to execute DATA step program. Aborted during the EXECUTION phase.

ERROR: The SAS System stopped processing this step because of insufficient memory.

NOTE: DATA statement used (Total process time):

real time 1:47.61

user cpu time 36.50 seconds

system cpu time 21.30 seconds

Memory 1107591k

OS Memory 1168676k

Timestamp 10/06/2014 11:07:43


I'm afraid I'm going to cry....

LinusH
Tourmaline | Level 20

It feels like you need a sponsor at this point, someone who can help you:

  • get permission to create (at least temporary) tables in Oracle
  • get the necessary resources to do your processing in SAS, i.e. computing power
  • perhaps hire a consultant for a day or two to go through your case on site
Data never sleeps
rachel
Calcite | Level 5

Sergio, I made up a test case for you. It's a small data set, so you can actually see what is happening. Split up the data into two sets of 25 million rows each and see if you can hash join 25 million by 25 million. Here is the sample code for the example. Please let us know if this worked.

DATA A;

DO I=1 TO 20;

X=(I) ;

OUTPUT;

end;

Drop i;

RUN;

DATA b;

DO I=1 TO 40 BY 2;

Y=(I) ;

OUTPUT;

end;

Drop i;

RUN;

DATA FIRST_HALF_A ;

SET A (firstobs=1 obs=8);

RUN;

DATA FIRST_HALF_B ;

SET B (firstobs=1 obs=8);

RUN;

DATA SECOND_HALF_A ;

SET A (firstobs=9 obs=20);

RUN;

DATA SECOND_HALF_B ;

SET B (firstobs=9 obs=20);

RUN;

DATA FIRST_HALF_AB;

IF 0 THEN SET FIRST_HALF_A   ;

IF _N_ = 1 THEN DO;

DECLARE HASH HS (DATASET:'FIRST_HALF_A ');

HS.DEFINEKEY ('X');

HS.DEFINEDATA (all:'Y');

HS.DEFINEDONE ();

END;

SET FIRST_HALF_B;

RC=HS.FIND(KEY:Y);

IF RC = 0 THEN OUTPUT ;

DROP RC;

/*IF RC^=0 THEN  X = ' ';*/

RUN;

DATA SECOND_HALF_AB;

IF 0 THEN SET SECOND_HALF_A   ;

IF _N_ = 1 THEN DO;

DECLARE HASH HS (DATASET:'SECOND_HALF_A ');

HS.DEFINEKEY ('X');

HS.DEFINEDATA (all:'Y');

HS.DEFINEDONE ();

END;

SET SECOND_HALF_B;

RC=HS.FIND(KEY:Y);

IF RC = 0 THEN OUTPUT ;

DROP RC;

/*IF RC^=0 THEN  X = ' ';*/

RUN;

DATA FIRSTHALF_A_SECONDHALF_B;

IF 0 THEN SET SECOND_HALF_A   ;

IF _N_ = 1 THEN DO;

DECLARE HASH HS (DATASET:'FIRST_HALF_A ');

HS.DEFINEKEY ('X');

HS.DEFINEDATA (all:'Y');

HS.DEFINEDONE ();

END;

SET SECOND_HALF_B;

RC=HS.FIND(KEY:Y);

IF RC = 0 THEN OUTPUT ;

DROP RC;

/*IF RC^=0 THEN  X = ' ';*/

RUN;

PROC SQL;

CREATE TABLE ONE_TWO AS

SELECT   *  FROM FIRST_HALF_AB

UNION

SELECT * FROM SECOND_HALF_AB ;

QUIT;
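One note on the example: keys from either half of A can match rows in either half of B, so a complete split join also needs the remaining pairing (SECOND_HALF_A looked up against FIRST_HALF_B), and the final UNION would then combine all four result tables. A sketch in the same style (the new dataset names are my own):

DATA SECONDHALF_A_FIRSTHALF_B;
IF 0 THEN SET SECOND_HALF_A;
IF _N_ = 1 THEN DO;
DECLARE HASH HS (DATASET:'SECOND_HALF_A');
HS.DEFINEKEY ('X');
HS.DEFINEDATA (all:'Y');
HS.DEFINEDONE ();
END;
SET FIRST_HALF_B;
RC=HS.FIND(KEY:Y);
IF RC = 0 THEN OUTPUT;
DROP RC;
RUN;

PROC SQL;
CREATE TABLE ALL_AB AS
SELECT * FROM FIRST_HALF_AB
UNION
SELECT * FROM SECOND_HALF_AB
UNION
SELECT * FROM FIRSTHALF_A_SECONDHALF_B
UNION
SELECT * FROM SECONDHALF_A_FIRSTHALF_B;
QUIT;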

JBerry
Quartz | Level 8

I've had this exact same issue, and my best advice would be to process as much as possible on the Oracle server.

Looks like you have 2 steps. In Step 1 you're doing

data a;

set b(where=);

For this portion, I recommend making a macro and then passing it the where clause elements:

%macro ExecuteSQL(param1=,paramN=);

proc sql;

connect to odbc as aaaaaa(dsn=xxxx USER=xxxxx PW=xxxxxx);

create table test as select * from connection to aaaaaa

(select * from schemaname.tablename

where var1<=&param1 and var2>&param1 and var3>&param1

and var4=&paramN);

quit;

%mend ExecuteSQL;


Now instead of your datastep, you can use

%ExecuteSQL(param1=20140610);


Unfortunately, this might end up being a huge pain to get right the first time, with the syntax of dates and all that - I remember Oracle is sort of a pain. But if your data step takes a long time, this could help.


The second part of your problem: it appears you now have two datasets in SAS (dataset1 as a, dataset2 as b). From my experience, now that they're both in SAS you're in big trouble, because the join is going to take a really long time, and the only answer I've been able to find is to upgrade your system. I've tried indexing, I've tried sorting and merging, and I've tried all sorts of things. You might get a little performance gain, but nothing is going to take your 5 hours down to 5 minutes the way an RDBMS could. If you used SAS pass-through SQL to (1) create a table in Oracle, (2) insert the data from SAS for both dataset1 and dataset2, and (3) pass the query and join to Oracle and return the results to SAS, it still might beat 5 hours!
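A rough sketch of that three-step pattern, assuming SAS/ACCESS Interface to Oracle is licensed, you have CREATE TABLE rights in the schema, and all names, credentials, and paths below are placeholders:

/* hypothetical libref to Oracle */
libname ora oracle user=xxxxx password=xxxxxx path=xxxx schema=schemaname;

/* (1)+(2) create Oracle tables and load the SAS data into them (bulk loading helps) */
data ora.dataset1_ora (bulkload=yes);
   set work.dataset1;
run;

data ora.dataset2_ora (bulkload=yes);
   set work.dataset2;
run;

/* (3) run the join inside Oracle via explicit pass-through and return the result to SAS */
proc sql;
   connect to oracle (user=xxxxx password=xxxxxx path=xxxx);
   create table work.joined as
   select * from connection to oracle
      (select a.*, b.var2, b.var3
         from schemaname.dataset1_ora a
         left join schemaname.dataset2_ora b
           on a.var1 = b.var1);
   disconnect from oracle;
quit;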

jakarman
Barite | Level 11

@JBerry, the real problem is understanding what is happening when you are coding some data processing. Oracle can be the worst approach when the data analytics request is not a good fit for it.

My experience is that when something that bad has been done inside Oracle, SAS processing was easily beating it, although many factors play a role. SAS is made for analytics and Oracle for OLTP with ACID guarantees. Those are different worlds when you push them to their limits.

---->-- ja karman --<-----
jakarman
Barite | Level 11

When you have an advanced join, PROC SQL can cause a lot of overhead, the "Cartesian product" being famous for that.
Show us your source / describe your data and what you want to achieve. There are a lot more technical solutions than just SQL.

---->-- ja karman --<-----
skillman
SAS Employee

Add indexes on the join condition columns and rerun the query on a subset of the data to test performance. When you are happy with the performance increase on the subset, run the join on all of the data (with the indexes in place).
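A minimal sketch of that approach; dataset and variable names are placeholders:

/* create a simple index on the join key of the larger table */
proc datasets library=work nolist;
   modify dataset2;
   index create var1;
quit;

/* test the join on a subset first (here the first 100,000 rows of dataset1) */
proc sql;
   create table work.test_join as
   select a.*, b.var2, b.var3
   from dataset1 (obs=100000) as a
        left join dataset2 as b
          on a.var1 = b.var1;
quit;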

Doc_Duke
Rhodochrosite | Level 12

Some generic things to look at

SAS(R) Data Integration Studio 4.2: User's Guide

(The link refers to DI, but the recommendations are largely for Base SAS.)

If your data are in a remote database (Oracle, etc.), SAS will try to run the join on the database. However, some things that are allowed in SAS SQL don't exist in one (or more) of the remote system types and will force SAS to bring all of the data into the SAS workspace. Pass-through SQL gives you more control with remote databases, but requires more knowledge on your part.
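For illustration only (the libref, table, column, and format names are made up): a WHERE clause that relies on a user-written SAS format typically cannot be translated to the database's SQL, so SAS ends up pulling the whole table across the network and filtering locally:

proc sql;
   create table work.local_subset as
   select *
   from oralib.big_table                 /* oralib = a SAS/ACCESS libref (hypothetical) */
   where put(region_cd, $mygrp.) = 'A';  /* user-written format usually blocks pushdown */
quit;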

stat_sas
Ammonite | Level 13

If the join involves remerging summary statistics onto the original data, that will also slow down the joining process.
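A toy example of that situation (table and column names are made up); PROC SQL writes a note to the log that it is remerging summary statistics, which means an extra pass over the data:

proc sql;
   create table with_group_mean as
   select id, x, mean(x) as grp_mean   /* detail column x plus a summary function forces a remerge */
   from have
   group by id;
quit;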

jakarman
Barite | Level 11

With SQL, I/O processing is designed to be random, as needed for OLTP. That is why an OLTP DBMS is not the best approach for analytics, and a lot of other approaches have been developed.
The SAS dataset is already a classic structure, but it was designed for sequential processing in an ordered way.
If you access a small portion of the dataset and use it many times with slightly different subsets, indexing will be a great help.

With an external DBMS, the communication line will often be the bottleneck. Let the data be processed as close to the source as possible (federation).
The SQL pass-through should happen without you seeing it (implicit SQL pass-through); the SASTRACE option can show you what is actually being sent to the DBMS.
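A minimal sketch of turning that tracing on so the generated DBMS SQL shows up in the SAS log:

/* log the SQL that SAS/ACCESS generates and sends to the DBMS */
options sastrace=',,,d' sastraceloc=saslog nostsuffix;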

You can code explicit pass-through when you need special, unique SQL language features; SQL has many dialects, and not all of them are ANSI SQL.

It is all about knowing your data and how it will be processed technically. Your machine will not choose the optimal performance approach for you.
That is your responsibility, as the human. What are all those details you have to deal with?

---->-- ja karman --<-----
Kurt_Bremser
Super User

Once you get to REAL data sets (50 million rows is in this range), you need to take care of your storage infrastructure.

a) make it FAST, using high-rpm disks or SSDs for the work area. If you are concerned about failsafes, use RAID1 (simple mirrors). If being failsafe is not a big thing, use striping

b) separate your UTILLOC physically from the work/data location, and make sure these disks do nothing else. UTILLOC is where the intermediate file is stored during PROC SORT

Then look at this:

a) use a combination of PROC SORT and data steps to do the merge (see the sketch after this list). PROC SQL is a resource hog of the nth order when it comes to large joins. Real-life experience here has shown that SQL gets progressively slower when several processes are running, much more so than the sort/merge steps, up to a point where the server becomes unresponsive, which is very rare on an AIX system(!).

b) indexes usually don't help (much), because in addition to the data, SAS needs to read the index, causing even more I/O. Indexes are very good if you need to access a small subset of data.

c) identify which sort criteria will be needed most, and have your data sets already sorted correctly when you store them. That way users (including yourself) do not need to sort and can read the big datasets sequentially.

d) when you do a data step merge, you need space for (just) the source files and the target files. With SQL, you also need space for the utility file, which will grow to a size equal to all the source files together. During the sorts preceding the merge, you only need extra space for the file being sorted; the temp file will be in UTILLOC.
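A minimal sketch of point (a), assuming placeholder names dataset1/dataset2, join key var1, and at most one row per key in dataset2:

/* sort both tables once on the join key */
proc sort data=dataset1; by var1; run;
proc sort data=dataset2; by var1; run;

/* data step merge with left-join semantics */
data merged;
   merge dataset1 (in=a)
         dataset2 (keep=var1 var2 var3);
   by var1;
   if a;   /* keep all rows from dataset1 */
run;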

jakarman
Barite | Level 11

I hope you are allowed to define tables in Oracle.
The DBA often forbids this for users, as DDL (Data Definition Language) is his area. If you extended this policy to Excel usage, nobody would be allowed to define a spreadsheet.

The common usage with an RDBMS is DML (Data Manipulation Language), the parts of SQL that give you access to tables.

     data a;

     set b (where=(var1 <= date and var2 > date and var3 > date));

     run;

b is the dataset on Oracle. The selection will run and the result will be stored in a SAS table.

     proc sql;

create table x as

select a.*, b.var1, b.var2, b.var3

from dataset1 as a left join dataset2 as b

on (a.var1 = b.var1);

quit;

b is dataset2 on Oracle and a is dataset1 in SAS. The join will run on tables with two different storage types.

This logic is only possible with SAS SQL, but the disadvantage is a lot of overhead, as the only way to solve it is to make a copy of the Oracle table in SAS. That will happen automatically behind the scenes.

What can you do for performance?

1. Having the DDL open in Oracle

    Define table a within Oracle, not in SAS. You can use explicit pass-through when implicit pass-through is still causing the data to be copied.

     Use the SASTRACE option to analyze what is happening.
     Define table x as you wish, but decide on which side, Oracle or SAS, it should live.

2. NOT having the DDL open in Oracle

   Copy the table into the SAS environment and make the selection/join as smart as you can.
   As some table lookup appears to be needed, you could also think about using hashing or SAS formats (see the sketch after this list).
   Your SQL sample is really simple and looks like it could be done in one pass, without a join at all.
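A minimal sketch of the "SAS format as lookup" idea; all names are hypothetical, and it assumes var1/var2 are character and var1 values are unique in dataset2:

/* build a character format from the smaller table */
data fmt;
   set dataset2 (keep=var1 var2);
   retain fmtname 'lkup' type 'c';
   rename var1=start var2=label;
run;

proc format cntlin=fmt;
run;

/* the lookup then becomes a PUT in one sequential pass, no join needed */
data x;
   set dataset1;
   var2 = put(var1, $lkup.);
run;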

SAS(R) 9.3 SQL Procedure User's Guide sqlconstdatetime

SAS/ACCESS(R) 9.3 for Relational Databases: Reference, Second Edition sastrace

SAS/ACCESS(R) 9.3 for Relational Databases: Reference, Second Edition Sql pass through specifics Oracle

SAS/ACCESS(R) 9.3 for Relational Databases: Reference, Second Edition bulk loading Oracle (do not forget Oracle performance)

---->-- ja karman --<-----
