jakarman
Barite | Level 11

Sergio,
Jwillis has used an installation with an Oracle client (schema=). That one offers more advanced options. The Oracle client is a free download from Oracle, but it requires a SAS license for SAS/ACCESS to Oracle. You are obviously using the ODBC method with the Windows client (DSN=).

Within a DBMS (explicit pass-through) there is a common separation by schemas. These look like the libnames in SAS, but you cannot intermix libnames (SAS environment) and schemas (DBMS environment).

The important difference, imo, is the "as select G.*" between the create and the from connection.
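
To make the separation concrete, here is a minimal sketch (the DSN, credentials, and table names are placeholders): the libname and the created table live in the SAS environment, while everything inside the pass-through parentheses is shipped to the DBMS and must use schema-qualified names.

proc sql;
  connect to odbc as db (dsn=xxxx user=xxxxx pw=xxxxxx);
  create table work.out as                  /* SAS side: libname.table */
    select * from connection to db
    (select * from schemaname.tablename);   /* DBMS side: schema.table */
  disconnect from db;
quit;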

---->-- ja karman --<-----
SergioSanchez
Calcite | Level 5

Jaap Karman wrote:

Sergio,
Jwillis has used an installation with an Oracle client (schema=). That one offers more advanced options. The Oracle client is a free download from Oracle, but it requires a SAS license for SAS/ACCESS to Oracle. You are obviously using the ODBC method with the Windows client (DSN=).

Within a DBMS (explicit pass-through) there is a common separation by schemas. These look like the libnames in SAS, but you cannot intermix libnames (SAS environment) and schemas (DBMS environment).

The important difference, imo, is the "as select G.*" between the create and the from connection.

Sorry Jaap, but I don't understand. I run the following code and it works perfectly:

proc sql;
  connect to odbc as aaaaaa (dsn=xxxx user=xxxxx pw=xxxxxx);
  create table test as
    select * from connection to aaaaaa
    (select * from schemaname.tablename);
quit;

After the code runs I can see the dataset in my library. What is the difference between this code and the join I try to execute?

Regards

jakarman
Barite | Level 11

proc sql;
  connect to odbc as aaaaaa (dsn=xxxx user=xxxxx pw=xxxxxx);
  create table test as
    select * from connection to aaaaaa
    (select * from schemaname.tablename);
quit;

---->-- ja karman --<-----
SergioSanchez
Calcite | Level 5

Well, another option: I am downloading the table ordered by the key variable using pass-through. After this I'll download the second table, ordered too.

This way I suppose I don't need the PROC SORT and I can perform a merge to find which rows are in common.

What do you think about this?

Thanks

Kurt_Bremser
Super User

If you can get the sort done "on the fly", that is probably the simplest solution. Then the merge just reads sequentially through the tables and writes the target, so it all depends on the sequential I/O throughput of your SAS server.
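
A minimal sketch of that pipeline (DSN, credentials, schema, and the key variable keyvar are placeholders): Oracle does the sorting through ORDER BY, and the DATA step merge then makes a single sequential pass, with IN= flags marking the rows common to both tables.

proc sql;
  connect to odbc as ora (dsn=xxxx user=xxxxx pw=xxxxxx);
  /* let Oracle sort each table by the join key before the download */
  create table work.t1 as
    select * from connection to ora
    (select * from schemaname.table1 order by keyvar);
  create table work.t2 as
    select * from connection to ora
    (select * from schemaname.table2 order by keyvar);
  disconnect from ora;
quit;

/* sequential merge; keep only the rows present in both tables */
data work.common;
  merge work.t1 (in=a) work.t2 (in=b);
  by keyvar;
  if a and b;
run;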

jakarman
Barite | Level 11

If you have the space available on your local machine and it is fast enough, you are on Kurt's track. Nothing to bother about with unexpected data transports.

It should work fine when those conditions (local performance) are met.

---->-- ja karman --<-----
AhmedAl_Attar
Rhodochrosite | Level 12

Hi Sergio,

This is typically what I do when it comes to debugging and resolving performance issues with queries that involve RDBMS source tables in SAS.

1. Always test the SQL statement(s) within a native RDBMS query tool. This allows you to validate and profile the query, which in turn gives you better insight into the execution path and the true cost of the query (performance impact).

In your case, using Oracle, you are spoiled for choice:

   1.1 Oracle SQL Developer (Free)

   1.2 Toad for Oracle (Commercial)

   1.3 .... (Other tools that I'm not familiar with)

2. Use SQL pass-through. This allows you to use RDBMS-specific syntax that may not be surfaced by the SAS LIBNAME statement options.

3. Change the default BUFFSIZE= option value.

          "BUFFSIZE=

specifies the number of rows to retrieve from an Oracle table or view with each fetch. Using this argument can improve the performance of any query to Oracle.

By setting the value of the BUFFSIZE= argument in your SAS programs, you can find the optimal number of rows for a given query on a given table. The default buffer size is 250 rows per fetch. The value of BUFFSIZE= can be up to 2,147,483,647 rows per fetch, although a practical limit for most applications is less, depending on the available memory."

4. Use the SASTRACE= system option to determine whether an operation is processed by SAS or is passed to the DBMS for processing (a combined sketch of points 3 and 4 follows this list).
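
A quick sketch combining points 3 and 4 (the options are standard, but the connection values are placeholders; note that BUFFSIZE= belongs to the SAS/ACCESS to Oracle engine, while the ODBC engine uses READBUF= for the same purpose):

/* log the SQL that SAS sends to the DBMS */
options sastrace=',,,d' sastraceloc=saslog nostsuffix;

/* Oracle libname with a larger fetch buffer (the default is 250 rows) */
libname ora oracle user=xxxxx password=xxxxx path=xxxx
        schema=schemaname buffsize=5000;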

I hope this helps,

Ahmed

rachel
Calcite | Level 5

Sergio,

You can try a hash join; it runs a lot quicker. If you want, I can post an example.

-raisa


SergioSanchez
Calcite | Level 5

Thanks all for your tips, I'll keep them in mind.

rachel wrote:

Sergio,

You can try a hash join; it runs a lot quicker. If you want, I can post an example.

-raisa

Sure  raisa, please go on

Regards

rachel
Calcite | Level 5

Sergio,

I created two fake data sets: one with 50 million rows, the other with 30 million rows.

Then I merge them.

Remember: in a hash join with large data, place the smaller data set in the hash object.

Here is the code:
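
(A minimal sketch of the hash join rachel describes, assuming a shared key variable keyvar; all data set and variable names are illustrative.)

data work.common;
  /* pull variable definitions from the smaller table without reading any rows */
  if 0 then set work.small;
  if _n_ = 1 then do;
    /* load the smaller table into an in-memory hash keyed on the join variable */
    declare hash h(dataset: 'work.small');
    h.defineKey('keyvar');
    h.defineData(all: 'yes');
    h.defineDone();
  end;
  /* read the larger table sequentially; keep rows whose key is found in the hash */
  set work.large;
  if h.find() = 0;
run;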

SergioSanchez
Calcite | Level 5

rachel wrote:

Sergio,

I created two fake data sets: one with 50 million rows, the other with 30 million rows.

Then I merge them.

Remember: in a hash join with large data, place the smaller data set in the hash object.

Here is the code:

Thanks a lot rachel

SergioSanchez
Calcite | Level 5

Hi all

I get an error when running the hash code

 

ERROR: Hash object added 25165808 items when memory failure occurred.

FATAL: Insufficient memory to execute DATA step program. Aborted during the EXECUTION phase.

ERROR: The SAS System stopped processing this step because of insufficient memory.

My PC info:

CPU :

   Intel(R) Core(TM) i3-2100 CPU @ 3.10GHz

Available Memory : 3242 MB

----------------------------------------------------

Drive List :

Drive C:

Type: Fixed

Total available space:         79 GB

Total size of drive:            232 GB

Thanks

jakarman
Barite | Level 11

Sergio, those specifications are of a low-end computer, not very advanced these days. The clock speed is all right, but that figure has not really increased since 2005 or so.
I will not tease you with server specs. 3 GB is so little that you must be very careful with tools needing memory; you have experienced this now with the hashing.

With the available disk space you should be able to process - merge - files of a size that fits with a 3-fold margin.

---->-- ja karman --<-----
bentleyj1
Quartz | Level 8

I suspect that at least part of your performance problem is that you're using ODBC connectivity. My experience is that it is s-l-o-w for anything more than a couple hundred thousand rows.

      connect to odbc as oracle

Ask around and make sure you don't already have SAS/ACCESS to Oracle. That way you can control which of the datasets is moved to the other environment, depending on which is the smaller dataset.
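
If it turns out you do have SAS/ACCESS to Oracle, the explicit connection would look something like this sketch (credentials and path are placeholders):

proc sql;
  connect to oracle as ora (user=xxxxx password=xxxxx path=xxxx);
  create table work.a as
    select * from connection to ora
    (select * from schemaname.tablename);
  disconnect from ora;
quit;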

It could be that excessive time is spent creating the SAS data set A. Check the log, and if you think so, when you create data set A add a KEEP= or DROP= option next to the WHERE= on the SET statement to ensure that only the fields you need are loaded into A... loading unneeded data is a common mistake that can have a huge performance impact.

data a;
  /* keep= limits the columns read (list only the fields you need); */
  /* where= filters the rows as they are read                       */
  set b (keep=var1 var2 var3 date
         where=(var1 <= date and var2 > date and var3 > date));
run;

I've written a couple of papers on using SAS and Oracle. The first one deals with SAS options that either improve or degrade performance, and I think you'll find a couple of things there that will be useful to you, like how to make the read buffer larger. Each section has a quick tabular summary so that you can find the option(s) that look appropriate for your situation.

http://support.sas.com/resources/papers/proceedings13/072-2013.pdf

Another paper is not so appropriate to this problem but explains how to leverage database and system return codes.  If you eventually need to automate your processes then these are critical for trapping and handling errors.  http://analytics.ncsu.edu/sesug/2012/BB-03.pdf

Good luck!

John

Patrick
Opal | Level 21

You've already got a lot of valuable advice and I'm not adding anything new, but I will just highlight what works best for me.

I have to work from time to time with Oracle tables and volumes like you describe. With such volumes, what you want to do is reduce the volume on the database side before you transfer it to SAS - and as a consequence you normally want to push processing to the database.

What normally works best for me is using explicit pass-through SQL - and developing and testing the pass-through bit "outside" of SAS. I normally use Oracle SQL Developer for this (you don't even need to install it; you can run it directly from a flash drive).

Using Oracle SQL Developer allows you to tweak your code faster - it shows you the cost of queries and graphically shows you the execution path (using "explain").

Once you've got your query working, use it within the pass-through section of SAS PROC SQL.
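
A minimal sketch of that workflow, with the join pushed down to Oracle (DSN, credentials, schema, table, and key names are placeholders):

proc sql;
  connect to odbc as ora (dsn=xxxx user=xxxxx pw=xxxxxx);
  create table work.common as
    select * from connection to ora
    ( /* this inner query is the part you develop and tune in Oracle SQL Developer */
      select g.*
      from schemaname.table1 g
      inner join schemaname.table2 t
        on g.keyvar = t.keyvar
    );
  disconnect from ora;
quit;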
