Re: SAS Sql subquery performance problem

deleted_user · Posted 08-24-2008 07:37 PM

Hi, I am having problemms with the use of subqueries in Proc Sql using EG.
The data I want is in table b, the key for table B is in table A. Table B is a large table (20+ mill rows) The key IS indexed
The query to table A to get the right key takes only seconds.
If I query table B directly using the key I get a return in seconds, If I use a subquery (A) to get the key however the return takes an unnaceptable amount of time (20+ minutes). Using in-line or rewriting as a join does not improve performance.
The data returned by both queries is only a handful of rows. The same type of queries in SQLServer return in the expected several seconds.

Any idea what the issue with Proc Sql is?

QUERIES

/*Query to Table A */

PROC SQL;
CREATE TABLE IPJOIN AS
SELECT acct_no
, client_key
, IP_Address
FROM tableA
WHERE acct_no = &acctno
;

/*Query to Table B */

PROC SQL _METHOD;

CREATE TABLE IPBET AS
SELECT client_key
, ip_addr
FROM tableB
WHERE client_key IN (SELECT DISTINCT client_key
FROM IPJOIN)

;

deleted_user · Posted 08-24-2008 07:54 PM

The log for _METHOD for the query above is below if that sheds any light.

NOTE: SQL execution methods chosen are:

sqxcrta
sqxuniq
sqxsrc( OB_SAND.OB_FACT_BETSELECTION )

NOTE: SQL subquery execution methods chosen are:

sqxsubq
sqxuniq
sqxsrc( WORK.IPJOIN )
NOTE: Compressing data set WORK.IPBET increased size by 100.00 percent.
Compressed is 2 pages; un-compressed would require 1 pages.
NOTE: Table WORK.IPBET created, with 6 rows and 2 columns.

24
25 %LET _CLIENTTASKLABEL=;
26 %LET _EGTASKLABEL=;
2 The SAS System 08:54 Monday, August 25, 2008

27 %LET _CLIENTPROJECTNAME=;
28 %LET _SASPROGRAMFILE=;
29
30 ;*';*";*/;quit;
NOTE: PROCEDURE SQL used (Total process time):
real time 17:28.25
cpu time 2:12.49

LinusH · Posted 08-25-2008 05:20 AM

Can't you do a Select distinct on tableA, the do a ordinary inner join? Then the SQL optimizer may use an indexed join strategy, even a hash join if the result set from tableA is small enough.

Regards,
Linus

Data never sleeps

deleted_user · Posted 08-25-2008 05:14 PM

I could, but the query show is the smallest part of what is eventually going to be a much more complicated query. The use of subqueries is one of the best features of sql, what I am trying to understand is the reason for the performance problem, I haven't come across it in other sql implementations. Is it an inherent feature of SAS sql that subqueries perform poorly? My own Db suggest so, but they are SAS coders and have little use for sql.

Report Inappropriate Content · Posted 08-27-2008 12:51 PM

Thanks for the clarification. I think I understand the issue better.

I think your problem is similar to one I ran into and has to do with whether the query is actually being processed on the database server or on your SAS machine.

When you run the query with the actual key value, it runs quickly because the entire query is passed to the database where it is processed and only the relevant records are passed back to SAS which then puts the results into a table.

When you run the query with a subquery that uses an existing SAS dataset, that query can't be entirely passed to the database because of the reference to the SAS dataset. In this case, the database sends back the entire table B (all 20 million rows) and SAS does the processing of subsetting this table with the SAS dataset created in query A. One way you can see this is by watching the temporary file that is being created while the query is running. You'll probably see it grow to some very large size, and then near the end it will be replaced with a smaller file representing the merge of table B with IPJOIN.

Subqueries work efficiently when they can be passed entirely to the database for processing. Whenever possible, I don't recommend using SAS datasets in subqueries to databases unless the database tables are small. On the other hand, I've found that using macro variables in the query doesn't hurt performance.

Can you try this query? It uses the query you used to build IPJOIN as the subquery for IPBET, but doesn't use any SAS datasets. The side benefit is that your programming will be more efficient: one less query to the database and one less SAS dataset created. I have a feeling this will run faster. Let us know how it works.

[pre]
PROC SQL;
CREATE TABLE IPBET AS
SELECT client_key, ip_addr
FROM tableB
WHERE client_key IN (SELECT DISTINCT client_key
FROM tableA
WHERE acct_no = &acctno);
QUIT;
[/pre]

Patrick · Posted 08-27-2008 07:47 PM

Consider using the SASTRACE= System Option to better see what's going on:
http://support.sas.com/onlinedoc/913/getDoc/en/acreldb.hlp/a000433982.htm

There is also a whole section in the SAS Online Doc about optimizing SQL queries:
http://support.sas.com/onlinedoc/913/getDoc/en/acreldb.hlp/a002254657.htm

And if it's "only" a table lookup then consider using the hash object:
http://support.sas.com/onlinedoc/913/getDoc/en/lrcon.hlp/a002585310.htm

HTH
Patrick

deleted_user · Posted 08-27-2008 07:51 PM

No joy unfortunately execution time was the same, using the query below
Is using the pass through ability the answer?

PROC SQL _METHOD;

CREATE TABLE IPBET AS
SELECT DISTINCT client_key
,ip_addr

FROM tableB
WHERE client_key IN (SELECT DISTINCT client_key
FROM tableA
WHERE acct_no = &acctno)
;
QUIT;

The log produced was

15 PROC SQL _METHOD;
16
17 CREATE TABLE IPBET AS
18 SELECT DISTINCT b.client_key
19 , b.ip_addr
20
21 FROm ob_sand.ob_fact_betselection b
22 WHERE b.CLIENT_KEY IN (SELECT client_key
23 FROM ob_sand.ob_dim_client_current
24 WHERE acct_no = &acctno)
25
26
27 ;

NOTE: SQL execution methods chosen are:

sqxcrta
sqxuniq
sqxsrc( OB_SAND.OB_FACT_BETSELECTION(alias = B) )

NOTE: SQL subquery execution methods chosen are:

sqxsubq
sqxsrc( OB_SAND.OB_DIM_CLIENT_CURRENT )
NOTE: Compressing data set WORK.IPBET increased size by 100.00 percent.
Compressed is 2 pages; un-compressed would require 1 pages.
2 The SAS System 09:46 Thursday, August 28, 2008

NOTE: Table WORK.IPBET created, with 2 rows and 2 columns.

28 QUIT;
NOTE: PROCEDURE SQL used (Total process time):
real time 18:52.92
cpu time 2:18.12

29
30 %LET _CLIENTTASKLABEL=;
31 %LET _EGTASKLABEL=;
32 %LET _CLIENTPROJECTNAME=;
33 %LET _SASPROGRAMFILE=;
34
35 ;*';*";*/;quit;run;
36 ODS _ALL_ CLOSE;
37
38
39 QUIT; RUN;

Patrick · Posted 08-27-2008 08:21 PM

Both tables are SAS tables. Do I get that right?
How many observations are in table A?

Have a look at the example code below. This runs (even not indexed) in reasonable time.
What's the difference to your real code?

%let acctno=1;
data tableA;
do client_key=1 to 1000;
acct_no=int(round(ranuni(0)));
output;
end;
run;
data tableB;
do ip_addr=1 to 3*10**6; /* -> 3M obs */
client_key=ceil(ranuni(0)*10000);
output;
end;
run;

PROC SQL _METHOD;
CREATE TABLE IPBET AS
SELECT DISTINCT client_key, ip_addr
FROM tableB
WHERE client_key IN
(SELECT DISTINCT client_key
FROM tableA
WHERE acct_no = &acctno)
;
QUIT; Message was edited by: Patrick

deleted_user · Posted 08-27-2008 11:53 PM

Sorry Patrick, yes both are existing SAS tables (learn somethig new everyday)
I don't know base sas, just sql.

Patrick · Posted 08-28-2008 01:28 AM

The example code I've posted does nothing else than creating two SAS tables tableA and tableB. TableA has 1000 records and 2 variables (acct_no and client_key), tableB 3million (with variables client_key and ip_addr).

After that I'm doing nothing else than applying the SQL code you've provided - and everything runs in reasonable time on my laptop (Windows XP, SAS9.1.3).

The SQL creates a table with around 150000 rows and needs around 3.5 seconds CPU (most of it will be I/O writing the rows). Even by increasing tableA to 1M rows my job takes less than 30 seconds CPU

You're talking about 20 minutes. Looking at the log it's:
real time 18:52.92
cpu time 2:18.12

This huge gap between cpu time (2.18 min CPU still seems too long) and real time looks as if this SAS server is quite busy (SAS EG is only a client, the code is processed on the server). The long CPU time could also be due to I/O. How big is the result table - and how fast is the disc (most certainly the server disc) with you SAS work space?

As you see: I assume it's less a SAS SQL issue than an infrastructure/installation issue.

What you could do is to run the code I've posted and check how much time it takes. If your close to your real job then you know that it's about the SAS Server and you should contact your SAS admin, if the CPU time is close to what I get here on my laptop then I would look where the datasets you're accessing are stored (may be on another server with a slow connection to the SAS server).
If both cases above don't provide an explanation then you should check what's different between the test code and the real code. May be one of your tables is a bit bigger than you think?

HTH
Patrick

P.S: Pass-through SQL is not a solution as this is used to pass SQL code from SAS directly to a DBMS (eg. ORACLE) and have it executed there.

deleted_user · Posted 08-28-2008 01:59 AM

Thanks for all the replies so far guys.
The SAS server is an old box (single cpu) and we are long over due for a hardware upgrade, not sure of the disk speed.

Table A is 441,267 obs 115 variables
Table B is 22,983,750 obs 57 variables.

My typical return from the query is 100 rows or so.

Your code with a 146,000 row return ran in the times below:

NOTE: DATA statement used (Total process time):
real time 3.64 seconds
cpu time 3.57 seconds

NOTE: PROCEDURE SQL used (Total process time):
real time 5.92 seconds
cpu time 5.85 seconds

Patrick · Posted 08-28-2008 03:20 PM

O.K., it's now pretty obvious:
I adopted the example code to the size of the tables you’re fighting with (number of columns and rows) and get now similar results.
NOTE: PROCEDURE SQL used (Total process time):
real time 22:22.71
cpu time 3:51.10

The reason is I/0! It just takes a long time to read tableB.

I’m using an external USB drive (slow) and the data step for writing tableB takes already a LONG time.
NOTE: The data set OB_SAND.TABLEB has 23000000 observations and 57 variables.
NOTE: DATA statement used (Total process time):
real time 25:11.86
cpu time 4:49.20

So: Everything o.k. with the SQL, it’s not a processing problem. I think this case is a good example for when data should be stored in a DBMS and not in SAS files.
It seems that SAS has to scan the whole SAS file in order to pick the few relevant fields (running the same code with fewer variables in the table decreases time immensely).

Conclusion
A faster server disk and motherboard will decrease the time needed but best would be to store the data in a DBMS (eg. Teradata, SQL server, ORACLE or also SAS SPD Server) as a DBMS can accesses data a bit more intelligent.

HTH
Patrick

P.S: And here the code with the long run-time:

libname OB_SAND 'e:\tests';
%let acctno=1;
data OB_SAND.tableA;
drop i;
array VarsA {113} $ 20;
do i=1 to dim(VarsA);
VarsA{i}='Just some content';
end;
do client_key=1 to 441267;
acct_no=int(round(ranuni(0)));
output;
end;
run;
data OB_SAND.tableB;
drop i;
array VarsB {55} $ 20;
do i=1 to dim(VarsB);
VarsB{i}='Just some content';
end;
do ip_addr=1 to 23*10**6;
client_key=ceil(ranuni(0)*5*10**10);
output;
end;
run;

PROC SQL _METHOD;
CREATE TABLE work.IPBET AS
SELECT DISTINCT client_key, ip_addr
FROM OB_SAND.tableB
WHERE client_key IN
(SELECT DISTINCT client_key
FROM OB_SAND.tableA
WHERE acct_no = &acctno)
;
QUIT;

deleted_user · Posted 08-28-2008 05:41 PM

Ok, case of right tool for the job.
Thanks for your time and assistance Patrick.

Patrick · Posted 09-02-2008 06:48 AM

I just joined a project with a similar problem: A lot of observations and variables and performance problems (also a Star model).

The solution was: Horizontal and vertical fragmentation.

So: Split the biggest SAS tables in several tables and create views. For your case take a view with as little as possible fields (or access the appropriate SAS table directly).

I'm aware that this is a design change as the ETL jobs have to be adopted - but it might be worth it if the problems persist.

Regards
Patrick

deleted_user · Posted 09-02-2008 05:21 PM

Thanks Patrick, that is exactly what I ended up asking for.

Registration is open

SAS Training: Just a Click Away