Re: SAS sql efficiency

bochen · Posted 03-02-2016 11:51 PM

Hi all,

I am running a complicated query with sas sql, and the real time is much longer than the cpu time combined. Is there anything I can do to reduce the real time and increase efficieny?

Thanks,

LinusH · Posted 03-03-2016 03:58 AM

Every query is kinda unique.
Your data model, indexes, query logic, I/O bottlenecks etc - almost anything can have impact on your query performance.
There are lots of papers about query optimization, search support.Sas.com.
If you need specific guidance you need to describe your situation in much more detail.

Data never sleeps

bochen · Posted 03-03-2016 01:22 PM

So the problem I have now is that I need to subset a table with around 5 millions of observations, (called this table A) based on another table B. The table A contains household purchase information: one row of data reprensents a household purchase of an item and a household generally buys many items (so household id varaible is repeated in many rows). Table B is just a list of households (around 30,000) that I care about and there is only one variable hhold_id that uniquely identify a single household. I would like to select those purchases in table A from the households listed in table B. I actually used a natural left join to subset table A based on Table B, but it took my computer 20 mins to do this simple subsetting wheras the cpu time is relatively short. According to this article (http://support.sas.com/resources/papers/proceedings09/333-2009.pdf), my query is very inefficient and I tried to increase buffer size and number, enable multi threading, but none of these substantially shorten the processing time.

Thanks,

LinusH · Posted 03-03-2016 02:37 PM

Doesn't sound like a complicated query.

Inner joins is often easier to optimize than outer joins. If you need all households, even those not within A, you could do a second join with this, when table size will be substantially smaller.

Seeing your fullstimer msglevel=i log output would help.

The most efficient join strategy is hash join, and indexed join is second (use _method PROC SQL option to see strategy chosen). Both requiring an index on the join column in A.

How frequent will you (and others?) do this kind of queries? If the performance is really a burden you may need to look oer your whole data platform (HW, storage engine).

Data never sleeps

bochen · Posted 03-03-2016 02:51 PM

Initially I only would like to have A which is from a complex query generated by joining 7 tables (raw data, 5GB combined), but later I realized I need to get rid of some households based on a critrion. So I tried to include the subset criterion in the very first query. However, it turned out that the processing time substantially increases even though the subsetting criterion is relatively simple. Later I decdied to do what I have described in the last post, get table A and generate the subsetted households separately and do a inner join to select the desicred housholds in table A. I was kind of surpised that such a simple inner join query could take that long time.

I need to change the criterion of selecting the households for several times for the statistical analysis later. I just do not want to run very time-consuming queries every time I decide to change the criterion.

One related question: is it more efficient to generate many intermediate tables separately and join them later to get the final table or write a complicated query including all of the subqueries that generate the intermediate tables?

Thanks,

RW9 · Posted 03-03-2016 03:25 PM

Try:
data want;

merge smalldata (in=a) bigdata (in=b);

by household;

if a then output;

run;

Assuming the data is sorted this should keep data on in the list of small data

moorsd · Posted 03-03-2016 04:24 AM

Various factors can impact your real-time query performance, using SAS/ACCESS libnames, table size, data types, network latency, location of the databases and/or your SAS servers.

I'd also take a look at the SQL optimizer to see what your query is doing. This can been looked at by using the _TREE and _METHOD on the proc sql statement.

There's a fantasic paper by Russ Lavery on this subject. See link below. It's a very long and detailed paper, but if you have the time and inclination it's well worth a read.

http://www2.sas.com/proceedings/sugi30/101-30.pdf

Cheers

David

RW9 · Posted 03-03-2016 04:37 AM

Simplify the process then. Don't do as much in the query. Avoid subqeuries. Use datastep processing rather than SQL.

bochen · Posted 03-03-2016 01:26 PM

subqeuries slows the entire query down? Or is it faster to create temp tables and use these tables in the main query?

LinusH · Posted 03-04-2016 01:22 AM

Not generally.

5gb doesn't sound that much.
Again seeing the resource consumption will help.
And it's quite difficult to give specific advice, to do so we need to be involved in detail in your work.
And still, I think that you need to reconsider your environment.

Data never sleeps

rohitdante16 · Posted 12-20-2019 05:50 AM

In general i would say instead of using a single we could accomplish the same in following steps:

1. Take dump of Table A using proc sql into SAS dataset.

2. Do the same for Table B

Load the Table B onto a proc format and apply that to table A.

This should optimize the code.

Most often the slowness of a query would depends on the connection driver either ODBC/Oracle and so on.

So the throught process should ideally be to use a method to take a dump and use the other to optimize results, so i would say use the SQL code to pull the data and not join and use SAS formatting to join the data. if it helps create use proc ds2 for parallel record processing.

Registration is open

SAS Training: Just a Click Away