03-02-2016 11:51 PM
I am running a complicated query with SAS SQL, and the real time is much longer than the CPU times combined. Is there anything I can do to reduce the real time and improve efficiency?
03-03-2016 03:58 AM
03-03-2016 01:22 PM
So the problem I have now is that I need to subset a table of around 5 million observations (call this table A) based on another table B. Table A contains household purchase information: each row represents a household's purchase of one item, and a household generally buys many items (so the household ID variable is repeated across many rows). Table B is just a list of households (around 30,000) with a single variable, hhold_id, that uniquely identifies each household. I would like to select the purchases in table A made by the households listed in table B. I actually used a natural left join to subset table A based on table B, but it took my computer 20 minutes to do this simple subsetting, whereas the CPU time was relatively short. According to this paper (http://support.sas.com/resources/papers/proceedings09/333-2009.pdf), my query is very inefficient, so I tried increasing the buffer size and number and enabling multithreading, but none of these substantially shortened the processing time.
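For a subset like this, an inner join is usually a better fit than a left join, since only the matching rows are wanted. A minimal sketch, assuming both tables sit in WORK and the key is hhold_id (table names are placeholders taken from the post):

```sas
/* Keep only purchases made by households listed in B.        */
/* An inner join returns just the matching rows, so SAS never */
/* builds and then discards non-matching output rows.         */
proc sql;
  create table want as
  select a.*
  from work.a as a
       inner join work.b as b
       on a.hhold_id = b.hhold_id;
quit;
```

An equivalent formulation is `where a.hhold_id in (select hhold_id from work.b)`; the optimizer can treat the two differently, and comparing them with the _method option shows which plan each one gets.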
03-03-2016 02:37 PM
Doesn't sound like a complicated query.
Inner joins are often easier to optimize than outer joins. If you need all households, even those not present in A, you could do a second join afterwards, when the table size will be substantially smaller.
Seeing your log output with the FULLSTIMER and MSGLEVEL=I options set would help.
The most efficient join strategy is a hash join, with an indexed join second (use the _method PROC SQL option to see the strategy chosen); the indexed join requires an index on the join column in A.
How frequently will you (and others?) run this kind of query? If the performance is a real burden, you may need to look over your whole data platform (hardware, storage engine).
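A quick way to see and influence the strategy is to build the index and ask PROC SQL to report its plan. A sketch, again assuming WORK tables named a and b:

```sas
/* Index the join column so the optimizer can choose an indexed join. */
proc sql _method;                       /* _method prints the chosen plan */
  create index hhold_id on work.a(hhold_id);
  create table want as
  select a.*
  from work.a as a, work.b as b
  where a.hhold_id = b.hhold_id;
quit;
```

In the log, sqxjhsh indicates a hash join, sqxjndx an indexed join, and sqxjm a sort-merge join.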
03-03-2016 02:51 PM - edited 03-03-2016 02:53 PM
Initially I only wanted table A, which comes from a complex query joining 7 tables (raw data, 5 GB combined), but later I realized I needed to drop some households based on a criterion. So I tried to include the subsetting criterion in that very first query. However, it turned out that the processing time increased substantially even though the subsetting criterion is relatively simple. Later I decided to do what I described in the last post: build table A, generate the subsetted household list separately, and do an inner join to select the desired households in table A. I was kind of surprised that such a simple inner join query could take that long.
I need to change the household-selection criterion several times for the statistical analysis later. I just do not want to run very time-consuming queries every time I decide to change the criterion.
One related question: is it more efficient to generate many intermediate tables separately and join them later to get the final table, or to write one complicated query containing all of the subqueries that generate the intermediate tables?
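When the criterion changes often, one common pattern is to materialize the expensive join once and keep only the cheap, criterion-dependent step in the loop. A sketch with hypothetical table, variable, and criterion names (work.purchases standing in for the result of the 7-table join):

```sas
proc sql;
  /* Cheap, criterion-dependent step: rebuild the household list   */
  /* each time the selection rule changes.                         */
  create table work.hh_keep as
  select hhold_id
  from work.households
  where total_spend > 100;              /* hypothetical criterion */

  /* Reuse the already-materialized purchase table unchanged.      */
  create table work.subset as
  select p.*
  from work.purchases as p
       inner join work.hh_keep as k
       on p.hhold_id = k.hhold_id;
quit;
```

This way the 7-table, 5 GB join runs once, and only the small list and the final subset are recomputed per criterion.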
03-03-2016 03:25 PM
data want;
  merge smalldata (in=a) bigdata (in=b);
  by hhold_id;
  if a then output;  /* use "if a and b" to drop households with no purchases */
run;
Assuming both data sets are sorted by the household ID, this keeps only the data for households in the small list.
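If sorting the 5-million-row table is itself expensive, a DATA step hash lookup avoids the sort entirely: load the small household list into memory and keep only the matching rows of the big table. A sketch using the names from the thread:

```sas
data want;
  if _n_ = 1 then do;
    declare hash h(dataset: "work.b");  /* load the ~30,000 households */
    h.defineKey("hhold_id");
    h.defineDone();
  end;
  set work.a;                           /* stream the big table once */
  if h.find() = 0;                      /* keep row only if hhold_id is in B */
run;
```

Neither input needs to be sorted, and the 30,000-row list fits comfortably in memory.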
03-03-2016 04:24 AM
Various factors can impact your real-time query performance: SAS/ACCESS libnames, table size, data types, network latency, and the location of the databases and/or your SAS servers.
I'd also take a look at what the SQL optimizer is doing with your query. This can be seen by using the _TREE and _METHOD options on the PROC SQL statement.
There's a fantastic paper by Russ Lavery on this subject. See link below. It's a very long and detailed paper, but if you have the time and inclination it's well worth a read.
03-04-2016 01:22 AM