Hi all,
I am running a complicated query with sas sql, and the real time is much longer than the cpu time combined. Is there anything I can do to reduce the real time and increase efficieny?
Thanks,
So the problem I have now is that I need to subset a table with around 5 millions of observations, (called this table A) based on another table B. The table A contains household purchase information: one row of data reprensents a household purchase of an item and a household generally buys many items (so household id varaible is repeated in many rows). Table B is just a list of households (around 30,000) that I care about and there is only one variable hhold_id that uniquely identify a single household. I would like to select those purchases in table A from the households listed in table B. I actually used a natural left join to subset table A based on Table B, but it took my computer 20 mins to do this simple subsetting wheras the cpu time is relatively short. According to this article (http://support.sas.com/resources/papers/proceedings09/333-2009.pdf), my query is very inefficient and I tried to increase buffer size and number, enable multi threading, but none of these substantially shorten the processing time.
Thanks,
Doesn't sound like a complicated query.
Inner joins is often easier to optimize than outer joins. If you need all households, even those not within A, you could do a second join with this, when table size will be substantially smaller.
Seeing your fullstimer msglevel=i log output would help.
The most efficient join strategy is hash join, and indexed join is second (use _method PROC SQL option to see strategy chosen). Both requiring an index on the join column in A.
How frequent will you (and others?) do this kind of queries? If the performance is really a burden you may need to look oer your whole data platform (HW, storage engine).
Initially I only would like to have A which is from a complex query generated by joining 7 tables (raw data, 5GB combined), but later I realized I need to get rid of some households based on a critrion. So I tried to include the subset criterion in the very first query. However, it turned out that the processing time substantially increases even though the subsetting criterion is relatively simple. Later I decdied to do what I have described in the last post, get table A and generate the subsetted households separately and do a inner join to select the desicred housholds in table A. I was kind of surpised that such a simple inner join query could take that long time.
I need to change the criterion of selecting the households for several times for the statistical analysis later. I just do not want to run very time-consuming queries every time I decide to change the criterion.
One related question: is it more efficient to generate many intermediate tables separately and join them later to get the final table or write a complicated query including all of the subqueries that generate the intermediate tables?
Thanks,
Try:
data want;
merge smalldata (in=a) bigdata (in=b);
by household;
if a then output;
run;
Assuming the data is sorted this should keep data on in the list of small data
Various factors can impact your real-time query performance, using SAS/ACCESS libnames, table size, data types, network latency, location of the databases and/or your SAS servers.
I'd also take a look at the SQL optimizer to see what your query is doing. This can been looked at by using the _TREE and _METHOD on the proc sql statement.
There's a fantasic paper by Russ Lavery on this subject. See link below. It's a very long and detailed paper, but if you have the time and inclination it's well worth a read.
http://www2.sas.com/proceedings/sugi30/101-30.pdf
Cheers
David
Simplify the process then. Don't do as much in the query. Avoid subqeuries. Use datastep processing rather than SQL.
In general i would say instead of using a single we could accomplish the same in following steps:
1. Take dump of Table A using proc sql into SAS dataset.
2. Do the same for Table B
Load the Table B onto a proc format and apply that to table A.
This should optimize the code.
Most often the slowness of a query would depends on the connection driver either ODBC/Oracle and so on.
So the throught process should ideally be to use a method to take a dump and use the other to optimize results, so i would say use the SQL code to pull the data and not join and use SAS formatting to join the data. if it helps create use proc ds2 for parallel record processing.
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.