Hi,
We are suffering from performance issue in querying a dataset with more than 30 millions rows. It takes almost 30 minutes to select the necessary data from the dataset. Any suggestion how can we improve the performance?
Thanks.
30 millions rows is not huge.
Show us the log, with option fullstimer turned on.
Do you use a where clause, if statements to subset the data set?
Is the data set sorted? Indexed? compressed?
Buy better hardware?
Without the code you are currently using it is hardly possible to suggest something useful.
Please provide details:
An order of magnitude DOES make a difference.
Depending on the observation size and therefore the resulting file size, we could make educated guesses about the SAN bandwidth.
From your log
real time 24:43.62 user cpu time 4:23.93 system cpu time 1:55.64
you have ~7.5 minutes CPU time vs. ~25 minutes real time, so you are quite clearly I/O bound.
Is your source dataset ILIN.ACMVPF compressed? If not, consider using COMPRESS=YES on datasets that contain character variables of considerable length; also test COMPRESS=BINARY for mainly numeric/short character datasets.
PS saw that you already use compression; please run a PROC CONTENTS and tell us the reported file size.
I note that your query does not return any rows. How does the performance vary when rows are returned? How many rows are likely to be returned in a typical query on the 500m row table? If a typical query returns less than 10% of the rows than indexes on the query variables will help. That will have to be traded off against the overhead of maintaining the indexes when updating the source table.
SAS reads over 300k rows a second, that's not bad.
If the observation size is 484928084 (is it really??) that's 1.5 TB a second. This is unlikely. Can you check the values again please?.
In any case, this seems to be a textbook example of where SPDE should be used.
SPDE is more efficient when using large tables and indexes.
Run something like this:
libname SPEEDY spde "%sysfunc(pathname(ILIN))" partsize=1T compress=binary;
proc copy in=INLIN out=SPEEDY;
select ACMVPF;
run;
proc datasets lib=SPEEDY noprint;
modify ACMVPF;
index create TRANDATE;
quit;
The run time of when querying SPEEDY.ACMVPF should drop to just seconds.
Use any libname you want instead of SPEEDY.
I seriously doubt an observation size of 484928084 (almost 500 MB).
Please post the output of PROC CONTENTS.
PS we want to know the observation size, not the observation number.
It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.