BookmarkSubscribeRSS Feed
bekbek3128
Calcite | Level 5

Hi,

 

We are suffering from performance issue in querying a dataset with more than 30 millions rows. It takes almost 30 minutes to select the necessary data from the dataset. Any suggestion how can we improve the performance?

Thanks.

9 REPLIES 9
ChrisNZ
Tourmaline | Level 20

30 millions rows is not huge.

Show us the log, with option fullstimer turned on.

Do you use a where clause, if statements to subset the data set?

Is the data set sorted? Indexed? compressed? 

bekbek3128
Calcite | Level 5
Hi ChrisNZ,

Sorry the dataset has 500 millions rows instead. Not sure is this considered huge or not. Anyway, the dataset is compressed but not sorted and not indexed.

Here's the log:
NOTE: There were 0 observations read from the data set ILIN.ACMVPF.
WHERE (trandate='18OCT2020'D) and (orgtrcde in ('T679', 'TA69') or batctrcde in ('T679', 'TA69'));
NOTE: The data set WORK.TEST has 0 observations and 59 variables.
NOTE: DATA statement used (Total process time):
real time 24:43.62
user cpu time 4:23.93
system cpu time 1:55.64
memory 875.78k
OS Memory 20120.00k
Timestamp 10/20/2020 07:06:40 PM
Step Count 154 Switch Count 2230
Page Faults 123
Page Reclaims 592
Page Swaps 0
Voluntary Context Switches 932839
Involuntary Context Switches 150130
Block Input Operations 336232352
Block Output Operations 144
andreas_lds
Jade | Level 19

Buy better hardware?

Without the code you are currently using it is hardly possible to suggest something useful.

Kurt_Bremser
Super User

Please provide details:

  • complete log of the step with options fullstimer
  • observation size
  • is this a native SAS dataset, or data in a remote database?
  • if native SAS, stored on local disks, SAN, or network share?
  • SAS server setup: number of cores, operating system
bekbek3128
Calcite | Level 5
Hi KurtBremser,

Hre's the log:
Hi ChrisNZ,

Sorry the dataset has 500 millions rows instead. Not sure is this considered huge or not. Anyway, the dataset is compressed but not sorted and not indexed.

Here's the log:
NOTE: There were 0 observations read from the data set ILIN.ACMVPF.
WHERE (trandate='18OCT2020'D) and (orgtrcde in ('T679', 'TA69') or batctrcde in ('T679', 'TA69'));
NOTE: The data set WORK.TEST has 0 observations and 59 variables.
NOTE: DATA statement used (Total process time):
real time 24:43.62
user cpu time 4:23.93
system cpu time 1:55.64
memory 875.78k
OS Memory 20120.00k
Timestamp 10/20/2020 07:06:40 PM
Step Count 154 Switch Count 2230
Page Faults 123
Page Reclaims 592
Page Swaps 0
Voluntary Context Switches 932839
Involuntary Context Switches 150130
Block Input Operations 336232352
Block Output Operations 144

Observation size is 484928084
It is a native SAS stored on SAN.
Server V8: 4 Core , 32gb memory Operating system: Red Hat Linux Enterprise 6.3
Kurt_Bremser
Super User

An order of magnitude DOES make a difference.

Depending on the observation size and therefore the resulting file size, we could make educated guesses about the SAN bandwidth.

From your log

real time 24:43.62
user cpu time 4:23.93
system cpu time 1:55.64

you have ~7.5 minutes CPU time vs. ~25 minutes real time, so you are quite clearly I/O bound.

Is your source dataset ILIN.ACMVPF compressed? If not, consider using COMPRESS=YES on datasets that contain character variables of considerable length; also test COMPRESS=BINARY for mainly numeric/short character datasets.

 

 

PS saw that you already use compression; please run a PROC CONTENTS and tell us the reported file size.

SASKiwi
PROC Star

I note that your query does not return any rows. How does the performance vary when rows are returned? How many rows are likely to be returned in a typical query on the 500m row table? If a typical query returns less than 10% of the rows than indexes on the query variables will help. That will have to be traded off against the overhead of maintaining the indexes when updating the source table.

ChrisNZ
Tourmaline | Level 20

SAS reads over 300k rows a second, that's not bad.

 

If the observation size is 484928084 (is it really??) that's 1.5 TB a second. This is unlikely. Can you check the values again please?.

 

In any case, this seems to be a textbook example of where SPDE should be used.

SPDE is more efficient when using large tables and indexes.

Run something like this:

libname SPEEDY spde "%sysfunc(pathname(ILIN))" partsize=1T compress=binary;

proc copy in=INLIN out=SPEEDY;
  select ACMVPF;
run;

proc datasets lib=SPEEDY noprint;
  modify ACMVPF;
  index create TRANDATE;
quit;

The run time of when querying SPEEDY.ACMVPF should drop to just seconds.

Use any libname you want instead of SPEEDY.

 

Kurt_Bremser
Super User

I seriously doubt an observation size of 484928084 (almost 500 MB).

Please post the output of PROC CONTENTS.

 

PS we want to know the observation size, not the observation number.

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 9 replies
  • 2121 views
  • 1 like
  • 5 in conversation