bekbek3128
Calcite | Level 5

Hi,


We are suffering from a performance issue when querying a dataset with more than 30 million rows. It takes almost 30 minutes to select the necessary data from the dataset. Any suggestions on how we can improve the performance?

Thanks.

9 REPLIES
ChrisNZ
Tourmaline | Level 20

30 million rows is not huge.

Show us the log, with the FULLSTIMER option turned on (a sketch for enabling it is below).

Do you use a WHERE clause or IF statements to subset the data set?

Is the data set sorted? Indexed? Compressed?
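
To turn FULLSTIMER on, a minimal sketch (the library, table, and WHERE condition are placeholders for your own):

options fullstimer;  /* write extended resource statistics to the log */

data work.subset;
  set mylib.bigtable;             /* placeholder: your library.table */
  where somedate = '01JAN2020'd;  /* placeholder: your subsetting condition */
run;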

bekbek3128
Calcite | Level 5
Hi ChrisNZ,

Sorry, the dataset has 500 million rows instead. I'm not sure whether this is considered huge or not. Anyway, the dataset is compressed but not sorted and not indexed.

Here's the log:
NOTE: There were 0 observations read from the data set ILIN.ACMVPF.
WHERE (trandate='18OCT2020'D) and (orgtrcde in ('T679', 'TA69') or batctrcde in ('T679', 'TA69'));
NOTE: The data set WORK.TEST has 0 observations and 59 variables.
NOTE: DATA statement used (Total process time):
real time 24:43.62
user cpu time 4:23.93
system cpu time 1:55.64
memory 875.78k
OS Memory 20120.00k
Timestamp 10/20/2020 07:06:40 PM
Step Count 154 Switch Count 2230
Page Faults 123
Page Reclaims 592
Page Swaps 0
Voluntary Context Switches 932839
Involuntary Context Switches 150130
Block Input Operations 336232352
Block Output Operations 144
andreas_lds
Jade | Level 19

Buy better hardware?

Without the code you are currently using, it is hardly possible to suggest anything useful.

Kurt_Bremser
Super User

Please provide details:

  • complete log of the step with options fullstimer
  • observation size (a PROC CONTENTS sketch follows this list)
  • is this a native SAS dataset, or data in a remote database?
  • if native SAS, stored on local disks, SAN, or network share?
  • SAS server setup: number of cores, operating system
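
For the observation size, PROC CONTENTS reports it as "Observation Length", and the same output also shows whether the dataset is compressed. A minimal sketch against the table named in your log:

proc contents data=ILIN.ACMVPF;
run;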
bekbek3128
Calcite | Level 5
Hi Kurt_Bremser,

Here's the log:
NOTE: There were 0 observations read from the data set ILIN.ACMVPF.
WHERE (trandate='18OCT2020'D) and (orgtrcde in ('T679', 'TA69') or batctrcde in ('T679', 'TA69'));
NOTE: The data set WORK.TEST has 0 observations and 59 variables.
NOTE: DATA statement used (Total process time):
real time 24:43.62
user cpu time 4:23.93
system cpu time 1:55.64
memory 875.78k
OS Memory 20120.00k
Timestamp 10/20/2020 07:06:40 PM
Step Count 154 Switch Count 2230
Page Faults 123
Page Reclaims 592
Page Swaps 0
Voluntary Context Switches 932839
Involuntary Context Switches 150130
Block Input Operations 336232352
Block Output Operations 144

Observation size is 484928084
It is a native SAS dataset stored on a SAN.
Server V8: 4 cores, 32 GB memory. Operating system: Red Hat Enterprise Linux 6.3.
Kurt_Bremser
Super User

An order of magnitude DOES make a difference.

Once we know the observation size, and therefore the resulting file size, we can make an educated guess about the SAN bandwidth.

From your log

real time 24:43.62
user cpu time 4:23.93
system cpu time 1:55.64

you have ~6.3 minutes of CPU time vs. ~25 minutes of real time, so you are quite clearly I/O bound.

Is your source dataset ILIN.ACMVPF compressed? If not, consider using COMPRESS=YES on datasets that contain character variables of considerable length; also test COMPRESS=BINARY for mainly numeric/short character datasets.
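
For example, a quick sketch (the dataset names are illustrative only):

/* COMPRESS=YES (RLE) -- good for long character variables */
data mylib.acmvpf_rle (compress=yes);
  set mylib.acmvpf_raw;
run;

/* COMPRESS=BINARY (RDC) -- often better for mainly numeric data */
data mylib.acmvpf_rdc (compress=binary);
  set mylib.acmvpf_raw;
run;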


PS: I saw that you already use compression; please run a PROC CONTENTS and tell us the reported file size.

SASKiwi
PROC Star

I note that your query does not return any rows. How does the performance vary when rows are returned? How many rows are likely to be returned in a typical query on the 500M-row table? If a typical query returns less than 10% of the rows, then indexes on the query variables will help. That will have to be traded off against the overhead of maintaining the indexes when updating the source table. A quick way to gauge the selectivity is sketched below.
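
A sketch that reuses the WHERE clause from your log to count how many rows a typical query returns:

proc sql;
  select count(*) as hits
  from ILIN.ACMVPF
  where trandate = '18OCT2020'd
    and (orgtrcde in ('T679', 'TA69') or batctrcde in ('T679', 'TA69'));
quit;

If the count stays well under 10% of the 500 million rows, an index on the query variables is worth testing.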

ChrisNZ
Tourmaline | Level 20

SAS reads over 300k rows a second (500 million rows in under 25 minutes); that's not bad.


If the observation size is really 484928084 bytes (is it??), that would be over 150 TB read per second. This is not plausible. Can you check the values again please?


In any case, this seems to be a textbook example of where SPDE should be used.

SPDE (the SAS Scalable Performance Data Engine) is more efficient with large tables and indexes.

Run something like this:

/* SPDE library in the same physical location as ILIN */
libname SPEEDY spde "%sysfunc(pathname(ILIN))" partsize=1T compress=binary;

/* copy the table into the SPDE library */
proc copy in=ILIN out=SPEEDY;
  select ACMVPF;
run;

/* index the variable used in the WHERE clause */
proc datasets lib=SPEEDY nolist;
  modify ACMVPF;
  index create TRANDATE;
quit;

The run time when querying SPEEDY.ACMVPF should drop to just seconds.

Use any libname you want instead of SPEEDY.
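
Then point your original step at the new library; the date condition in the WHERE clause can make use of the TRANDATE index:

data work.test;
  set SPEEDY.ACMVPF;
  where trandate = '18OCT2020'd
    and (orgtrcde in ('T679', 'TA69') or batctrcde in ('T679', 'TA69'));
run;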


Kurt_Bremser
Super User

I seriously doubt an observation size of 484928084 (almost 500 MB).

Please post the output of PROC CONTENTS.


PS: we want to know the observation size, not the number of observations.
