How to improve performance in Data step where clause

Learn_SAS1 · Posted 10-22-2020 01:25 AM

Hi,

I am currently using SAS Data step.

Data test1; /* loading approx 17 lakhs records */

Set source (keep=var1 var2 var3); /* source data set has 322596778 obs */

where DATEPART(START_from_dtm)="31JAN2020"D <=datepart(START_to_dtm);

run;

The above data step is running for 25 minutes. Is there any solution to execute the query faster.

START_from_dtm has index on it. So tried by keeping only START_from_dtm column but same time.

I am using the same query multiple times and the execution is increasing.

Thanks in advance.

ChrisNZ · Posted 10-22-2020 01:29 AM

Do your datetimes have a time, or is the time just 00:00:00?

Is your source table indexed or sorted?

How many observations do you have, and how many different dates?

High-Performance SAS Coding - Third Edition

Learn_SAS1 · Posted 10-22-2020 01:31 AM

effective_from_dtm has values like DDMONYYYY:00:00:00.000.

As I have observed most of the values has 00 as time.

Index, I have mentioned in my original post, effective_from_dtm has index and no other indexes.

I can't create Index on that table as it is owned by other teams.

That data set is not sorted.

Thanks

SASKiwi · Posted 10-22-2020 02:16 AM

This might be a little more efficient:

where START_from_dtm = "31JAN2020:00:00:00"DT <= START_to_dtm

ChrisNZ · Posted 10-22-2020 02:20 AM

@SASKiwi 's syntax will make all he difference in the world:

- No function to slow down the filtering

- ~~The index will be used~~

Edit: No index. I confused the variable names, sorry.

Still, using no function will make a significant difference.

High-Performance SAS Coding - Third Edition

Learn_SAS1 · Posted 10-22-2020 02:56 AM

Hi Saskiwi,

I have used that one too. but it hasn't changed much in the performance time.

Thanks

ChrisNZ · Posted 10-22-2020 11:21 PM

> I have used that one too. but it hasn't changed much in the performance time.

Well you need to help your self here. We still haven't seen the actual code (the code posted can't work), or the log or know the row length.

Review the questions asked and answer them.

High-Performance SAS Coding - Third Edition

SASKiwi · Posted 10-24-2020 06:19 PM

@Learn_SAS1 - As already suggested by others please post your complete SAS log including the notes about run times.

ScottBass · Posted 10-22-2020 04:47 AM

@Learn_SAS1 wrote:

effective_from_dtm has values like DDMONYYYY:00:00:00.000.

As I have observed most of the values has 00 as time.

Index, I have mentioned in my original post, effective_from_dtm has index and no other indexes.

I can't create Index on that table as it is owned by other teams.

That data set is not sorted.

Thanks

You've got a source dataset with 322,596,778 records, and need to filter on start_from_dtm <= some_static_date <= start_to_dtm. Typical SCD2 type filtering.

Yet your source dataset doesn't have an index on that column.

Have you have explained the performance issue to your "other teams", requested an index be created on that column, and they refused?

That's going to be your best approach to improving the performance of this query. Google how indexes work if you need further details.

Do you have the disk space to (as a proof-of-concept) create a copy of that dataset, create your own index on start_from_dtm start_to_dtm, then test the performance results?

If you do, you may also want to test performance if the data is stored using the SPDE engine.

Please post your question as a self-contained data step in the form of "have" (source) and "want" (desired results).
I won't contribute to your post if I can't cut-and-paste your syntactically correct code into SAS.

andreas_lds · Posted 10-22-2020 02:06 AM

@Learn_SAS1 wrote:

Hi,

I am currently using SAS Data step.

Data test1; /* loading approx 17 lakhs records */

Set source (keep=var1 var2 var3); /* source data set has 322.596.778 obs */

where DATEPART(START_from_dtm)="31JAN2020"D <=datepart(START_to_dtm);

run;

The above data step is running for 25 minutes. Is there any solution to execute the query faster.

START_from_dtm has index on it. So tried by keeping only START_from_dtm column but same time.

I am using the same query multiple times and the execution is increasing.

Thanks in advance.

In your original-code the variable used in the where-statement are also in the keep-list, right?

Reading a dataset with 332 million obs needs some time. My before-coffee-calculation showed, that about 200000 obs are processed per second, that is not that bad at all.