Re: Filtering a dataset on an indexed date field takes long

DarrylLawrence · Posted 03-02-2023 03:27 AM

I have two very large SAS datasets that have indexes on the most commonly used columns used in filters.

Dataset1

99,209,789 records
850 columns

Dataset2

758,618,653 records
375 columns

When I select records based on a text formatted indexed column, the results are returned in under 1 second.

When I select records based on a date formatted indexed column, the results are returned in about 1 minute 22 seconds.

When I select records based on a text formatted and the date formatted indexed column, the results are returned in under 1 second.

Does anyone know why the selection on the date column, that has been indexed, takes much longer (on its own) than when combined with a character indexed column or just a character indexed column.

Could it be the format of the date in the "where" clause?

Data credit_sasma;
set SASMAHIS.SASMA_ACCOUNT_HISTORY (where=(Obs_Date = '12JAN2023'd));
run;

Date format in the SAS Datasets:

3	OBS_DATE	Num	8	DATE9.	DATE9.	OBS_DATE

See attached document for the code and the log.

s_lassen · Posted 03-02-2023 04:28 AM

That is not so strange. The queries using the text column returns very few rows. The query with the date column alone returns 2.7 million records. That has to take more time.

yabwon · Posted 03-02-2023 05:02 AM

Could you run your code with the following options on:

options fullstimer msglevel=I;

so we could see details of processing times and information which indexes were used by SAS to process data.

To force SAS to use particular index you could try the IDXNAME= data set option.

[EDIT:] One more thing, indexes are executed against raw data, so format won't affect processing.

You probably saw this article by Billy Clifford about SAS Indexes: https://support.sas.com/resources/papers/proceedings/proceedings/sugi30/008-30.pdf

Bart

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation

DarrylLawrence · Posted 03-02-2023 07:33 AM

I added the index option and the options - options fullstimer msglevel=I;

Log: No difference in the runtime

28 Data credit_sasma;
29 set SASMAHIS.SASMA_ACCOUNT_HISTORY (idxname=obs_date where=(Obs_Date = '12JAN2023'd));
INFO: Index OBS_DATE selected for WHERE clause optimization.
30 run;

NOTE: There were 2789161 observations read from the data set SASMAHIS.SASMA_ACCOUNT_HISTORY.
WHERE Obs_Date='12JAN2023'D;
NOTE: The data set WORK.CREDIT_SASMA has 2789161 observations and 850 variables.
NOTE: Compressing data set WORK.CREDIT_SASMA decreased size by 67.65 percent.
Compressed is 60148 pages; un-compressed would require 185945 pages.
NOTE: DATA statement used (Total process time):
real time 1:20.43
user cpu time 1:12.15
system cpu time 7.75 seconds
memory 5980.75k
OS Memory 31400.00k
Timestamp 03/02/2023 12:16:33 PM
Step Count 17 Switch Count 133
Page Faults 2
Page Reclaims 1127
Page Swaps 0
Voluntary Context Switches 412
Involuntary Context Switches 535
Block Input Operations 12532969
Block Output Operations 15409176

ErikLund_Jensen · Posted 03-02-2023 11:58 AM

Hi @DarrylLawrence

It would be interesting to se the result of rerunning with a data _null_ step instead, so no time is used for compressing and writing output, just to see if the where clause is the real problem here.

ballardw · Posted 03-02-2023 12:18 PM

@ErikLund_Jensen wrote:

Hi @DarrylLawrence

It would be interesting to se the result of rerunning with a data _null_ step instead, so no time is used for compressing and writing output, just to see if the where clause is the real problem here.

Or maybe just KEEPing 2 or 3 variables instead of 850.

Kurt_Bremser · Posted 03-02-2023 02:52 PM

Writing 2.7 million observations simply takes longer than 1 or 37 observations (both just need one dataset page, so the same physical amount of data is written).

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Filtering a dataset on an indexed date field takes long