I have two very large SAS datasets that have indexes on the most commonly used columns used in filters.
Dataset1
99,209,789 records
850 columns
Dataset2
758,618,653 records
375 columns
When I select records based on a text formatted indexed column, the results are returned in under 1 second.
When I select records based on a date formatted indexed column, the results are returned in about 1 minute 22 seconds.
When I select records based on a text formatted and the date formatted indexed column, the results are returned in under 1 second.
Does anyone know why the selection on the date column, that has been indexed, takes much longer (on its own) than when combined with a character indexed column or just a character indexed column.
Could it be the format of the date in the "where" clause?
Data credit_sasma;
set SASMAHIS.SASMA_ACCOUNT_HISTORY (where=(Obs_Date = '12JAN2023'd));
run;
Date format in the SAS Datasets:
3 | OBS_DATE | Num | 8 | DATE9. | DATE9. | OBS_DATE |
---|
See attached document for the code and the log.
That is not so strange. The queries using the text column returns very few rows. The query with the date column alone returns 2.7 million records. That has to take more time.
Could you run your code with the following options on:
options fullstimer msglevel=I;
so we could see details of processing times and information which indexes were used by SAS to process data.
To force SAS to use particular index you could try the IDXNAME= data set option.
[EDIT:] One more thing, indexes are executed against raw data, so format won't affect processing.
You probably saw this article by Billy Clifford about SAS Indexes: https://support.sas.com/resources/papers/proceedings/proceedings/sugi30/008-30.pdf
Bart
I added the index option and the options - options fullstimer msglevel=I;
Log: No difference in the runtime
28 Data credit_sasma;
29 set SASMAHIS.SASMA_ACCOUNT_HISTORY (idxname=obs_date where=(Obs_Date = '12JAN2023'd));
INFO: Index OBS_DATE selected for WHERE clause optimization.
30 run;
NOTE: There were 2789161 observations read from the data set SASMAHIS.SASMA_ACCOUNT_HISTORY.
WHERE Obs_Date='12JAN2023'D;
NOTE: The data set WORK.CREDIT_SASMA has 2789161 observations and 850 variables.
NOTE: Compressing data set WORK.CREDIT_SASMA decreased size by 67.65 percent.
Compressed is 60148 pages; un-compressed would require 185945 pages.
NOTE: DATA statement used (Total process time):
real time 1:20.43
user cpu time 1:12.15
system cpu time 7.75 seconds
memory 5980.75k
OS Memory 31400.00k
Timestamp 03/02/2023 12:16:33 PM
Step Count 17 Switch Count 133
Page Faults 2
Page Reclaims 1127
Page Swaps 0
Voluntary Context Switches 412
Involuntary Context Switches 535
Block Input Operations 12532969
Block Output Operations 15409176
It would be interesting to se the result of rerunning with a data _null_ step instead, so no time is used for compressing and writing output, just to see if the where clause is the real problem here.
@ErikLund_Jensen wrote:
It would be interesting to se the result of rerunning with a data _null_ step instead, so no time is used for compressing and writing output, just to see if the where clause is the real problem here.
Or maybe just KEEPing 2 or 3 variables instead of 850.
Writing 2.7 million observations simply takes longer than 1 or 37 observations (both just need one dataset page, so the same physical amount of data is written).
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.