Hello.
I have this situation: there are two similar tables in SAS on the same server. Both have about 300 million observations and a simple index on a numeric field containing date values. In the first table the index "load_date1" has 2000 unique values; in the second table the index "load_date2" has 80 unique values. In both tables the observations are distributed fairly evenly across the index values, so both indexes should significantly speed up queries with a WHERE expression. But that is true only for the second table, and I have no idea why. Both tables are unsorted, but data was added to them incrementally with increasing index values, so there is not much difference in how the data is distributed with respect to the index values. Here is some information on them (I had to change it a bit):
| Data Set Name | table_1 | Observations | 294416780 |
|---|---|---|---|
| Member Type | DATA | Variables | 10 |
| Engine | V9 | Indexes | 1 |
| Created | Friday, 05 April 2013 05:23:54 | Observation Length | 140 |
| Last Modified | Friday, 05 April 2013 05:23:54 | Deleted Observations | 0 |
| Protection | | Compressed | CHAR |
| Data Set Type | | Reuse Space | NO |
| Label | | Point to Observations | YES |
| Data Representation | WINDOWS_64 | Sorted | NO |
| Encoding | wcyrillic Cyrillic (Windows) | | |

| Data Set Page Size | 131072 |
|---|---|
| Number of Data Set Pages | 192865 |
| Index File Page Size | 4096 |
| Number of Index File Pages | 581713 |
| Number of Data Set Repairs | 0 |
| Filename | somewhere\table_1.sas7bdat |
| Release Created | |
| Host Created | |
| Data Set Name | table_2 | Observations | 297622992 |
|---|---|---|---|
| Member Type | DATA | Variables | 33 |
| Engine | V9 | Indexes | 1 |
| Created | Wednesday, 03 April 2013 00:58:54 | Observation Length | 289 |
| Last Modified | Wednesday, 03 April 2013 00:58:54 | Deleted Observations | 0 |
| Protection | | Compressed | BINARY |
| Data Set Type | | Reuse Space | NO |
| Label | | Point to Observations | YES |
| Data Representation | WINDOWS_64 | Sorted | NO |
| Encoding | wcyrillic Cyrillic (Windows) | | |

| Data Set Page Size | 16384 |
|---|---|
| Number of Data Set Pages | 1372810 |
| Index File Page Size | 4096 |
| Number of Index File Pages | 587074 |
| Number of Data Set Repairs | 0 |
| Filename | somewhere\table_2.sas7bdat |
| Release Created | |
| Host Created | |
And the query looks like this (to get data for the last month):
data result;
   set table_2(idxname=load_date2);
   where load_date2 >= &date_month_ago;
run;
Does anybody have any idea why this kind of code works for the second table (it runs in 1-2 minutes) but not for the first one (it runs for about 50 minutes, sometimes even longer than without using the index at all)?
This may be key: "data was added in them incrementally with increasing values of indexes,"
In a number of database systems, when data are appended, queries fall back to a linear search (i.e. the index is not really used) for observations after the first appended one. If SAS works that way, then dropping the index and rebuilding it will cure the problem.
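In SAS, dropping and rebuilding a simple index can be done in one PROC DATASETS step; a minimal sketch, assuming the table lives in a library called mylib:

```sas
proc datasets library=mylib nolist ;
   modify table_1 ;
   index delete load_date1 ;   /* drop the existing index */
   index create load_date1 ;   /* rebuild it from scratch */
run ;
quit ;
```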
If you try it, let the list know what you find out.
Doc Muhlbaier
Duke
Index was dropped and rebuilt, alas there is no improvement.
Demoxe
Try PROC CONTENTS with the CENTILES option.
It will show the distribution, which is likely to be one part of the information the optimisation engines use to decide whether an index provides the fastest way to deliver the result data set.
Before each query using the indexed table(s), use
option MSGLEVEL=i ;
to get feedback on the choice of indexes made by optimisation engines.
On PROC SQL, use option _METHOD to see whether it chooses indexJoin or SortMergeJoin
I would be surprised if COMPRESS=BINARY made the table_2 index unsuitable (table_1 has COMPRESS=CHAR).
The index page size is 4K on both tables, and there is a similar number of index pages for both.
However, table_2 has 1.37M pages of data compared with 0.19M pages for table_1.
I think this is enough to make the difference.
The indexes point to data set pages, and the shapes of the observations and indexes are clearly very different between the tables, so their index use is likely to differ too.
If you want to treat them in a similar way, give the tables a similar page size; for that volume of data I would not recommend anything as small as table_2's 16K (table_1 has a 128K page size).
When reading only a small subset (under 5%), small pages are helpful, but that is a lot of pages to shuffle for a larger subset.
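A data set's page size is fixed when the file is created, so changing it means rewriting the table; a minimal sketch, assuming a library called mylib, that gives table_2 the same 128K page size as table_1 (BUFSIZE= sets the page size, and the index has to be re-created on the copy):

```sas
/* rebuild table_2 with a 128K page size, keeping its compression */
data mylib.table_2_new (bufsize=131072 compress=binary) ;
   set mylib.table_2 ;
run ;

/* the index does not follow the copy - re-create it */
proc datasets library=mylib nolist ;
   modify table_2_new ;
   index create load_date2 ;
run ;
quit ;
```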
Good luck looking further into your data usage.
I hope we'll hear more updates after you have tried some of these ideas.
peterC
As for centiles, I have this distribution for table_1:
As you can see, the distribution is quite even.
I've already used the "MSGLEVEL=i" option, and the log showed that "load_date1" is used as an index.
As for changing the PAGESIZE of table_1, I'm not sure how to do it and wasn't able to find it in the Help or anywhere else. Can you tell me the way?
And I'm sorry for the delay in my answers; I have to wait for assistance from our admins and they are not so eager to help.
these are unusual date values!
what percentage of the table is extracted by
where load_date2>=&date_month_ago;
????
When it uses an index to read the data, SAS effectively executes the output stage of a sort, returning data in index order.
does it need to visit a great many pages to get the one month's data?
I've posted the number of observations for each centile, not the boundary values. And this query gets about 1-2% of all the data.
Data is added with increasing index values each time, so with a condition like "where load_date2>=&date_month_ago" only the last small part of table_1 is extracted (not bits and pieces from all over the table).
Please share the log for each query (with msglevel=i).
Adding fullstimer could also give some hints about bottlenecks in the system.
You may also look at the Windows Task Manager during the queries to see the total usage of system resources.
And just to test the read performance, use data _null_, so you can rule out any data-write bottlenecks.
But even if you/we can't get to the bottom of this, there is always the option of moving your data to SPDE, which will probably improve loading performance and will definitely improve querying.
12 options fullstimer;
13
14 data _NULL_;
15 set table_1(idxname=load_date1);
16 where load_date1>=&date_month_ago;
INFO: Index load_date1 selected for WHERE clause optimization.
17 run;
NOTE: There were 3746291 observations read from the data set table_1.
WHERE load_date1>=19446;
NOTE: DATA statement used (Total process time):
real time 52:41.22
user cpu time 25.17 seconds
system cpu time 1:58.37
memory 256276.06k
OS Memory 399160.00k
Timestamp 12.04.2013 14:50:54
18
19 data _NULL_;
20 set table_2(idxname=load_date2);
21 where load_date2>=&date_month_ago;
INFO: Index load_date2 selected for WHERE clause optimization.
22 run;
NOTE: DATA statement used (Total process time):
real time 27.91 seconds
user cpu time 11.77 seconds
system cpu time 3.10 seconds
memory 34019.84k
OS Memory 52404.00k
Timestamp 12.04.2013 14:51:22
NOTE: There were 6973693 observations read from the data set table_2.
WHERE load_date2>=19446;
In any case, the server can't be so busy during the first query and so free during the second that it would explain such a difference in processing time. As for SPDE... it's a solution indeed, but in my case I don't think the admins will help me with that.
Even though Peter has pointed out differences between your tables, this is still a mystery.
To find out whether this is an index problem, read the same number of observations from each table as your index search returned:
data _null_;
set table_1(obs=3746291);
run;
And the corresponding obs= for table_2.
I'm not sure what kind of help you need from the admins. Technically, it sounds like you should be able to define an SPDE library yourself. Just create some directories and assign a libname to them. If you lack permissions, just set it up in C:\temp for testing purposes.
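A minimal sketch of such a test library (paths and library names assumed, purely for illustration):

```sas
/* assign an SPDE library over an ordinary directory */
libname spdetst spde 'C:\temp\spde' ;

/* copy the table in and build the index on the way */
data spdetst.table_1 (index=(load_date1)) ;
   set mylib.table_1 ;
run ;
```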
Demoxe
I see that you force the use of the index.
Would the index not be chosen by SAS if you had not demanded it?
It is possible that SAS has some good reason to reject use of the index, and we have yet to identify that reason.
peterC
An alternative approach to this use of indexes might be possible and might greatly improve access times.
The index is used to retrieve rows added since "some point in time" using LOAD_DATE.
Without deleting rows, and without the REUSE option in effect, INSERT has the effect of APPEND, adding new rows at the end of the table.
So, assuming LOAD_DATE reflects the order in which rows are added to the table, then rather than use the index, the same results could be achieved with the FIRSTOBS= data set option on a SET statement.
The value to use for FIRSTOBS= could be produced as part of the data-loading process in future. Historic load counts can easily be obtained in a single one-off pass; keep these in a table of load_date and row_count. Since that table is small, the FIRSTOBS= value can be derived whenever needed (cumulative row_count + 1).
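That bookkeeping could be sketched like this (library, table and variable names assumed):

```sas
/* one-off pass: rows loaded per load_date */
proc sql ;
   create table load_counts as
   select load_date1 as load_date, count(*) as row_count
   from mylib.table_1
   group by load_date1
   order by load_date ;
quit ;

/* FIRSTOBS = 1 + rows loaded before the wanted date */
proc sql noprint ;
   select coalesce(sum(row_count), 0) + 1 into :first_obs
   from load_counts
   where load_date < &date_month_ago ;
quit ;

/* sequential read from the computed start point - no index needed */
data result ;
   set mylib.table_1 (firstobs=&first_obs) ;
run ;
```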
To demonstrate the FIRSTOBS= versus WHERE= (with index) issue, I prepared a 1.5GB table of 10M rows with a bland mixture of strings, dates and a row number.
The table BIGGIE has 10M rows created by
data biggie ;
   length rows 8 string1-string5 $20 ;
   format date1-date5 date9. ;
   do rows = 1 to 1e7 ;
      output ;
   end ;
run ;
I prepared a unique index on ROWS with
proc datasets library=work nolist ;
   modify biggie ;
   index create rows / unique ;
run ;
quit ;
proc datasets details; run;
reveals the file sizes of the data and index:
| # | Name | Member Type | Obs, Entries or Indexes | Vars | Label | File Size | Last Modified |
|---|---|---|---|---|---|---|---|
| 1 | BIGGIE | DATA | 10000000 | 11 | | 1536025600 | 15-Apr-13 10:13:37 |
| | BIGGIE | INDEX | 1 | | | 161702912 | 15-Apr-13 10:13:37 |
The data and index are created with the defaults for buffers and pages.
I have run separate extracts with FIRSTOBS= and WHERE= for subsets of 1500, 15000, 150000 and 1500000 rows.
FIRSTOBS is a clear winner:
subset size | REAL FIRSTOBS | REAL WHERE | CPU FIRSTOBS | CPU WHERE |
---|---|---|---|---|
1500 | 0.08 | 0.12 | 0.04 | 0.02 |
15000 | 0.04 | 0.05 | 0.05 | 0.02 |
150000 | 0.17 | 0.36 | 0.15 | 0.20 |
1500000 | 7.43 | 40.99 | 1.31 | 3.08 |
Here I presume no compression is used, and I judge by REAL time because the data are so artificial.
Your experience WILL vary.
I expect FIRSTOBS is the winner because the system has less to do: it just collects all rows following the start point, whereas index processing:
- makes no assumption about the order of index values
- checks the index for every row that satisfies the WHERE
Clearly the performance on my old machine varied greatly when the size of the subset grew to 1.5M, being 15% of the data, and notice in the logs below that MSGLEVEL=i indicates the index was still used.
The SASlog showing the 8 datasteps follows:
280 %let prior_posn = %sysevalf( 1e7 - 1500 ) ;
281 option msglevel=i ;
282 data to_be_analysed ;
283 set biggie( firstOBS=%eval(1+&prior_posn) ) ;
284 run ;
NOTE: There were 1500 observations read from the data set WORK.BIGGIE.
NOTE: The data set WORK.TO_BE_ANALYSED has 1500 observations and 11 variables.
NOTE: DATA statement used (Total process time):
real time 0.08 seconds
cpu time 0.04 seconds
285
286 data to_be_analysed ;
287 set biggie( where=( rows >&prior_posn) ) ;
INFO: Index rows selected for WHERE clause optimization.
288 run ;
NOTE: There were 1500 observations read from the data set WORK.BIGGIE.
WHERE rows>9998500;
NOTE: The data set WORK.TO_BE_ANALYSED has 1500 observations and 11 variables.
NOTE: DATA statement used (Total process time):
real time 0.12 seconds
cpu time 0.03 seconds
289 %let prior_posn = %sysevalf( 1e7 - 15000 ) ;
290 option msglevel=i ;
291 data to_be_analysed ;
292 set biggie( firstOBS=%eval(1+&prior_posn) ) ;
293 run ;
NOTE: There were 15000 observations read from the data set WORK.BIGGIE.
NOTE: The data set WORK.TO_BE_ANALYSED has 15000 observations and 11 variables.
NOTE: DATA statement used (Total process time):
real time 0.04 seconds
cpu time 0.04 seconds
294 data to_be_analysed ;
295 set biggie( where=( rows >&prior_posn) ) ;
INFO: Index rows selected for WHERE clause optimization.
296 run ;
NOTE: There were 15000 observations read from the data set WORK.BIGGIE.
WHERE rows>9985000;
NOTE: The data set WORK.TO_BE_ANALYSED has 15000 observations and 11 variables.
NOTE: DATA statement used (Total process time):
real time 0.05 seconds
cpu time 0.03 seconds
297 %let prior_posn = %sysevalf( 1e7 - 150000 ) ;
298 option msglevel=i ;
299 data to_be_analysed ;
300 set biggie( firstOBS=%eval(1+&prior_posn) ) ;
301 run ;
NOTE: There were 150000 observations read from the data set WORK.BIGGIE.
NOTE: The data set WORK.TO_BE_ANALYSED has 150000 observations and 11 variables.
NOTE: DATA statement used (Total process time):
real time 0.17 seconds
cpu time 0.15 seconds
302 data to_be_analysed ;
303 set biggie( where=( rows >&prior_posn) ) ;
INFO: Index rows selected for WHERE clause optimization.
304 run ;
NOTE: There were 150000 observations read from the data set WORK.BIGGIE.
WHERE rows>9850000;
NOTE: The data set WORK.TO_BE_ANALYSED has 150000 observations and 11 variables.
NOTE: DATA statement used (Total process time):
real time 0.36 seconds
cpu time 0.20 seconds
305 %let prior_posn = %sysevalf( 1e7 - 1500000 ) ;
306 option msglevel=i ;
307 data to_be_analysed ;
308 set biggie( firstOBS=%eval(1+&prior_posn) ) ;
309 run ;
NOTE: There were 1500000 observations read from the data set WORK.BIGGIE.
NOTE: The data set WORK.TO_BE_ANALYSED has 1500000 observations and 11 variables.
NOTE: DATA statement used (Total process time):
real time 7.43 seconds
cpu time 1.31 seconds
310 data to_be_analysed ;
311 set biggie( where=( rows >&prior_posn) ) ;
INFO: Index rows selected for WHERE clause optimization.
312 run ;
NOTE: There were 1500000 observations read from the data set WORK.BIGGIE.
WHERE rows>8500000;
NOTE: The data set WORK.TO_BE_ANALYSED has 1500000 observations and 11 variables.
NOTE: DATA statement used (Total process time):
real time 40.99 seconds
cpu time 3.08 seconds
In addition to what Peter said, it might also be worth increasing the index page size (option IBUFSIZE), and "playing around" with IBUFNO could also have a positive effect.
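One caveat: IBUFSIZE takes effect only when the index file is created, so the index has to be rebuilt after setting it, while IBUFNO affects reading and can be changed at any time. A sketch, assuming a library called mylib:

```sas
options ibufsize=32767 ;          /* applies to newly created index files */

proc datasets library=mylib nolist ;
   modify table_1 ;
   index delete load_date1 ;
   index create load_date1 ;      /* re-create so the new page size is used */
run ;
quit ;

options ibufno=10 ;               /* extra index buffers for subsequent reads */
```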
I've tried the "ibufsize=max" and "ibufno=max" options, and they haven't changed anything either. Maybe I need to set some specific values?