Hello.
I have this situation: there are two similar tables in SAS on the same server. Both have about 300 million observations and a simple index on a numeric field containing date values. In the first table the index "load_date1" has 2000 unique values; in the second table the index "load_date2" has 80 unique values. In both tables the observations are distributed fairly evenly across the index values, so both indexes should significantly speed up queries with a WHERE expression. But that is true only for the second table, and I have no idea why. Both tables are unsorted, but data was added to them incrementally with increasing index values, so there is not much difference in how the data is distributed with respect to the index values. Here is some information on them (I had to change it a bit):
| Data Set Name | table_1 | Observations | 294416780 |
|---|---|---|---|
| Member Type | DATA | Variables | 10 |
| Engine | V9 | Indexes | 1 |
| Created | Friday, 05 April 2013 05:23:54 | Observation Length | 140 |
| Last Modified | Friday, 05 April 2013 05:23:54 | Deleted Observations | 0 |
| Protection | | Compressed | CHAR |
| Data Set Type | | Reuse Space | NO |
| Label | | Point to Observations | YES |
| Data Representation | WINDOWS_64 | Sorted | NO |
| Encoding | wcyrillic Cyrillic (Windows) | | |

| Data Set Page Size | 131072 |
|---|---|
| Number of Data Set Pages | 192865 |
| Index File Page Size | 4096 |
| Number of Index File Pages | 581713 |
| Number of Data Set Repairs | 0 |
| Filename | somewhere\table_1.sas7bdat |
| Release Created | |
| Host Created | |
| Data Set Name | table_2 | Observations | 297622992 |
|---|---|---|---|
| Member Type | DATA | Variables | 33 |
| Engine | V9 | Indexes | 1 |
| Created | Wednesday, 03 April 2013 00:58:54 | Observation Length | 289 |
| Last Modified | Wednesday, 03 April 2013 00:58:54 | Deleted Observations | 0 |
| Protection | | Compressed | BINARY |
| Data Set Type | | Reuse Space | NO |
| Label | | Point to Observations | YES |
| Data Representation | WINDOWS_64 | Sorted | NO |
| Encoding | wcyrillic Cyrillic (Windows) | | |

| Data Set Page Size | 16384 |
|---|---|
| Number of Data Set Pages | 1372810 |
| Index File Page Size | 4096 |
| Number of Index File Pages | 587074 |
| Number of Data Set Repairs | 0 |
| Filename | somewhere\table_2.sas7bdat |
| Release Created | |
| Host Created | |
And the query looks like this (to get data for the last month):
data result;
   set table_2(idxname=load_date2);
   where load_date2 >= &date_month_ago;
run;
Does anybody have any idea why this kind of code works for the second table (it runs in 1-2 minutes) but not for the first one (it runs for about 50 minutes, sometimes even longer than without using the index at all)?
This may be key: "data was added in them incrementally with increasing values of indexes,"
In a number of database systems, when data are appended, queries fall back to a linear search (i.e. the index is not really used) for observations after the first appended one. If SAS works that way, then dropping the index and rebuilding it will cure the problem.
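In SAS, dropping and rebuilding a simple index can be done in one PROC DATASETS step; a minimal sketch, assuming the table lives in a library called mylib:

```sas
proc datasets library=mylib nolist ;
   modify table_1 ;
   index delete load_date1 ;   /* drop the existing index */
   index create load_date1 ;   /* rebuild it from scratch */
run ;
quit ;
```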
If you try it, let the list know what you find out.
Doc Muhlbaier
Duke
Index was dropped and rebuilt, alas there is no improvement.
Demoxe
Try PROC CONTENTS with the CENTILES option.
It will show the distribution, which is likely to be one part of the information the optimisation engines use to decide whether an index provides the fastest way to deliver the result data set.
Before each query using the indexed table(s), use
option MSGLEVEL=i ;
to get feedback on the choice of indexes made by optimisation engines.
On PROC SQL, use option _METHOD to see whether it chooses indexJoin or SortMergeJoin
I would be surprised if COMPRESS=BINARY made the table_2 index unsuitable (table_1 has COMPRESS=CHAR).
The index page size is 4K on both tables, and there is a similar number of index pages for both.
However, table_2 has 1.37M pages of data compared with 0.19M pages for table_1.
I think this is enough to make the difference.
The indexes point to data set pages, and the shapes of the observations and indexes are clearly very different between the tables, so their index use is likely to differ too.
If you want to treat them in a similar way, give the tables a similar page size; for that volume of data I would not recommend anything as small as table_2's 16K (table_1 has a 128K page size).
When reading only a small subset (under 5%), small pages are helpful, but that is a lot of pages to shuffle for a larger subset.
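A data set's page size is fixed when the file is created, so changing it means rewriting the table; a minimal sketch, assuming a library called mylib, that gives table_2 the same 128K page size as table_1 (BUFSIZE= sets the page size, and the index has to be re-created on the copy):

```sas
/* rebuild table_2 with a 128K page size, keeping its compression */
data mylib.table_2_new (bufsize=131072 compress=binary) ;
   set mylib.table_2 ;
run ;

/* the index does not follow the copy - re-create it */
proc datasets library=mylib nolist ;
   modify table_2_new ;
   index create load_date2 ;
run ;
quit ;
```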
Good luck looking further into your data usage.
I hope we'll hear more updates after you have tried some of these ideas.
peterC
As for centiles, I have this distribution for table_1:
As you can see, the distribution is quite even.
I've already used the "MSGLEVEL=i" option, and the log showed that "load_date1" is used as an index.
As for changing the PAGESIZE of table_1, I'm not sure how to do it and wasn't able to find it in the Help or anywhere else. Can you tell me the way?
And I'm sorry for the delay in my answers; I have to wait for assistance from our admins and they are not so eager to help.
these are unusual date values!
what percentage of the table is extracted by
where load_date2>=&date_month_ago;
????
When it uses an index to read the data, SAS effectively executes the output stage of a sort, returning data in index order.
does it need to visit a great many pages to get the one month's data?
I've posted the number of observations for each centile, not the boundary values. And this query gets about 1-2% of all the data.
Data is added with increasing index values each time, so with a condition like "where load_date2>=&date_month_ago" only the last small part of table_1 is extracted (not bits and pieces from all over the table).
Please share the log for each query (with msglevel=i).
Adding fullstimer could also give some hints about bottlenecks in the system.
You may also look at the Windows Task Manager during the queries to see the total usage of system resources.
And just to test the read performance, use data _null_, so you can rule out any data-write bottlenecks.
But even if you/we can't get to the bottom of this, there is always the option of moving your data to SPDE, which will probably improve loading performance and will definitely improve querying.
12 options fullstimer;
13
14 data _NULL_;
15 set table_1(idxname=load_date1);
16 where load_date1>=&date_month_ago;
INFO: Index load_date1 selected for WHERE clause optimization.
17 run;
NOTE: There were 3746291 observations read from the data set table_1.
WHERE load_date1>=19446;
NOTE: DATA statement used (Total process time):
real time 52:41.22
user cpu time 25.17 seconds
system cpu time 1:58.37
memory 256276.06k
OS Memory 399160.00k
Timestamp 12.04.2013 14:50:54
18
19 data _NULL_;
20 set table_2(idxname=load_date2);
21 where load_date2>=&date_month_ago;
INFO: Index load_date2 selected for WHERE clause optimization.
22 run;
NOTE: DATA statement used (Total process time):
real time 27.91 seconds
user cpu time 11.77 seconds
system cpu time 3.10 seconds
memory 34019.84k
OS Memory 52404.00k
Timestamp 12.04.2013 14:51:22
NOTE: There were 6973693 observations read from the data set table_2.
WHERE load_date2>=19446;
In any case, the server can't be so busy during the first query and so free during the second that it would explain such a difference in processing time. As for SPDE... it's a solution indeed, but in my case I don't think the admins will help me with that.
Even though Peter has pointed out differences between your tables, this is still a mystery.
To find out whether this is an index problem, read the same number of observations from each table as your index search returned:
data _null_;
set table_1(obs=3746291);
run;
And the corresponding obs= for table_2.
I'm not sure what kind of help you need from the admins. Technically, it sounds like you should be able to define an SPDE library yourself. Just create some directories and assign a libname to them. If you lack permissions, just set it up in C:\temp for testing purposes.
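A minimal sketch of such a test library (paths and library names assumed, purely for illustration):

```sas
/* assign an SPDE library over an ordinary directory */
libname spdetst spde 'C:\temp\spde' ;

/* copy the table in and build the index on the way */
data spdetst.table_1 (index=(load_date1)) ;
   set mylib.table_1 ;
run ;
```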
Demoxe
I see that you force the use of the index.
Would the index not be chosen by SAS if you had not demanded it?
It is possible that SAS has some good reason to reject use of the index, and we have yet to identify that reason.
peterC
An alternative approach to this use of indexes might be possible and might greatly improve access times.
The index is used to retrieve rows added since "some point in time" using LOAD_DATE.
Without deleting rows, and without the REUSE option in effect, INSERT has the effect of APPEND, adding new rows at the end of the table.
So, assuming LOAD_DATE reflects the order in which rows are added to the table, then rather than use the index, the same results could be achieved with the FIRSTOBS= data set option on a SET statement.
The value to use for FIRSTOBS= could be produced as part of the data-loading process in future. Historic load counts can easily be obtained in a single one-off pass; keep these in a table of load_date and row_count. Since that table is small, the FIRSTOBS= value can be derived whenever needed (cumulative row_count + 1).
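That bookkeeping could be sketched like this (library, table and variable names assumed):

```sas
/* one-off pass: rows loaded per load_date */
proc sql ;
   create table load_counts as
   select load_date1 as load_date, count(*) as row_count
   from mylib.table_1
   group by load_date1
   order by load_date ;
quit ;

/* FIRSTOBS = 1 + rows loaded before the wanted date */
proc sql noprint ;
   select coalesce(sum(row_count), 0) + 1 into :first_obs
   from load_counts
   where load_date < &date_month_ago ;
quit ;

/* sequential read from the computed start point - no index needed */
data result ;
   set mylib.table_1 (firstobs=&first_obs) ;
run ;
```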
To demonstrate the FIRSTOBS= versus WHERE= (with index) issue, I prepared a 1.5GB table of 10M rows with a bland mixture of strings, dates and a row number.
The table BIGGIE has 10M rows created by
data biggie ;
   length rows 8 string1-string5 $20 ;
   format date1-date5 date9. ;
   do rows = 1 to 1e7 ;
      output ;
   end ;
run ;
I prepared a unique index on ROWS with
proc datasets library=work nolist ;
   modify biggie ;
   index create rows / unique ;
run ;
quit ;
proc datasets details; run;
reveals the file sizes of the data and index:
| # | Name | Member Type | Obs, Entries or Indexes | Vars | Label | File Size | Last Modified |
|---|---|---|---|---|---|---|---|
| 1 | BIGGIE | DATA | 10000000 | 11 | | 1536025600 | 15-Apr-13 10:13:37 |
| | BIGGIE | INDEX | 1 | | | 161702912 | 15-Apr-13 10:13:37 |
The data and index are created with the defaults for buffers and pages.
I have run separate extracts with FIRSTOBS= and WHERE= for subsets of 1500, 15000, 150000 and 1500000 rows.
FIRSTOBS is a clear winner:
subset size | REAL FIRSTOBS | REAL WHERE | CPU FIRSTOBS | CPU WHERE |
---|---|---|---|---|
1500 | 0.08 | 0.12 | 0.04 | 0.02 |
15000 | 0.04 | 0.05 | 0.05 | 0.02 |
150000 | 0.17 | 0.36 | 0.15 | 0.20 |
1500000 | 7.43 | 40.99 | 1.31 | 3.08 |
Here I presume no compression is used, and I judge by REAL time because the data are so artificial.
Your experience WILL vary.
I expect FIRSTOBS is the winner because the system has less to do: it just collects all rows following the start point, whereas index processing:
- makes no assumption about the order of index values
- checks the index for every row that satisfies the WHERE
Clearly the performance on my old machine varied greatly when the size of the subset grew to 1.5M, being 15% of the data, and notice in the logs below that MSGLEVEL=i indicates the index was still used.
The SASlog showing the 8 datasteps follows:
280 %let prior_posn = %sysevalf( 1e7 - 1500 ) ;
281 option msglevel=i ;
282 data to_be_analysed ;
283 set biggie( firstOBS=%eval(1+&prior_posn) ) ;
284 run ;
NOTE: There were 1500 observations read from the data set WORK.BIGGIE.
NOTE: The data set WORK.TO_BE_ANALYSED has 1500 observations and 11 variables.
NOTE: DATA statement used (Total process time):
real time 0.08 seconds
cpu time 0.04 seconds
285
286 data to_be_analysed ;
287 set biggie( where=( rows >&prior_posn) ) ;
INFO: Index rows selected for WHERE clause optimization.
288 run ;
NOTE: There were 1500 observations read from the data set WORK.BIGGIE.
WHERE rows>9998500;
NOTE: The data set WORK.TO_BE_ANALYSED has 1500 observations and 11 variables.
NOTE: DATA statement used (Total process time):
real time 0.12 seconds
cpu time 0.03 seconds
289 %let prior_posn = %sysevalf( 1e7 - 15000 ) ;
290 option msglevel=i ;
291 data to_be_analysed ;
292 set biggie( firstOBS=%eval(1+&prior_posn) ) ;
293 run ;
NOTE: There were 15000 observations read from the data set WORK.BIGGIE.
NOTE: The data set WORK.TO_BE_ANALYSED has 15000 observations and 11 variables.
NOTE: DATA statement used (Total process time):
real time 0.04 seconds
cpu time 0.04 seconds
294 data to_be_analysed ;
295 set biggie( where=( rows >&prior_posn) ) ;
INFO: Index rows selected for WHERE clause optimization.
296 run ;
NOTE: There were 15000 observations read from the data set WORK.BIGGIE.
WHERE rows>9985000;
NOTE: The data set WORK.TO_BE_ANALYSED has 15000 observations and 11 variables.
NOTE: DATA statement used (Total process time):
real time 0.05 seconds
cpu time 0.03 seconds
297 %let prior_posn = %sysevalf( 1e7 - 150000 ) ;
298 option msglevel=i ;
299 data to_be_analysed ;
300 set biggie( firstOBS=%eval(1+&prior_posn) ) ;
301 run ;
NOTE: There were 150000 observations read from the data set WORK.BIGGIE.
NOTE: The data set WORK.TO_BE_ANALYSED has 150000 observations and 11 variables.
NOTE: DATA statement used (Total process time):
real time 0.17 seconds
cpu time 0.15 seconds
302 data to_be_analysed ;
303 set biggie( where=( rows >&prior_posn) ) ;
INFO: Index rows selected for WHERE clause optimization.
304 run ;
NOTE: There were 150000 observations read from the data set WORK.BIGGIE.
WHERE rows>9850000;
NOTE: The data set WORK.TO_BE_ANALYSED has 150000 observations and 11 variables.
NOTE: DATA statement used (Total process time):
real time 0.36 seconds
cpu time 0.20 seconds
305 %let prior_posn = %sysevalf( 1e7 - 1500000 ) ;
306 option msglevel=i ;
307 data to_be_analysed ;
308 set biggie( firstOBS=%eval(1+&prior_posn) ) ;
309 run ;
NOTE: There were 1500000 observations read from the data set WORK.BIGGIE.
NOTE: The data set WORK.TO_BE_ANALYSED has 1500000 observations and 11 variables.
NOTE: DATA statement used (Total process time):
real time 7.43 seconds
cpu time 1.31 seconds
310 data to_be_analysed ;
311 set biggie( where=( rows >&prior_posn) ) ;
INFO: Index rows selected for WHERE clause optimization.
312 run ;
NOTE: There were 1500000 observations read from the data set WORK.BIGGIE.
WHERE rows>8500000;
NOTE: The data set WORK.TO_BE_ANALYSED has 1500000 observations and 11 variables.
NOTE: DATA statement used (Total process time):
real time 40.99 seconds
cpu time 3.08 seconds
In addition to what Peter said, it might also be worth increasing the index page size (option IBUFSIZE), and "playing around" with IBUFNO could also have a positive effect.
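One caveat: IBUFSIZE takes effect only when the index file is created, so the index has to be rebuilt after setting it, while IBUFNO affects reading and can be changed at any time. A sketch, assuming a library called mylib:

```sas
options ibufsize=32767 ;          /* applies to newly created index files */

proc datasets library=mylib nolist ;
   modify table_1 ;
   index delete load_date1 ;
   index create load_date1 ;      /* re-create so the new page size is used */
run ;
quit ;

options ibufno=10 ;               /* extra index buffers for subsequent reads */
```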
I've tried the "ibufsize=max" and "ibufno=max" options, and they haven't changed anything either. Maybe I need to set some specific values?