An alternative approach to this use of indexes might be possible and might greatly improve access times. The index is used to retrieve rows added since "some point in time" using LOAD_DATE. Provided rows are not deleted and the REUSE= option is not in effect, INSERT has the effect of APPEND, adding new rows at the end of the table. So, assuming LOAD_DATE reflects the order in which rows are added to the table, the same result could be achieved without the index by using the FIRSTOBS= data set option on the SET statement.

Obtaining the value to use for FIRSTOBS= could become part of the data-loading process in future. Historic load counts could easily be obtained in a single one-off pass. Keep these in a table of load_date and row_count; since the volume of that table is small, the FIRSTOBS= value could be derived whenever needed (cumulative row_count + 1).

To demonstrate the issue of FIRSTOBS= versus WHERE= with the index, I prepared a 1.5 GB table of 10M rows with a bland mixture of strings, dates and a row number:

   data biggie ;
     length rows 8 string1-string5 $20 ;
     format date1-date5 date9. ;
     do rows = 1 to 1e7 ;
       output ;
     end ;
   run ;

I created a unique index on ROWS with:

   proc datasets details ;
     modify biggie ;
       create index rows / unique ;
   run ;
   quit ;

Running PROC DATASETS with the DETAILS option reveals the file sizes of data and index:

                           Member   Obs, Entries
   #  Name     Type     or Indexes   Vars  Label  File Size    Last Modified
   1  BIGGIE   DATA     10000000     11           1536025600   15-Apr-13 10:13:37
      BIGGIE   INDEX    1                         161702912    15-Apr-13 10:13:37

The data and index were created with the defaults for buffers and pages. I ran separate extracts with FIRSTOBS= and WHERE= for 1500, 15000, 150000 and 1500000 row subsets.
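To make the load_date idea above concrete, here is a minimal sketch of deriving the FIRSTOBS= value from a control table of load counts. The table LOAD_HISTORY (columns LOAD_DATE and ROW_COUNT) and the macro variables are hypothetical names, not part of the tested code:

```sas
/* Hypothetical control table LOAD_HISTORY(load_date, row_count),   */
/* one row per load. The first "new" row is the cumulative count    */
/* of rows already present at the cutoff, plus 1.                   */
proc sql noprint ;
  select sum(row_count) + 1
    into :first_new
    from load_history
    where load_date <= &cutoff_date ;
quit ;

data to_be_analysed ;
  set big_table( firstobs=&first_new ) ;   /* sequential read, no index */
run ;
```

This keeps the expensive table untouched; only the small control table is queried to find the starting position.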
FIRSTOBS= is the clear winner:

   subset size   REAL FIRSTOBS   REAL WHERE   CPU FIRSTOBS   CPU WHERE
          1500            0.08         0.12           0.04        0.02
         15000            0.04         0.05           0.05        0.02
        150000            0.17         0.36           0.15        0.20
       1500000            7.43        40.99           1.31        3.08

Here I used no compression and judge by REAL time because the data are so artificial. Your experience WILL vary. I expect FIRSTOBS= is the winner because the system has less to do - it just collects all rows following the start point - whereas index processing makes no assumption about the order of index values and checks the index for every row that satisfies the WHERE. Clearly the performance on my old machine varied greatly when the size of the subset grew to 1.5M rows, 15% of the data - and notice in the logs below that MSGLEVEL=I indicates the index was still used.

The SAS log showing the eight data steps follows:

   280  %let prior_posn = %sysevalf( 1e7 - 1500 ) ;
   281  option msglevel=i ;
   282  data to_be_analysed ;
   283    set biggie( firstOBS=%eval(1+&prior_posn) ) ;
   284  run ;

   NOTE: There were 1500 observations read from the data set WORK.BIGGIE.
   NOTE: The data set WORK.TO_BE_ANALYSED has 1500 observations and 11 variables.
   NOTE: DATA statement used (Total process time):
         real time           0.08 seconds
         cpu time            0.04 seconds

   285
   286  data to_be_analysed ;
   287    set biggie( where=( rows >&prior_posn) ) ;
   INFO: Index rows selected for WHERE clause optimization.
   288  run ;

   NOTE: There were 1500 observations read from the data set WORK.BIGGIE.
         WHERE rows>9998500;
   NOTE: The data set WORK.TO_BE_ANALYSED has 1500 observations and 11 variables.
   NOTE: DATA statement used (Total process time):
         real time           0.12 seconds
         cpu time            0.03 seconds

   289  %let prior_posn = %sysevalf( 1e7 - 15000 ) ;
   290  option msglevel=i ;
   291  data to_be_analysed ;
   292    set biggie( firstOBS=%eval(1+&prior_posn) ) ;
   293  run ;

   NOTE: There were 15000 observations read from the data set WORK.BIGGIE.
   NOTE: The data set WORK.TO_BE_ANALYSED has 15000 observations and 11 variables.
   NOTE: DATA statement used (Total process time):
         real time           0.04 seconds
         cpu time            0.04 seconds

   294  data to_be_analysed ;
   295    set biggie( where=( rows >&prior_posn) ) ;
   INFO: Index rows selected for WHERE clause optimization.
   296  run ;

   NOTE: There were 15000 observations read from the data set WORK.BIGGIE.
         WHERE rows>9985000;
   NOTE: The data set WORK.TO_BE_ANALYSED has 15000 observations and 11 variables.
   NOTE: DATA statement used (Total process time):
         real time           0.05 seconds
         cpu time            0.03 seconds

   297  %let prior_posn = %sysevalf( 1e7 - 150000 ) ;
   298  option msglevel=i ;
   299  data to_be_analysed ;
   300    set biggie( firstOBS=%eval(1+&prior_posn) ) ;
   301  run ;

   NOTE: There were 150000 observations read from the data set WORK.BIGGIE.
   NOTE: The data set WORK.TO_BE_ANALYSED has 150000 observations and 11 variables.
   NOTE: DATA statement used (Total process time):
         real time           0.17 seconds
         cpu time            0.15 seconds

   302  data to_be_analysed ;
   303    set biggie( where=( rows >&prior_posn) ) ;
   INFO: Index rows selected for WHERE clause optimization.
   304  run ;

   NOTE: There were 150000 observations read from the data set WORK.BIGGIE.
         WHERE rows>9850000;
   NOTE: The data set WORK.TO_BE_ANALYSED has 150000 observations and 11 variables.
   NOTE: DATA statement used (Total process time):
         real time           0.36 seconds
         cpu time            0.20 seconds

   305  %let prior_posn = %sysevalf( 1e7 - 1500000 ) ;
   306  option msglevel=i ;
   307  data to_be_analysed ;
   308    set biggie( firstOBS=%eval(1+&prior_posn) ) ;
   309  run ;

   NOTE: There were 1500000 observations read from the data set WORK.BIGGIE.
   NOTE: The data set WORK.TO_BE_ANALYSED has 1500000 observations and 11 variables.
   NOTE: DATA statement used (Total process time):
         real time           7.43 seconds
         cpu time            1.31 seconds

   310  data to_be_analysed ;
   311    set biggie( where=( rows >&prior_posn) ) ;
   INFO: Index rows selected for WHERE clause optimization.
   312  run ;

   NOTE: There were 1500000 observations read from the data set WORK.BIGGIE.
         WHERE rows>8500000;
   NOTE: The data set WORK.TO_BE_ANALYSED has 1500000 observations and 11 variables.
   NOTE: DATA statement used (Total process time):
         real time           40.99 seconds
         cpu time            3.08 seconds
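For anyone who wants to repeat the comparison, the eight data steps in the log above could be generated by a small macro along these lines (a sketch of the pattern, not necessarily how I ran the test):

```sas
%macro compare( n ) ;
  %let prior_posn = %sysevalf( 1e7 - &n ) ;

  /* sequential read from a known starting position */
  data to_be_analysed ;
    set biggie( firstobs=%eval( 1 + &prior_posn ) ) ;
  run ;

  /* indexed WHERE= subset selecting the same rows */
  data to_be_analysed ;
    set biggie( where=( rows > &prior_posn ) ) ;
  run ;
%mend compare ;

option msglevel=i ;   /* report index usage in the log */
%compare( 1500 )
%compare( 15000 )
%compare( 150000 )
%compare( 1500000 )
```

Each invocation reads the same subset twice, so the REAL and CPU times for the two techniques can be compared directly from the log.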