Demoxe
Calcite | Level 5

Hello.

I have this situation: there are two similar tables in SAS on the same server. Both have about 300 million observations and a simple index on a numeric field holding date values. In the first table the index "load_date1" has 2000 unique values; in the second table the index "load_date2" has 80 unique values. In both tables the observations are distributed fairly evenly across the index values. So both indexes should speed up queries with a WHERE expression significantly, but this holds only for the second table, and I have no idea why. Both tables are unsorted, but data was appended incrementally with increasing index values, so the distribution of data by index value should not differ much between them. Here is some information on the tables (I had to change it a bit):

Data Set Name        table_1                           Observations           294416780
Member Type          DATA                              Variables              10
Engine               V9                                Indexes                1
Created              Friday, 05 April 2013 05:23:54    Observation Length     140
Last Modified        Friday, 05 April 2013 05:23:54    Deleted Observations   0
Protection                                             Compressed             CHAR
Data Set Type                                          Reuse Space            NO
Label                                                  Point to Observations  YES
Data Representation  WINDOWS_64                        Sorted                 NO
Encoding             wcyrillic  Cyrillic (Windows)

Data Set Page Size          131072
Number of Data Set Pages    192865
Index File Page Size        4096
Number of Index File Pages  581713
Number of Data Set Repairs  0
Filename                    somewhere\table_1.sas7bdat
Release Created
Host Created
Data Set Name        table_2                           Observations           297622992
Member Type          DATA                              Variables              33
Engine               V9                                Indexes                1
Created              Wednesday, 03 April 2013 00:58:54 Observation Length     289
Last Modified        Wednesday, 03 April 2013 00:58:54 Deleted Observations   0
Protection                                             Compressed             BINARY
Data Set Type                                          Reuse Space            NO
Label                                                  Point to Observations  YES
Data Representation  WINDOWS_64                        Sorted                 NO
Encoding             wcyrillic  Cyrillic (Windows)

Data Set Page Size          16384
Number of Data Set Pages    1372810
Index File Page Size        4096
Number of Index File Pages  587074
Number of Data Set Repairs  0
Filename                    somewhere\table_2.sas7bdat
Release Created
Host Created

And the query looks like this (to get data for the last month):

data result;
     set table_2(idxname=load_date2);
     where load_date2 >= &date_month_ago;
run;

Does anybody have any idea why this kind of code works with the second table (it runs for 1-2 minutes) but not with the first one (it runs for about 50 minutes, sometimes even longer than without the index)?

13 REPLIES
Doc_Duke
Rhodochrosite | Level 12

This may be key:  "data was added in them incrementally with increasing values of indexes,"

In a number of database systems, when data are appended, the index effectively falls back to a linear search (i.e. it does not really use the index) for observations after the first appended one.  If SAS works that way, then dropping the index and rebuilding it will cure the problem.
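For reference, a rebuild along those lines could look like this (the library name is a placeholder; table and index names are taken from the question):

```sas
/* drop and recreate the simple index on load_date1;
   "mylib" is a placeholder for the actual library */
proc datasets library=mylib nolist;
   modify table_1;
   index delete load_date1;
   index create load_date1;
quit;
```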

If you try it, let the list know what you find out.

Doc Muhlbaier

Duke

Demoxe
Calcite | Level 5

The index was dropped and rebuilt; alas, there is no improvement.

Peter_C
Rhodochrosite | Level 12

Demoxe

try PROC CONTENTS with the option CENTILES

it will demonstrate the distribution - which is likely to be part of the data used by the optimisation engine to decide whether an index provides the fastest way to deliver the result data set.

Before each query using the indexed table(s), use

     option MSGLEVEL=i ;

to get feedback on the choice of indexes made by optimisation engines.

On PROC SQL, use the _METHOD option to see whether it chooses an index join or a sort-merge join.
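For example (library and table names are assumptions; &date_month_ago as in the original question):

```sas
options msglevel=i;                          /* log which index, if any, is chosen */

proc contents data=mylib.table_1 centiles;   /* show index value distribution */
run;

proc sql _method;                            /* print the chosen access method(s) */
   create table result as
   select *
   from mylib.table_1
   where load_date1 >= &date_month_ago;
quit;
```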

I would be surprised if COMPRESS=BINARY makes table_2's index unsuitable (table_1 has COMPRESS=CHAR).

The index page size is 4K on both tables, and there is a similar number of index pages in each.

However, table_2 has 1.37M pages of data compared with 0.19M pages for table_1.

I think this is enough to make the difference.

The indexes point to data set pages.

The shape of the observations and indexes is clearly very different between the tables, so their index use is likely to differ too.

If you want to treat them in a similar way, give the tables a similar PAGESIZE - and for that volume of data I would not recommend anything as small as table_2's 16K. table_1 has a 128K page size.

When reading only a small (under 5%) subset, small pages are helpful, but that is a lot of pages to shuffle for a larger subset.
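A minimal sketch of rebuilding with a larger page size (the BUFSIZE= data set option sets the page size of the data set being created; the index has to be recreated afterwards; names and the 128K value are for illustration):

```sas
/* copy table_2 into a new data set with a 128K page size,
   keeping its binary compression */
data mylib.table_2_new(bufsize=131072 compress=binary);
   set mylib.table_2;
run;
```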

good luck looking further into your data-usage.

hope we'll hear more updates after you have tried some of these ideas

peterC


Demoxe
Calcite | Level 5

As for centiles, I have this distribution for table_1:

  1. 14751405
  2. 14435254
  3. 14568859
  4. 15172170
  5. 14978557
  6. 14797476
  7. 14705349
  8. 14890806
  9. 14747523
  10. 14805950
  11. 14709806
  12. 14766747
  13. 14675302
  14. 14949089
  15. 14707188
  16. 14799584
  17. 14919089
  18. 14604405
  19. 14806380
  20. 14974194

As you can see, the distribution is quite even.

I've already used the MSGLEVEL=i option, and the log reported that "load_date1" is used as the index.

As for changing the PAGESIZE of table_1, I'm not sure how to do it and wasn't able to find it in the Help or elsewhere. Can you tell me how?

And I'm sorry for the delay in my answers; I have to wait for assistance from our admins, and they are not so eager to help.

Peter_C
Rhodochrosite | Level 12

these are unusual date values!

what percentage of the table is extracted by

where load_date2>=&date_month_ago;

????

when it uses an index to read the data, it practically executes the output stage of a sort, returning data in index order

does it need to visit a great many pages to get the one month's data?

Demoxe
Calcite | Level 5

I've posted the number of observations for each centile, not the boundary values :) And this query gets about 1-2% of all data.

Data is added with increasing index values each time, so a condition like "where load_date2>=&date_month_ago" extracts only the last small part of table_1 (not bits and pieces from all over the table).

LinusH
Tourmaline | Level 20

Please share the log for each query (with msglevel=i).

Also adding fullstimer could give some hints about bottlenecks in the system.

You also may look in Windows task manager during the queries, to see the total usage of system resources.

And just to test the read performance, use data _null_, so you can rule out any write data bottlenecks.

But even if you/we can't get to the bottom of this, there is always the option of moving your data to SPDE, which will probably improve loading performance and will definitely improve querying.
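A minimal SPDE sketch, with placeholder paths, assuming write access to the directories:

```sas
/* metadata, data and index partitions can sit on different disks */
libname spdelib spde 'C:\temp\spde_meta'
   datapath=('C:\temp\spde_data1' 'C:\temp\spde_data2')
   indexpath=('C:\temp\spde_index');

/* load the table and build the index in the SPDE library */
data spdelib.table_1(index=(load_date1));
   set mylib.table_1;
run;
```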

Data never sleeps
Demoxe
Calcite | Level 5

12         options fullstimer;
13
14         data _NULL_;
15         set table_1(idxname=load_date1);
16         where load_date1>=&date_month_ago;
INFO: Index load_date1 selected for WHERE clause optimization.
17         run;

NOTE: There were 3746291 observations read from the data set table_1.
      WHERE load_date1>=19446;
NOTE: DATA statement used (Total process time):
      real time           52:41.22
      user cpu time       25.17 seconds
      system cpu time     1:58.37
      memory              256276.06k
      OS Memory           399160.00k
      Timestamp           12.04.2013 14:50:54

18
19         data _NULL_;
20         set table_2(idxname=load_date2);
21         where load_date2>=&date_month_ago;
INFO: Index load_date2 selected for WHERE clause optimization.
22         run;

NOTE: DATA statement used (Total process time):
      real time           27.91 seconds
      user cpu time       11.77 seconds
      system cpu time     3.10 seconds
      memory              34019.84k
      OS Memory           52404.00k
      Timestamp           12.04.2013 14:51:22

NOTE: There were 6973693 observations read from the data set table_2.
      WHERE load_date2>=19446;

In any case, the server can't have been so busy during the first query and so free during the second as to cause such a difference in process time. As for SPDE... it's a solution indeed, but in my case I don't think the admins will help me with that.

LinusH
Tourmaline | Level 20

Even though Peter mentioned differences between your tables, this is still a mystery.

To find out whether this is an index problem, read the same number of observations from each table as your index search returned:

data _null_;
     set table_1(obs=3746291);
run;

And the corresponding obs= for table_2.

I'm not sure what kind of help you need from your admins. Technically, it sounds like you should be able to define an SPDE library yourself. Just create some directories, and then lib it up...!? If you lack permissions, just set it up in C:\temp for testing purposes...


Data never sleeps
Peter_C
Rhodochrosite | Level 12

Demoxe

I see that you force the use of the index.

Would the index have been chosen by SAS if you had not demanded it?

It is possible that SAS has some good reason to reject use of the index

- and we have yet to identify that reason.


peterC

Peter_C
Rhodochrosite | Level 12

An alternative approach to this use of indexes might be possible and might greatly improve access times.

The index is used to retrieve rows added since "some point in time" using LOAD_DATE.

Without deleted rows, and without the REUSE option in effect, INSERT has the effect of APPEND, adding new rows at the end of the table.

So, assuming LOAD_DATE represents the order in which rows are added to the table, then rather than using the index, the same results could be achieved with the FIRSTOBS= data set option on a SET statement.

Obtaining the value to use for FIRSTOBS= could become part of the data-loading process in future. Historic load counts could easily be obtained in a single one-off pass. Keep these in a table with load_date and row_count; since this table of load dates and row counts is small, the FIRSTOBS= value can be derived whenever needed (cumulative row_count + 1).
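A sketch of that bookkeeping, assuming a small hypothetical table LOAD_HISTORY with one row per load (columns LOAD_DATE and ROW_COUNT) maintained by the loading process:

```sas
/* count the rows loaded before the cutoff;
   the target rows start right after them */
proc sql noprint;
   select sum(row_count) into :prior_rows
   from load_history
   where load_date < &date_month_ago;
quit;

data result;
   set table_1(firstobs=%eval(&prior_rows + 1));
run;
```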

To prove the point about FIRSTOBS= versus WHERE= with the index, I prepared a 1.5GB table of 10M rows with a bland mixture of strings, dates and a row number.

The table BIGGIE has 10M rows, created by

data biggie ;
     length rows 8 string1-string5 $20 ;
     format date1-date5 date9. ;
     do rows = 1 to 1e7 ;
          output ;
     end;
run;

I prepared a unique index on ROWS with

proc datasets details ; run ;
     modify biggie ;
     index create rows / unique ;
run ;
quit ;

proc datasets details ; run ;

reveals the filesize of data and index are

                 Member   Obs, Entries
#  Name          Type     or Indexes   Vars  Label    File Size   Last Modified

1  BIGGIE        DATA     10000000     11            1536025600   15-Apr-13 10:13:37
   BIGGIE        INDEX    1                           161702912   15-Apr-13 10:13:37

The data and index are created with the defaults for buffers and pages.


I have run separate extracts with FIRSTOBS= and WHERE= for 1500, 15000, 150000 and 1500000 row subsets.
FIRSTOBS is a clear winner:

subset size   REAL FIRSTOBS   REAL WHERE   CPU FIRSTOBS   CPU WHERE
1500          0.08            0.12         0.04           0.02
15000         0.04            0.05         0.05           0.02
150000        0.17            0.36         0.15           0.20
1500000       7.43            40.99        1.31           3.08

Here I presume no compression is in use, and I judge by REAL time because the data are so artificial.
Your experience WILL vary.

I expect FIRSTOBS is the winner because the system has less to do - just collect all rows following the start point -
while index processing
   makes no assumption about the order of index values
   checks the index for every row that satisfies the WHERE

Clearly the performance on my old machine varied greatly when the size of the subset grew to 1.5M, being 15% of the data - and notice in the logs below that MSGLEVEL=i indicates the index was still used.
The SAS log showing the 8 data steps follows:

280  %let prior_posn = %sysevalf( 1e7 - 1500 ) ;
281  option msglevel=i ;
282  data to_be_analysed ;
283    set biggie( firstOBS=%eval(1+&prior_posn) ) ;
284  run ;

NOTE: There were 1500 observations read from the data set WORK.BIGGIE.
NOTE: The data set WORK.TO_BE_ANALYSED has 1500 observations and 11 variables.
NOTE: DATA statement used (Total process time):
      real time           0.08 seconds
      cpu time            0.04 seconds


285
286  data to_be_analysed ;
287    set biggie( where=( rows >&prior_posn) ) ;
INFO: Index rows selected for WHERE clause optimization.
288  run ;

NOTE: There were 1500 observations read from the data set WORK.BIGGIE.
      WHERE rows>9998500;
NOTE: The data set WORK.TO_BE_ANALYSED has 1500 observations and 11 variables.
NOTE: DATA statement used (Total process time):
      real time           0.12 seconds
      cpu time            0.03 seconds


289  %let prior_posn = %sysevalf( 1e7 - 15000 ) ;
290  option msglevel=i ;
291  data to_be_analysed ;
292    set biggie( firstOBS=%eval(1+&prior_posn) ) ;
293  run ;

NOTE: There were 15000 observations read from the data set WORK.BIGGIE.
NOTE: The data set WORK.TO_BE_ANALYSED has 15000 observations and 11 variables.
NOTE: DATA statement used (Total process time):
      real time           0.04 seconds
      cpu time            0.04 seconds


294  data to_be_analysed ;
295    set biggie( where=( rows >&prior_posn) ) ;
INFO: Index rows selected for WHERE clause optimization.
296  run ;

NOTE: There were 15000 observations read from the data set WORK.BIGGIE.
      WHERE rows>9985000;
NOTE: The data set WORK.TO_BE_ANALYSED has 15000 observations and 11 variables.
NOTE: DATA statement used (Total process time):
      real time           0.05 seconds
      cpu time            0.03 seconds


297  %let prior_posn = %sysevalf( 1e7 - 150000 ) ;
298  option msglevel=i ;
299  data to_be_analysed ;
300    set biggie( firstOBS=%eval(1+&prior_posn) ) ;
301  run ;

NOTE: There were 150000 observations read from the data set WORK.BIGGIE.
NOTE: The data set WORK.TO_BE_ANALYSED has 150000 observations and 11 variables.
NOTE: DATA statement used (Total process time):
      real time           0.17 seconds
      cpu time            0.15 seconds


302  data to_be_analysed ;
303    set biggie( where=( rows >&prior_posn) ) ;
INFO: Index rows selected for WHERE clause optimization.
304  run ;

NOTE: There were 150000 observations read from the data set WORK.BIGGIE.
      WHERE rows>9850000;
NOTE: The data set WORK.TO_BE_ANALYSED has 150000 observations and 11 variables.
NOTE: DATA statement used (Total process time):
      real time           0.36 seconds
      cpu time            0.20 seconds


305  %let prior_posn = %sysevalf( 1e7 - 1500000 ) ;
306  option msglevel=i ;
307  data to_be_analysed ;
308    set biggie( firstOBS=%eval(1+&prior_posn) ) ;
309  run ;

NOTE: There were 1500000 observations read from the data set WORK.BIGGIE.
NOTE: The data set WORK.TO_BE_ANALYSED has 1500000 observations and 11 variables.
NOTE: DATA statement used (Total process time):
      real time           7.43 seconds
      cpu time            1.31 seconds


310  data to_be_analysed ;
311    set biggie( where=( rows >&prior_posn) ) ;
INFO: Index rows selected for WHERE clause optimization.
312  run ;

NOTE: There were 1500000 observations read from the data set WORK.BIGGIE.
      WHERE rows>8500000;
NOTE: The data set WORK.TO_BE_ANALYSED has 1500000 observations and 11 variables.
NOTE: DATA statement used (Total process time):
      real time           40.99 seconds
      cpu time            3.08 seconds

Patrick
Opal | Level 21

In addition to what Peter said, it might also be worth increasing the index page size (the IBUFSIZE option), and experimenting with IBUFNO could also have a positive effect.

SAS(R) 9.2 Language Reference: Dictionary, Fourth Edition
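For instance (values are only examples; IBUFSIZE takes effect when an index file is created, so the index must be rebuilt for it to apply; the library name is a placeholder):

```sas
options ibufsize=32767 ibufno=5;   /* example values, not recommendations */

proc datasets library=mylib nolist;
   modify table_1;
   index delete load_date1;
   index create load_date1;        /* rebuild with the new index page size */
quit;
```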

Demoxe
Calcite | Level 5

I've tried the "ibufsize=max" and "ibufno=max" options and they haven't changed anything either. Maybe I need to set some specific values?

