Re: Does the speed of a query decrease with the number of indexes?

harlananelson · Posted 07-19-2018 11:09 AM

I looked around for an answer to this question. If I have one index called ENROLID for a very large data set or 10 indexes, one of which is ENROLID, does the query where I select a list of ENROLID values from the data set with one index differ in speed from the same query on the data set with 9 other indexed fields?

Do I need to pick and choose my indexes carefully, or should I throw in every possible index I might need?

PGStats · Posted 07-19-2018 02:15 PM

I would worry more about the cost of creating and maintaining the indexes. Each operation updating, inserting, or deleting a record must also adjust the indexes.

PG

harlananelson · Posted 07-19-2018 02:31 PM

These get updated once per month.

harlananelson · Posted 07-19-2018 02:38 PM

I am worried about the query because it is taking hours. I only need one index, so if it would be faster with only one, that is what I want to have.

PGStats · Posted 07-19-2018 03:45 PM

In my opinion, if your query is handled by SAS/SQL, the number of indexes is unlikely to be the problem. If this is about a Pass-Through query, then optimization is up to your DBMS.

PG

TomKari · Posted 07-19-2018 03:48 PM

1. SAS dataset or DBMS table?

2. How many rows?

3. How many fields and bytes per record?

4. How complex is the query?

I very much doubt the existence of the indexes is causing your problems.

harlananelson · Posted 07-19-2018 04:35 PM

Re: Does the speed of a query decrease with the number of indexes?

1. SAS dataset or DBMS table?

I am using the SAS Scalable Performance Data Engine.

2. How many rows?

Some examples for the files I am accessing.

1,794,533,270

5,080,553,136

1,602,072,953

Largest 10,125,182,001

3. How many fields and bytes per record?

largest Data

Observations 10125182001

Data Partsize 53477375K

There are three parts, so multiply by three

That looks like 15 bytes per record.

4. How complex is the query?

I use DS2. Create a hash out of my lookup table which has one field: the index field, then load the larger table and output if index value is found.

f = h.find();
if (f = 0) then output;

 

proc ds2;
data cpt_enrolid (overwrite=yes);
dcl double rc f enrolid;
declare char(7) proc1;
declare char(8) cpt;
dcl package hash h(8, '{select distinct proc1 from work.cpt}');
method init();
rc = h.keys([proc1]);
rc = h.data([proc1]);
rc = h.defineDone();
end;
method run();
set {select enrolid, proc1, proctype from DWSC1116.CEMR_O_2012_2016};
f = h.find();
if (f = 0) then output;
end;
enddata;
run;
quit;

PGStats · Posted 07-20-2018 01:18 AM

There is no database index used in your procedure. My guess is that most of the time is spent reading the whole data table, which might be wasteful, depending on the size of work.cpt.

If the large table is indexed on proc1, the simple query

create table cpt_enrolid as

select enrolid, proc1, proctype from DWSC1116.CEMR_O_2012_2016
where proc1 in (select proc1 from work.cpt);

might execute faster.

hth.

PG

harlananelson · Posted 07-25-2018 04:11 PM

I appreciate your suggestion and used it. The speed did increase. However I noticed something odd. When I use a sub query in the where statement as you suggest, the speed is not as fast as when I take your idea one step further and use a simple string instead of the sub query. This is not what I had expected. But I will start using the SAS macro facility to replace the sub query with a text string on my queries. Three minutes vs three hours to get the same 243570 observations.

Is this a SAS bug or am I missing something?

243570 rows and 3 columns.

1584
1585  proc sql _method;
1586  create table test as
1587  select enrolid, proc1, proctyp from DWSC1116.CEMR_O_2012_2016 where
1588  PROC1 in ('64615')
1589  ;

NOTE: SQL execution methods chosen are:

      sqxcrta
          sqxsrc( DWSC1116.CEMR_O_2012_2016 )

whinit: WHERE (PROC1='64615')
whinit: INDEX PROC1 uses 13% of segs (WITHIN maxsegratio 75%)
whinit returns: ALL EVAL1(w/SEGLIST)
NOTE: Table WORK.TEST created, with 243570 rows and 3 columns.

1590
1591  quit;
NOTE: PROCEDURE SQL used (Total process time):
      real time           3:31.14
      user cpu time       1.51 seconds
      system cpu time     10.01 seconds
      memory              161854.62k
      OS Memory           191904.00k
      Timestamp           07/25/2018 02:45:25 PM
      Step Count                        79  Switch Count  1


1592
1593  proc sql _method;
1594  create table test as
1595  select enrolid, proc1, proctyp from DWSC1116.CEMR_O_2012_2016 where
1596  PROC1 in (select distinct PROC1 from search_results)
1597  ;

NOTE: SQL execution methods chosen are:

      sqxcrta
          sqxfil
              sqxsrc( DWSC1116.CEMR_O_2012_2016 )

NOTE: SQL subquery execution methods chosen are:

          sqxsubq
              sqxunqs
                  sqxsrc( WORK.SEARCH_RESULTS )
NOTE: Table WORK.TEST created, with 243570 rows and 3 columns.

1598
1599  quit;
NOTE: PROCEDURE SQL used (Total process time):
      real time           1:11:16.42
      user cpu time       13:11.53
      system cpu time     5:04.06
      memory              6516.34k
      OS Memory           36260.00k
      Timestamp           07/25/2018 03:58:42 PM
      Step Count                        80  Switch Count  0

PGStats · Posted 07-25-2018 05:57 PM

When your query says

select enrolid, proc1, proctyp from DWSC1116.CEMR_O_2012_2016 where PROC1 in ('64615')

and your table is indexed on PROC1, SAS knows where those records are in your large table and only needs to read those. The time required becomes proportional to the size of the result table and not the size of the database table.

PG

harlananelson · Posted 07-25-2018 08:57 PM

SAS should be able to use the index in the other query that uses the subquery.

TomKari · Posted 07-19-2018 02:30 PM

I agree with @PGStats. Also, indexes tend to be more useful on very diverse fields. An index on "Customer Number", where you frequently have to look up individual numbers, can be fantastically useful. An index on "Gender" or "State", where there are a large number of repeated values, tends to be pretty much useless.

Tom

Ready to join fellow brilliant minds for the SAS Hackathon?