I looked around for an answer to this question. If I have one index called ENROLID for a very large data set or 10 indexes, one of which is ENROLID, does the query where I select a list of ENROLID values from the data set with one index differ in speed from the same query on the data set with 9 other indexed fields?
Do I need to pick and choose my indexes carefully, or should I throw in every possible index I might need?
I would worry more about the cost of creating and maintaining the indexes. Each operation updating, inserting, or deleting a record must also adjust the indexes.
These get updated once per month.
I am worried about the query because it is taking hours. I only need one index, so if it would be faster with only one, that is what I want to have.
In my opinion, if your query is handled by SAS/SQL, the number of indexes is unlikely to be the problem. If this is about a Pass-Through query, then optimization is up to your DBMS.
1. SAS dataset or DBMS table?
2. How many rows?
3. How many fields and bytes per record?
4. How complex is the query?
I very much doubt the existence of the indexes is causing your problems.
1. SAS dataset or DBMS table?
I am using the SAS Scalable Performance Data Engine.
2. How many rows?
Some examples for the files I am accessing.
1,794,533,270
5,080,553,136
1,602,072,953
Largest 10,125,182,001
3. How many fields and bytes per record?
largest Data
Observations 10125182001
Data Partsize 53477375K
There are three parts, so multiply by three
That looks like 15 bytes per record.
4. How complex is the query?
I use DS2. Create a hash out of my lookup table which has one field: the index field, then load the larger table and output if index value is found.
f = h.find(); if (f = 0) then output; proc ds2; data cpt_enrolid (overwrite=yes); dcl double rc f enrolid; declare char(7) proc1; declare char(8) cpt; dcl package hash h(8, '{select distinct proc1 from work.cpt}'); method init(); rc = h.keys([proc1]); rc = h.data([proc1]); rc = h.defineDone(); end; method run(); set {select enrolid, proc1, proctype from DWSC1116.CEMR_O_2012_2016}; f = h.find(); if (f = 0) then output; end; enddata; run; quit;
There is no database index used in your procedure. My guess is that most of the time is spent reading the whole data table, which might be wasteful, depending on the size of work.cpt.
If the large table is indexed on proc1, the simple query
create table cpt_enrolid as
select enrolid, proc1, proctype from DWSC1116.CEMR_O_2012_2016
where proc1 in (select proc1 from work.cpt);
might execute faster.
hth.
I appreciate your suggestion and used it. The speed did increase. However I noticed something odd. When I use a sub query in the where statement as you suggest, the speed is not as fast as when I take your idea one step further and use a simple string instead of the sub query. This is not what I had expected. But I will start using the SAS macro facility to replace the sub query with a text string on my queries. Three minutes vs three hours to get the same 243570 observations.
Is this a SAS bug or am I missing something?
243570 rows and 3 columns.
1584 1585 proc sql _method; 1586 create table test as 1587 select enrolid, proc1, proctyp from DWSC1116.CEMR_O_2012_2016 where 1588 PROC1 in ('64615') 1589 ; NOTE: SQL execution methods chosen are: sqxcrta sqxsrc( DWSC1116.CEMR_O_2012_2016 ) whinit: WHERE (PROC1='64615') whinit: INDEX PROC1 uses 13% of segs (WITHIN maxsegratio 75%) whinit returns: ALL EVAL1(w/SEGLIST) NOTE: Table WORK.TEST created, with 243570 rows and 3 columns. 1590 1591 quit; NOTE: PROCEDURE SQL used (Total process time): real time 3:31.14 user cpu time 1.51 seconds system cpu time 10.01 seconds memory 161854.62k OS Memory 191904.00k Timestamp 07/25/2018 02:45:25 PM Step Count 79 Switch Count 1 1592 1593 proc sql _method; 1594 create table test as 1595 select enrolid, proc1, proctyp from DWSC1116.CEMR_O_2012_2016 where 1596 PROC1 in (select distinct PROC1 from search_results) 1597 ; NOTE: SQL execution methods chosen are: sqxcrta sqxfil sqxsrc( DWSC1116.CEMR_O_2012_2016 ) NOTE: SQL subquery execution methods chosen are: sqxsubq sqxunqs sqxsrc( WORK.SEARCH_RESULTS ) NOTE: Table WORK.TEST created, with 243570 rows and 3 columns. 1598 1599 quit; NOTE: PROCEDURE SQL used (Total process time): real time 1:11:16.42 user cpu time 13:11.53 system cpu time 5:04.06 memory 6516.34k OS Memory 36260.00k Timestamp 07/25/2018 03:58:42 PM Step Count 80 Switch Count 0
When your query says
select enrolid, proc1, proctyp from DWSC1116.CEMR_O_2012_2016 where PROC1 in ('64615')
and your table is indexed on PROC1, SAS knows where those records are in your large table and only needs to read those. The time required becomes proportional to the size of the result table and not the size of the database table.
SAS should be able to use the index in the other query that uses the subquery.
I agree with @PGStats. Also, indexes tend to be more useful on very diverse fields. An index on "Customer Number", where you frequently have to look up individual numbers, can be fantastically useful. An index on "Gender" or "State", where there are a large number of repeated values, tends to be pretty much useless.
Tom
Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9. Sign up by March 14 for just $795.
Need to connect to databases in SAS Viya? SAS’ David Ghan shows you two methods – via SAS/ACCESS LIBNAME and SAS Data Connector SASLIBS – in this video.
Find more tutorials on the SAS Users YouTube channel.