Re: Creating Teradata Database Tables Using Proc SQL

buechler66 · Posted 08-20-2019 04:57 PM

I'm not sure if we have any Teradata database folks out there, but if we do I have a question for you...

On my project we create Teradata database tables via Proc SQL using the 'Primary Index' option? For example: (FASTLOAD=YES DBCREATE_TABLE_OPTS='PRIMARY INDEX (var1 var2 var3 var4 var5)')?

It seems to me that most of our Teradata tables do not have a large enough number of rows that they would require a primary index, a pre-sort that would benefit us in speed/efficiency, when later querying or joining to these tables later in our SAS programs. In fact it seems that in most cases we are just picking 4 or 5 variables that we think have the greatest number of distinct values and putting them in as parameters to this option.

I'm just curious if any of you might know why we do this when creating Teradata tables?

Reeza · Posted 08-20-2019 05:04 PM

Are you asking for the logic of why indexes are created that way in teradata?

Indexes aren't just a teradata concept so it would be a generic question if that's the case.

buechler66 · Posted 08-20-2019 05:13 PM

Specifically to TeraData as I think it might have some impact on table
storage across Amps (processors) - I think 🤔

Reeza · Posted 08-20-2019 05:18 PM

Ok, well, I'll say that this part of the process is concerning:

In fact it seems that in most cases we are just picking 4 or 5 variables that we think have the greatest number of distinct values and putting them in as parameters to this option.

You don't always need indexes, but when you do have indexes, they should be on variables that are likely to be used as filters or joins.

Looking at queries to see what the JOIN ON key variables are and which variables are likely to be used usually requires knowledge of the data and/subject matter. For example for a health care data base, you're likely pulling by dates, by facility of service or by diagnosis types for starters. You likely don't want indexes on patient ID because that's too unique and you index will be almost as big as your main data set.

You'll probably need to expand on your question and how it relates to Teradata because that's still not clear IMO. That being said, i haven't worked on Teradata, just DB2, MS SQL and Oracle. so feel free to ignore me!

buechler66 · Posted 08-20-2019 05:42 PM

One term I keep hearing is table 'skewness'. I'm not sure I fully understand how that would relate to indexing teradata tables.

Reeza · Posted 08-20-2019 05:46 PM

It means you have data where a few filters or criteria will capture most of the data and a lot of outliers. I suspect that has to do with the distribution of the index variable and making a decision on whether to use it or not. But it's an educated guess at best 🙂

If you do some reading on general indexes, I think that part may make more sense.

buechler66 · Posted 08-20-2019 05:18 PM

However you bring up another separate question for this novice...at what point, in terms of number of rows, does it make sense to index a database table (any database) or SAS dataset? Is there a magic number (e.g. > 20 million rows)?

Reeza · Posted 08-20-2019 05:24 PM

There's no hard and fast rule, and in fact it will almost certainly be related to what type of hardware you're running on. For example if you're using In memory tech, I'm not sure indices are as relevant or even possible.

You can search the topic at LexJansen.com and find many examples.

Honestly, I usually start creating indexes for things that start hitting over 2 million and/or when I start running into issues with speed and want faster performance. But we're often doing things a bit more on the fly than strategically 😉

@buechler66 wrote:

However you bring up another separate question for this novice...at what point, in terms of number of rows, does it make sense to index a database table (any database) or SAS dataset? Is there a magic number (e.g. > 20 million rows)?

buechler66 · Posted 08-26-2019 08:51 AM

Ok, so I've found that the reason we create these indexes is to evenly distribute the teradata table across processors (or AMPs). This is apparently a Teradata-specific storage 'thing'. So my follow up question would be are there cases when creating indexes (on teradata or otherwise) are inefficient? I ask because there is a new Teradata option where I can specify DBCREATE_TABLE_OPTS='NO PRIMARY INDEX' which randomly spreads the table across processors without creating a primary index.

If this is the case then perhaps in some instances we are creating teradata tables with indexes that aren't specifically targeting how the table will be queried later in the program. This would be inefficient, right? Also doesn't the creation of an index take resources in itself? Like isn't the database having to sort the table? Or is the index itself and object on the database that takes up space resources?

Reeza · Posted 08-26-2019 11:15 AM

If this is the case then perhaps in some instances we are creating teradata tables with indexes that aren't specifically targeting how the table will be queried later in the program. This would be inefficient, right?

Yes, indexes take up space and creating/maintaining indexes is work

Also doesn't the creation of an index take resources in itself?

Yes

Like isn't the database having to sort the table? Or is the index itself and object on the database that takes up space resources

Indexes are essentially objects. I like to think of a DB as a really large textbook. The index is the table of glossary at the back which has the key words and you can search by the word in the back and know exactly what page you need to jump to in the DB. I don't consider it the Table of Contents because that's manually curated, whereas an index is more a data driven approach and doesn't order based on context but follows a series of rules. It takes up resources to create, store, maintian and use an index. There are cases where an index can decrease efficiency - you'll need to test for your use cases.

That all being said, RDBMS are starting to get a lot better at storage and retrieval and some of the new tech is meaning that everything is in memory so indices may not be as useful in those cases. You have to consider your data storage, retrieval and usage when deciding which may be the best approach. And this is why data architects still have jobs 🙂

buechler66 · Posted 08-26-2019 06:47 PM

Are indexes equal in size/space to the table? Are they simply a separate,
sorted version of the table?

I think when sorting a sas dataset that sas creates a temporary version of
the table that I think is the same size of the table being sorted - however
it's only a temporary table and is removed when the sort is completed I
think.

Tom · Posted 08-26-2019 10:29 PM

Tradition RDMS indexes have nothing to do with Teradata's PRIMARY INDEX.

As you have stated Teradata will distribute the data to the AMPS (logical processing units that provide massive parallelism that make TD queries run so fast). You want all of the related data together on the same AMP so that cross AMP communication is reduced. You want to avoid SKEWing the data so that each AMP has approximately the same amount of data to process. Again this makes the queries run faster and the avoids overloading one AMP so that it runs out of storage even while the other AMPs have plenty of space to load more data.

Note that do NOT have to define a PRIMARY INDEX. You can let Teradata distribute the data across the AMPS randomly. Unless you are running a very old version of Teradata.

Deciding what variable(s) to use as the PRIMARY INDEX is probably more of an art than a science. But I am sure your Teradata support team can help you with any specific tables that you want to analyze and suggest how to proceed.

ChrisNZ · Posted 08-26-2019 09:34 PM

@Reeza > For example if you're using In memory tech, I'm not sure indices are as relevant or even possible.

In my experience, loading a data set in memory using SASFILE will conserve the usability of existing indexes, and these indexes still help a lot speeding up random reads.

High-Performance SAS Coding - Third Edition

Classroom Training Available!