BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
Ronein
Onyx | Level 15

Hello

In Data warehouse (Tera data ) there is a table that contain millions or rows (800 million rows) and many variables.

I want to work with this table (a few columns and for specific dates) in SAS.

Which code do you think will take less run time ??Option1 or option2?

OR maybe there is another way that can take less run time?

 

 

DATA V0500_option1;
SET teradata.V0500_1_FINANCIAL_EVENT(KEEP=Agreement_Id  Event_Date  Start_Date  Event_Time  Type_Code
WHERE=(Event_Date>='01SERP2020'd AND Type_Code  IN  (11002,11004,11050,11024,31510,11005)    AND Agreement_Id ne  -1 
 ));
Run;



 
proc sql;  
connect to teradata (server=dwprod schema=DWP_vall authdomain=TeraDataAuth);
create table V0500_option2  as 
select * from connection to teradata
(
SELECT   distinct Agreement_Id,Event_Date,Start_Date,Event_Time,Type_Code
from V0500_1_FINANCIAL_EVENT
where Event_Date>= date '2020-09-01'    
      AND Type_Code  IN  (11002,11004,11050,11024,31510,11005)   
	  AND Agreement_Id ne  -1  
);
disconnect from teradata;
quit ;
1 ACCEPTED SOLUTION

Accepted Solutions
Kurt_Bremser
Super User

What did your tests reveal?

 

Keep in mind that nobody here has an exact copy of your environment, so you are the best source of information about relative performance.

As always, Maxim 4.

 

As a side note, why do you use DISTINCT in the pass-through? If it's not absolutely necessary, it's only a performance drain.

View solution in original post

16 REPLIES 16
SASKiwi
PROC Star

Typically your Option 2 Passthru should be quicker as you can be sure filtering happens in Teradata before downloading. However why are you using DISCRETE which will only slow processing? I never use this if I can avoid it.

 

Also how many rows do you expect from this query? Even if your query selects only 1% of the rows that is still 8 million and will take significant time to select and download to SAS. Do a COUNT(*) first to see how many rows will be downloaded. 

Tom
Super User Tom
Super User

Actually I think you have to use special instructions to Teradata if you DON'T want it to return the distinct observations.  At least you do when defining a TABLE.  Its normal mode of operation is pure SET logic.

Ronein
Onyx | Level 15

Sorry I dont understand,

where is discrete? I dont see this word in my code, can you show me please how to change the code ?

proc sql;  
connect to teradata (server=dwprod schema=DWP_vall authdomain=TeraDataAuth);
create table V0500  as 
select * from connection to teradata
(
SELECT   distinct Agreement_Id,Event_Date,Start_Date,Event_Time,Type_Code
from V0500_1_FINANCIAL_EVENT
where Event_Date>= date '2020-09-01'    
      AND Type_Code  IN  (11002,11004,11050,11024,31510,11005)   
	  AND Agreement_Id ne  -1  
);
disconnect from teradata;
quit ;

 

 

SASKiwi
PROC Star

Apologies, I meant DISTINCT. No doubt you've realised that now.

 

Tom
Super User Tom
Super User

Those two should be equivalent.  Unless you have a lot of duplicate observations that the DISTINCT keyword is eliminating since then there will be fewer observations to transfer.

 

Personally I wouldn't bother to type the extra characters needed to use the WHERE= dataset option and would instead just use a WHERE statement.  The result is the same.

 

Plus when using SAS code instead of Teradata code you don't have to type all of those commas for the IN operator to work. Spaces will do just fine. That can make it much easier to use a macro variable to contain the list of values.

data V0500_option1;
  set teradata.V0500_1_FINANCIAL_EVENT(keep=Agreement_Id Event_Date Start_Date Event_Time Type_Code);
  where Event_Date>='01SEP2020'd
    and Type_Code IN (11002 11004 11050 11024 31510 11005)
    and Agreement_Id ne  -1 
  ;
run;

But the real savings will come from doing even more processing in the Teradata server. So what is it that you want to do with this subset of the data?  Perhaps you can summarize on the Teradata side and transfer even less information between the two servers.

 

Note that some procs like PROC SUMMARY will automatically push some of the processing into the Teradata server.

 

Also note that there is no need to make a NEW connection to Teradata.  Instead you re reuse the connection you made for the libref by using the CONNECT USING syntax in your SQL code.

libname mytd teradata ..... ;
proc sql;  
connect using mytd;
create table V0500_option2  as 
select * from connection to mytd
(select distinct Agreement_Id,Event_Date,Start_Date,Event_Time,Type_Code
 from V0500_1_FINANCIAL_EVENT
 where Event_Date>= date '2020-09-01'    
   and Type_Code IN (11002,11004,11050,11024,31510,11005)   
   and Agreement_Id ne -1  
);
quit ;

 

Ronein
Onyx | Level 15

Thank you,

Few questions please-

1- If using In operator with spaces instead of commas- will it save run time?

2-Create   libref   instead of making NEW connection to Teradata-will it save run time?

3-Regarding your question- yes, I need subset  of rows and few columns only.

Do you recommend me to do it directly in tera? and then how to transfer the data from tera to sas??

Also it is not good idea because the program will run in futures automatically in sas and then I cannot go to tera directly...

 

Ronein
Onyx | Level 15

I didnt understand -

Personally I wouldn't bother to type the extra characters needed to use the WHERE= dataset option and would instead just use a WHERE statement.  The result is the same

Tom
Super User Tom
Super User

@Ronein wrote:

I didnt understand -

Personally I wouldn't bother to type the extra characters needed to use the WHERE= dataset option and would instead just use a WHERE statement.  The result is the same


The code should work the same.  It is just more confusing and difficult for the programmer to have to type the extra () and = and make sure the nesting of the () are properly placed.  It also leads to extremely long statements that also make the code hard to type, edit and understand.

LinusH
Tourmaline | Level 20
  1. In SAS no, in TD, probably not either.
  2. The connection time is probably the least of your concerns. Data processing and movement are.

The only difference between the two is the DISTINCT. If your TD setup is good this operation will go faster there compared to SAS (don't know your SAS setup either...).

 

If you don't want the trouble of writing explicit pass through code, you could use SAS SQL, your query will be passed down automatically to TD.

The most optimal (from performance perspective) is to store the result in TD and continue working there, but then it depends on your use case, what you intend to do with the data next.

If you wish to optimize the query further, speak with your TD DBA to see if anything can be done in TD.

Data never sleeps
Tom
Super User Tom
Super User

@Ronein wrote:

Thank you,

Few questions please-

1- If using In operator with spaces instead of commas- will it save run time?

No. Just reduce user frustration in typing, editing and reading the code.

 

2-Create   libref   instead of making NEW connection to Teradata-will it save run time?

Minimally.  You can skip the time it takes to make a new connection.  It will make you program easier to maintain since there will only be one place where you have to place your signon information.  The Teradata DBA will probably be happier with you for putting less burden on their system.

3-Regarding your question- yes, I need subset  of rows and few columns only.

That was not the question. The question is WHY do you need the subset? What do you intend to do with it next? If the next step is something that would reduce the size of the data then doing that step inside of Teradata will save a lot of data transfer.  Transferring the data is what takes the most time.

Do you recommend me to do it directly in tera? and then how to transfer the data from tera to sas??

Also it is not good idea because the program will run in futures automatically in sas and then I cannot go to tera directly...

Not sure what that means.  If the data is in Teradata then you will always have to go to Teradata to get the data.

 


 

Ronein
Onyx | Level 15
Thanks.
I use it later to merge it with sas data sets so I must stay in sas platform.
Kurt_Bremser
Super User

What did your tests reveal?

 

Keep in mind that nobody here has an exact copy of your environment, so you are the best source of information about relative performance.

As always, Maxim 4.

 

As a side note, why do you use DISTINCT in the pass-through? If it's not absolutely necessary, it's only a performance drain.

Ronein
Onyx | Level 15
Sometimes there are some problems in data and have duplications. Is it better to select all rows in the query that create sas data set from tera table (not using distinct) and only then use sas proc sort nodupkey?
Tom
Super User Tom
Super User

@Ronein wrote:
Sometimes there are some problems in data and have duplications. Is it better to select all rows in the query that create sas data set from tera table (not using distinct) and only then use sas proc sort nodupkey?

I think the point @Kurt_Bremser was trying to make is that eliminating the duplicates can take time.   That is especially true of a system like Teradata that tries to split the task across multiple modules running in parallel.  To eliminate the duplicates it would need to have those modules communicate, thus reducing some (or all) of the performance gained by splitting up the task, even when there are no duplicates to be eliminated.  

 

So if you know there will not be duplicates then do not use the DISTINCT keyword so that Teradata knows it does not have to worry about eliminating them.

 

Which is better depends on the performance of the two systems.  But in general I would say if eliminating duplicates reduces the number of observations that need to move from Teradata to SAS then it is probably worth asking Teradata to do it.  You could ask your local Teradata DBA for help with finding the best Teradata code to use.

 

hackathon24-white-horiz.png

2025 SAS Hackathon: There is still time!

Good news: We've extended SAS Hackathon registration until Sept. 12, so you still have time to be part of our biggest event yet – our five-year anniversary!

Register Now

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 16 replies
  • 9114 views
  • 4 likes
  • 5 in conversation