DATA Step, Macro, Functions and more

Coding Efficiency Where Clause

Accepted Solution Solved
Reply
Occasional Contributor
Posts: 19
Accepted Solution

Coding Efficiency Where Clause

Hello,

 

I have a large dataset and I am looking to run a query on it using a variable which includes inputted comments by an operational based team. I want to keep all records which have a specific word within Description (for example everything with the word 'Test'). The word does not sit in a certain postion and I have chosen to use Contains within the Where clause. There are no other variables i can filter the data down further on.

This has created performance issues and the code takes a considerable amount of time to run. Is there a more efficient method then using the contains function? I have kept the minimum number of variables possible in my outputted dataset (see current code below). 

 

PROC SQL;
CREATE TABLE WORK.QUERY AS
SELECT t1.Identifier,
t1.date,
t1.var,
t1.DESCRIPTION
FROM data.set t1
WHERE t1.DESCRIPTION CONTAINS 'Test';
QUIT;

 


Accepted Solutions
Solution
‎08-05-2016 09:11 AM
PROC Star
Posts: 1,759

Re: Coding Efficiency Where Clause

If the data is not in SAS the most important performance aspect is to ensure that only the necessary data is passed on to SAS.

So you must ensure that the where clasue you use will be passed on the the DBMS, not run in SAS.

The most universally supported SQL operator must be:

where DESCRIPTION like '%Test%'

 

 

 

View solution in original post


All Replies
Super User
Super User
Posts: 7,942

Re: Coding Efficiency Where Clause

Try using Base SAS, SQL has certain resource issues when dealing with large data:

data want;
  set have (where=(index(description,"Test") keep=identifier date var description);
run;

Alternatively you don't need to create a dataset, you can create a view - this is basically storing only the code to view results:

proc sql;
  create view WANT as
  select...;
quit;
Super User
Posts: 5,424

Re: Coding Efficiency Where Clause

A alternative to CONTAINS is the scan function, but I would be surprised if that would be much more efficient.
Unlike @RW9 I don't think that a straight SQL with selct-from-where is any less efficient than a data step, but it doesn't hurt to compare.
Are you sure that the CONTAINS operation is the resource monster? Adding
Options fullstimer;
will hint you.
Data never sleeps
Super User
Posts: 7,762

Re: Coding Efficiency Where Clause

Just ran a test with two similar sized datasets (to avoid performance advantages caused by data residing in cache after the first read) and found that a straight SQL with contains is only marginally slower than a data step with a where= dataset option. And that may have been caused by different loads on the SAS server during the two steps. Call it "within statistical noise".

 

So, to dig further:

What is the (logical and physical) size of your input dataset?

How long does a simple data _null_ step take that just reads the dataset?

Where does your dataset reside (locally, remote, in a DBMS)?

 

---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
Occasional Contributor
Posts: 19

Re: Coding Efficiency Where Clause

Posted in reply to KurtBremser

Thanks for this.

 

I have run other steps off of this dataset. For example a SQL step using case whens and a simple where statement restricting on specific components of a different variable runs in around 10 minutes.

 

The SAS dataset is stored in a DBMS. I am not sure how to work out the size of it as it does not come up in contents and dataset procedures the number 9.0071993E15 came up when I looked for observations.

Super User
Posts: 7,762

Re: Coding Efficiency Where Clause

So the dataset is stored in a DBMS. As these often reside on other hosts, you may have to consider network bandwidth.

One other thing to consider when using a DBMS is: what part of the query can be implicitly handed over to the DBMS? If you use something SAS-specific, SAS will have to read all the data from the DBMS before applying conditions and functions.

---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
Solution
‎08-05-2016 09:11 AM
PROC Star
Posts: 1,759

Re: Coding Efficiency Where Clause

If the data is not in SAS the most important performance aspect is to ensure that only the necessary data is passed on to SAS.

So you must ensure that the where clasue you use will be passed on the the DBMS, not run in SAS.

The most universally supported SQL operator must be:

where DESCRIPTION like '%Test%'

 

 

 

Occasional Contributor
Posts: 19

Re: Coding Efficiency Where Clause

Using like instead of contains made it run in 2 minutes. Thanks for your help.

SAS Employee
Posts: 340

Re: Coding Efficiency Where Clause

I agree with previous replies: firs of all ensure, that the WHERE part of the query runs in the database (aka SQL pushdown, SQL pass-through).

SAS is able to pussh down the query if you put it into an SQL WHERE clause, or into a data step WHERE statement.

 

There are SAS options that you can use to track, what parts of the querz was pushed down. (I don't remember the name of the otion right now.)

 

I suppose you (or users) will want to run different queries using different words.

 

Even if the query runs in database, it is still very expensive. If you have a simpleindex built on that coulumn, it is still not much help, since the query could be optimized only if you search for the beginning of the column.

On the other hand there are thechnologies that support and accelerate this kind of query. Maybe your database supports it.

Or you have to swich to another database.

Keyword: inverted index

☑ This topic is solved.

Need further help from the community? Please ask a new question.

Discussion stats
  • 8 replies
  • 393 views
  • 5 likes
  • 6 in conversation