BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
mk131190
Obsidian | Level 7

Hello,

 

I have a large dataset and I am looking to run a query on it using a variable which includes inputted comments by an operational based team. I want to keep all records which have a specific word within Description (for example everything with the word 'Test'). The word does not sit in a certain postion and I have chosen to use Contains within the Where clause. There are no other variables i can filter the data down further on.

This has created performance issues and the code takes a considerable amount of time to run. Is there a more efficient method then using the contains function? I have kept the minimum number of variables possible in my outputted dataset (see current code below). 

 

PROC SQL;
CREATE TABLE WORK.QUERY AS
SELECT t1.Identifier,
t1.date,
t1.var,
t1.DESCRIPTION
FROM data.set t1
WHERE t1.DESCRIPTION CONTAINS 'Test';
QUIT;

 

1 ACCEPTED SOLUTION

Accepted Solutions
ChrisNZ
Tourmaline | Level 20

If the data is not in SAS the most important performance aspect is to ensure that only the necessary data is passed on to SAS.

So you must ensure that the where clasue you use will be passed on the the DBMS, not run in SAS.

The most universally supported SQL operator must be:

where DESCRIPTION like '%Test%'

 

 

 

View solution in original post

8 REPLIES 8
RW9
Diamond | Level 26 RW9
Diamond | Level 26

Try using Base SAS, SQL has certain resource issues when dealing with large data:

data want;
  set have (where=(index(description,"Test") keep=identifier date var description);
run;

Alternatively you don't need to create a dataset, you can create a view - this is basically storing only the code to view results:

proc sql;
  create view WANT as
  select...;
quit;
LinusH
Tourmaline | Level 20
A alternative to CONTAINS is the scan function, but I would be surprised if that would be much more efficient.
Unlike @RW9 I don't think that a straight SQL with selct-from-where is any less efficient than a data step, but it doesn't hurt to compare.
Are you sure that the CONTAINS operation is the resource monster? Adding
Options fullstimer;
will hint you.
Data never sleeps
Kurt_Bremser
Super User

Just ran a test with two similar sized datasets (to avoid performance advantages caused by data residing in cache after the first read) and found that a straight SQL with contains is only marginally slower than a data step with a where= dataset option. And that may have been caused by different loads on the SAS server during the two steps. Call it "within statistical noise".

 

So, to dig further:

What is the (logical and physical) size of your input dataset?

How long does a simple data _null_ step take that just reads the dataset?

Where does your dataset reside (locally, remote, in a DBMS)?

 

mk131190
Obsidian | Level 7

Thanks for this.

 

I have run other steps off of this dataset. For example a SQL step using case whens and a simple where statement restricting on specific components of a different variable runs in around 10 minutes.

 

The SAS dataset is stored in a DBMS. I am not sure how to work out the size of it as it does not come up in contents and dataset procedures the number 9.0071993E15 came up when I looked for observations.

Kurt_Bremser
Super User

So the dataset is stored in a DBMS. As these often reside on other hosts, you may have to consider network bandwidth.

One other thing to consider when using a DBMS is: what part of the query can be implicitly handed over to the DBMS? If you use something SAS-specific, SAS will have to read all the data from the DBMS before applying conditions and functions.

ChrisNZ
Tourmaline | Level 20

If the data is not in SAS the most important performance aspect is to ensure that only the necessary data is passed on to SAS.

So you must ensure that the where clasue you use will be passed on the the DBMS, not run in SAS.

The most universally supported SQL operator must be:

where DESCRIPTION like '%Test%'

 

 

 

mk131190
Obsidian | Level 7

Using like instead of contains made it run in 2 minutes. Thanks for your help.

gergely_batho
SAS Employee

I agree with previous replies: firs of all ensure, that the WHERE part of the query runs in the database (aka SQL pushdown, SQL pass-through).

SAS is able to pussh down the query if you put it into an SQL WHERE clause, or into a data step WHERE statement.

 

There are SAS options that you can use to track, what parts of the querz was pushed down. (I don't remember the name of the otion right now.)

 

I suppose you (or users) will want to run different queries using different words.

 

Even if the query runs in database, it is still very expensive. If you have a simpleindex built on that coulumn, it is still not much help, since the query could be optimized only if you search for the beginning of the column.

On the other hand there are thechnologies that support and accelerate this kind of query. Maybe your database supports it.

Or you have to swich to another database.

Keyword: inverted index

sas-innovate-2024.png

Don't miss out on SAS Innovate - Register now for the FREE Livestream!

Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 8 replies
  • 1658 views
  • 5 likes
  • 6 in conversation