Hello,
I have a large dataset and I am looking to run a query on it using a variable which includes inputted comments by an operational based team. I want to keep all records which have a specific word within Description (for example everything with the word 'Test'). The word does not sit in a certain postion and I have chosen to use Contains within the Where clause. There are no other variables i can filter the data down further on.
This has created performance issues and the code takes a considerable amount of time to run. Is there a more efficient method then using the contains function? I have kept the minimum number of variables possible in my outputted dataset (see current code below).
PROC SQL;
CREATE TABLE WORK.QUERY AS
SELECT t1.Identifier,
t1.date,
t1.var,
t1.DESCRIPTION
FROM data.set t1
WHERE t1.DESCRIPTION CONTAINS 'Test';
QUIT;
If the data is not in SAS the most important performance aspect is to ensure that only the necessary data is passed on to SAS.
So you must ensure that the where clasue you use will be passed on the the DBMS, not run in SAS.
The most universally supported SQL operator must be:
where DESCRIPTION like '%Test%'
Try using Base SAS, SQL has certain resource issues when dealing with large data:
data want; set have (where=(index(description,"Test") keep=identifier date var description); run;
Alternatively you don't need to create a dataset, you can create a view - this is basically storing only the code to view results:
proc sql; create view WANT as select...; quit;
Just ran a test with two similar sized datasets (to avoid performance advantages caused by data residing in cache after the first read) and found that a straight SQL with contains is only marginally slower than a data step with a where= dataset option. And that may have been caused by different loads on the SAS server during the two steps. Call it "within statistical noise".
So, to dig further:
What is the (logical and physical) size of your input dataset?
How long does a simple data _null_ step take that just reads the dataset?
Where does your dataset reside (locally, remote, in a DBMS)?
Thanks for this.
I have run other steps off of this dataset. For example a SQL step using case whens and a simple where statement restricting on specific components of a different variable runs in around 10 minutes.
The SAS dataset is stored in a DBMS. I am not sure how to work out the size of it as it does not come up in contents and dataset procedures the number 9.0071993E15 came up when I looked for observations.
So the dataset is stored in a DBMS. As these often reside on other hosts, you may have to consider network bandwidth.
One other thing to consider when using a DBMS is: what part of the query can be implicitly handed over to the DBMS? If you use something SAS-specific, SAS will have to read all the data from the DBMS before applying conditions and functions.
If the data is not in SAS the most important performance aspect is to ensure that only the necessary data is passed on to SAS.
So you must ensure that the where clasue you use will be passed on the the DBMS, not run in SAS.
The most universally supported SQL operator must be:
where DESCRIPTION like '%Test%'
Using like instead of contains made it run in 2 minutes. Thanks for your help.
I agree with previous replies: firs of all ensure, that the WHERE part of the query runs in the database (aka SQL pushdown, SQL pass-through).
SAS is able to pussh down the query if you put it into an SQL WHERE clause, or into a data step WHERE statement.
There are SAS options that you can use to track, what parts of the querz was pushed down. (I don't remember the name of the otion right now.)
I suppose you (or users) will want to run different queries using different words.
Even if the query runs in database, it is still very expensive. If you have a simpleindex built on that coulumn, it is still not much help, since the query could be optimized only if you search for the beginning of the column.
On the other hand there are thechnologies that support and accelerate this kind of query. Maybe your database supports it.
Or you have to swich to another database.
Keyword: inverted index
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.