Solved: Re: replacement for "contains" operator for a dataset

Ivan555 · Posted 06-04-2019 12:17 PM

Hello!

I do have a code:

proc sql;
create table t1 (id int, name char(30));
insert into t1 
values (1, 'Peter')
values (2, 'Piper')
values (3, 'picked')
values (4, 'a peck)
values (5, 'of pickled')
values (6, 'peppers')
;
quit; 

proc sql;
create table t2 as select * from t1
where (t1.name contains 'Pet') OR (t1.name contains 'pic');
quit;

If I put the text that which I want to search in sourse in a separate dataset like

data inf;
length text $3; /*all values have the same length, this restriction is correct*/
input text;
datalines;
Pet
pic
;
run;

what will be the most appropriate way to find it in the t1 table?

(I want to get table equal t2 using inf.text)

I've found this way: (also using find(,,))

proc sql;
create table t2 as
select * from t1
where exists (select * from inf where index(t1.name, inf.text)<>0);
quit;

But I can't say - does it work in the right way?

It seems that this code is rather slow in my queries and I cannot understand - is there something wrong with it?

Tom · Posted 06-04-2019 04:25 PM

What are the relative sizes of the actual tables involved. Are they all SAS datasets? Or are some in external database? If so are both tables in the same external database? Can you push the query into the database using explicit pass thru SQL?

It looks like you want to do an INNER JOIN which should be pretty fast.

Make sure the include DISTINCT keyword to avoid replicating observations with multiple "hits".

proc sql;
create table t2 as 
  select distinct t1.* 
  from t1
  inner join INF 
    on t1.name contains inf.text
;
quit;

If values of INF.TEXT are not always exactly three characters long then you might need to worry about the trailing spaces that SAS will add.

View solution in original post

Ivan555 · Posted 06-04-2019 01:41 PM

I've found also one more choise:

proc sql;
create table t2 as
select * from t1;
inner join inf on t1.name like '%'||inf.text||'%';
quit;

still can't choose which one is preferred..

VDD · Posted 06-04-2019 01:49 PM

@Ivan555 this code below is slower because of he enter selection process

proc sql;
create table t2 as
select * from t1
where exists (select * from inf where index(t1.name, inf.text)<>0);
quit;

the manner of using a data step or a SQL statement can be mainly based on preformance, user needs and controll, and sometimes using a data step is easier to explain to management whom may have no clue how code works but are required to signoff on the process.

Tom · Posted 06-04-2019 04:25 PM

What are the relative sizes of the actual tables involved. Are they all SAS datasets? Or are some in external database? If so are both tables in the same external database? Can you push the query into the database using explicit pass thru SQL?

It looks like you want to do an INNER JOIN which should be pretty fast.

Make sure the include DISTINCT keyword to avoid replicating observations with multiple "hits".

proc sql;
create table t2 as 
  select distinct t1.* 
  from t1
  inner join INF 
    on t1.name contains inf.text
;
quit;

If values of INF.TEXT are not always exactly three characters long then you might need to worry about the trailing spaces that SAS will add.

Ivan555 · Posted 06-05-2019 12:51 PM

@VDD thank you much, understood

@Tom

> approximate size of t1-table is close to 10 mio rows, t2-table is a few dozens rows

> both tables are sas datasets

> about "distinct" - I agree, understood

> about "characters long" - I agree, understood

> about "the query into the database using explicit " I think I can't. Due to the fact that I am representing the side of the business, I can't see databases. Maybe in future I'll be able to ask our IT-division, but in moment the only people which I can ask for help with my code are here - that is the SAS Community 🙂

>your code seems works better than mine, thank you

replacement for "contains" operator for a dataset

Re: replacement for "contains" operator for a dataset

Re: replacement for "contains" operator for a dataset

Re: replacement for "contains" operator for a dataset

Re: replacement for "contains" operator for a dataset

Re: replacement for "contains" operator for a dataset

Registration is open

SAS Training: Just a Click Away