BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Angmar
Obsidian | Level 7

Hello,

I'm having a lot of difficulty with this. I'm not a very experienced SAS user and I'm using SAS EG 9.4 (not SQL).

I'm trying to identify observations with specific ICD-10 (e.g., T690) codes from large datasets. I've been using where statements with contains or like but they identify different numbers of records. Is one more accurate than the other? I can't really compare the new datasets because they are so large, so I'm not sure what's going on. Below are examples of my code:

(WHERE=(MAIN_PROBLEM CONTAINS 'S' OR MAIN_PROBLEM CONTAINS 'T0%' OR MAIN_PROBLEM CONTAINS 'T11%' OR MAIN_PROBLEM CONTAINS 'T12%'
OR MAIN_PROBLEM CONTAINS 'T13%' OR MAIN_PROBLEM CONTAINS 'T14%'));

 

and 

 

(WHERE=(MAIN_PROBLEM LIKE 'S' OR MAIN_PROBLEM LIKE 'T0%' OR MAIN_PROBLEM LIKE 'T11%' OR MAIN_PROBLEM LIKE 'T12%'
OR MAIN_PROBLEM LIKE 'T13%' OR MAIN_PROBLEM LIKE 'T14%'));

 

Any advice is greatly appreciated!

1 ACCEPTED SOLUTION

Accepted Solutions
Tom
Super User Tom
Super User

Using CONTAINS 'S' will find strings with capital S anywhere.  The way you coded LIKE 'S' will only find strings that are just the letter S.  It is the same as = 'S'.  To find strings that start with the letter S you would use LIKE 'S%'.

 

It is probably better to not use the CONTAINS operator.  If you have any messy values, such as comments, in the variable you might get some false positives.  It is better to test that the codes begin with the required characters.  For that you can use with the LIKE operator or the IN operator with the colon modifier.  The advantage of using the IN: operator is you do not have to re-type the variable name for each string you want to test for.

 

View solution in original post

13 REPLIES 13
Tom
Super User Tom
Super User

First let's recode your WHERE= dataset options as WHERE statements instead.  They are just easier to deal with since you don't need to add those extra parentheses.  Plus you can augment your WHERE statement with WHERE ALSO statements.

/* Version 1 */
WHERE
    MAIN_PROBLEM CONTAINS 'S' 
 OR MAIN_PROBLEM CONTAINS 'T0%'
 OR MAIN_PROBLEM CONTAINS 'T11%'
 OR MAIN_PROBLEM CONTAINS 'T12%'
 OR MAIN_PROBLEM CONTAINS 'T13%'
 OR MAIN_PROBLEM CONTAINS 'T14%'
;

/* Version 2 */ 
WHERE
    MAIN_PROBLEM LIKE 'S'
 OR MAIN_PROBLEM LIKE 'T0%'
 OR MAIN_PROBLEM LIKE 'T11%'
 OR MAIN_PROBLEM LIKE 'T12%'
 OR MAIN_PROBLEM LIKE 'T13%'
 OR MAIN_PROBLEM LIKE 'T14%'
; 

So version 1 is checking whether the variable contains the string anywhere in the variable, plus it is literally testing if the variable contains the percent sign character.

Version 2 is testing if the variable begins with one of those strings.

Since you are looking at ICD10 codes I would assume that Version 2 is more likely what you want.  But I don't think there is any ICD10 code that is just the letter S.  Is that some local convention your dataset is using?

Here is an easier way to do version 2 in SAS code.

WHERE
    MAIN_PROBLEM = 'S' OR MAIN_PROBLEM in: ('T0' 'T11' 'T12' 'T13' 'T14')
; 

 

Angmar
Obsidian | Level 7
Hi, and thank you for taking the time to answer. However, when I look at my dataset created using contains and the %, the dataset contains records, for example, T11-T119, as though the % is functioning as a wildcard like it should with 'like'. Using 'contains' + % finds more records than just using 'contains' or 'like' + %.
:S :S
Tom
Super User Tom
Super User

Please share the actual code you ran.  Make sure to show what dataset you are reading from.  Is it using a libref that is pointing to some foreign database?  SAS will frequently push WHERE conditions into the the remote database to handle. 

 

Also show the SAS log.  I know that with the WHERE statement SAS will echo the way it might have re-written the where condition.  For example see this log:

39    proc print data=sashelp.class ;
40     where name =: 'A';
41    run;

NOTE: There were 2 observations read from the data set SASHELP.CLASS.
      WHERE name=:'A';
NOTE: PROCEDURE PRINT used (Total process time):
      real time           0.14 seconds
      cpu time            0.06 seconds


42    proc print data=sashelp.class ;
43     where name =: 'A' or 1=1;
44    run;

NOTE: There were 19 observations read from the data set SASHELP.CLASS.
      WHERE 1 /* an obviously TRUE WHERE clause */ ;
NOTE: PROCEDURE PRINT used (Total process time):
      real time           0.02 seconds
      cpu time            0.01 seconds

Show some examples of values that are selected by one and not the other.  See if you can re-create the program using just those values.

Angmar
Obsidian | Level 7
Hello, I'm receiving similar responses, so:
When I look at my dataset created using contains and the %, the dataset contains records, for example, T11-T119, as though the % is functioning as a wildcard like it should with 'like'. Using 'contains' + % finds more records than just using 'contains' or 'like' + %.
:S :S
Tom
Super User Tom
Super User

Adding the % into the text definitely changes how SAS interprets the CONTAINS operation.

 

Example:

66    proc print data=sashelp.class ;
67     where name contains 'A%' ;
68    run;

NOTE: No observations were selected from data set SASHELP.CLASS.
NOTE: There were 0 observations read from the data set SASHELP.CLASS.
      WHERE name contains 'A%';
NOTE: PROCEDURE PRINT used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds


69    proc print data=sashelp.class ;
70     where name contains 'A' ;
71    run;

NOTE: There were 2 observations read from the data set SASHELP.CLASS.
      WHERE name contains 'A';
NOTE: PROCEDURE PRINT used (Total process time):
      real time           0.03 seconds
      cpu time            0.03 seconds


72    proc print data=sashelp.class ;
73     where upcase(name) contains 'A' ;
74    run;

NOTE: There were 11 observations read from the data set SASHELP.CLASS.
      WHERE UPCASE(name) contains 'A';
NOTE: PROCEDURE PRINT used (Total process time):
      real time           0.01 seconds
      cpu time            0.01 seconds
Kurt_Bremser
Super User

With CONTAINS, the % is part of the search string, not a wildcard:

data have;
input code $;
datalines;
T1
T2
T119
S1
;

data want1;
set have;
where code contains 'T%';
run;

proc sql;
create table want2 as
  select *
  from have
  where code contains 'T%'
;
quit;

Log:

 72         
 73         data have;
 74         input code $;
 75         datalines;
 
 NOTE: The data set WORK.HAVE has 4 observations and 1 variables.
 NOTE:  Verwendet wurde: DATA statement - (Gesamtverarbeitungszeit):
       real time           0.00 seconds
       cpu time            0.00 seconds
       
 
 80         ;
 81         
 82         data want1;
 83         set have;
 84         where code contains 'T%';
 85         run;
 
 NOTE: There were 0 observations read from the data set WORK.HAVE.
       WHERE code contains 'T%';
 NOTE: The data set WORK.WANT1 has 0 observations and 1 variables.
 NOTE:  Verwendet wurde: DATA statement - (Gesamtverarbeitungszeit):
       real time           0.00 seconds
       cpu time            0.01 seconds

The CONTAINS operator does not match patterns, like the LIKE operator does.

 

When I do not use the %, the condition finds all entries starting with the letter:

 73         data want3;
 74         set have;
 75         where code contains "T";
 76         run;
 
 NOTE: There were 3 observations read from the data set WORK.HAVE.
       WHERE code contains 'T';
 NOTE: The data set WORK.WANT3 has 3 observations and 1 variables.
 NOTE:  Verwendet wurde: DATA statement - (Gesamtverarbeitungszeit):
       real time           0.00 seconds
       cpu time            0.01 seconds
Angmar
Obsidian | Level 7

Thanks @Tom ,

You're right, there are no ICD-10 codes that are just 'S' - I need to identify all records beginning with 'S' or 'T33', so I thought using 'contains' would find all responses beginning with 'S' and include them. The ICD-10 variable is 4 characters long, so I was trying to avoid typing hundreds of ICD-10 codes (for example, 'S445' 'S446' 'T331' 'T332' 'T333', etc).

Would the 'in' command be the best way to do so?

Shmuel
Garnet | Level 18

If you are looking how " to identify all records beginning with 'S' or 'T33' " you can use the SUBSTR() function:

where substr(icd10,1,1) = 'S' or substr(icd10,1,3)='T33';
Tom
Super User Tom
Super User

Using CONTAINS 'S' will find strings with capital S anywhere.  The way you coded LIKE 'S' will only find strings that are just the letter S.  It is the same as = 'S'.  To find strings that start with the letter S you would use LIKE 'S%'.

 

It is probably better to not use the CONTAINS operator.  If you have any messy values, such as comments, in the variable you might get some false positives.  It is better to test that the codes begin with the required characters.  For that you can use with the LIKE operator or the IN operator with the colon modifier.  The advantage of using the IN: operator is you do not have to re-type the variable name for each string you want to test for.

 

KathyKiraly
SAS Employee

The LIKE operator accepts the % wildcard, but the CONTAINS does not. 

 

where Main_Problem contains 'T0%' means you are specifically looking for the character string T0% in the value. This would return T0%, AT0%, LT0%2g, etc.

 

where Main_Program like 'T0%' means you are looking for all values that start with T0 with zero or more characters following. This would return T0A, T032, T0ALL, etc.

Shmuel
Garnet | Level 18

Pay attention that both CONTAINS and LIKE are case sensitive.

Check difference behavior as shown in next code:

data have;
  input str $;
cards;
S
s
so
USA
usage ; run; data check; set have(where=(str like 's')); run; data check; set have(where=(str contains 's')); run;
 

 

Tom
Super User Tom
Super User

@Shmuel wrote:

Pay attention that CONTAINS is case sensitive while LIKE is not, as shown in next code:

data have;
  input str $;
cards;
S
s
so
USA
;
run;
data check;
 set have(where=(str like 's'));
run;
data check;
 set have(where=(str contains 's'));
run;

Not with SAS code.

Perhaps you are used to using some other database? I know that Teradata has a case insensitive mode.  But even there I don't think it changes based on what operator you use.  Some database systems do have a LIKE and an ILIKE operator, the later being the case insensitive version.

Shmuel
Garnet | Level 18

@Tom I have rechecked my code and results and edited my post.

SAS Innovate 2025: Call for Content

Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!

Submit your idea!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 13 replies
  • 7548 views
  • 1 like
  • 5 in conversation