Hello,
I'm having a lot of difficulty with this. I'm not a very experienced SAS user and I'm using SAS EG 9.4 (not SQL).
I'm trying to identify observations with specific ICD-10 (e.g., T690) codes from large datasets. I've been using where statements with contains or like but they identify different numbers of records. Is one more accurate than the other? I can't really compare the new datasets because they are so large, so I'm not sure what's going on. Below are examples of my code:
(WHERE=(MAIN_PROBLEM CONTAINS 'S' OR MAIN_PROBLEM CONTAINS 'T0%' OR MAIN_PROBLEM CONTAINS 'T11%' OR MAIN_PROBLEM CONTAINS 'T12%' 
OR MAIN_PROBLEM CONTAINS 'T13%' OR MAIN_PROBLEM CONTAINS 'T14%'));
and
(WHERE=(MAIN_PROBLEM LIKE 'S' OR MAIN_PROBLEM LIKE 'T0%' OR MAIN_PROBLEM LIKE 'T11%' OR MAIN_PROBLEM LIKE 'T12%' 
OR MAIN_PROBLEM LIKE 'T13%' OR MAIN_PROBLEM LIKE 'T14%'));
Any advice is greatly appreciated!
Using CONTAINS 'S' will find strings with capital S anywhere. The way you coded LIKE 'S' will only find strings that are just the letter S. It is the same as = 'S'. To find strings that start with the letter S you would use LIKE 'S%'.
It is probably better to not use the CONTAINS operator. If you have any messy values, such as comments, in the variable you might get some false positives. It is better to test that the codes begin with the required characters. For that you can use with the LIKE operator or the IN operator with the colon modifier. The advantage of using the IN: operator is you do not have to re-type the variable name for each string you want to test for.
First let's recode your WHERE= dataset options as WHERE statements instead. They are just easier to deal with since you don't need to add those extra parentheses. Plus you can augment your WHERE statement with WHERE ALSO statements.
/* Version 1 */
WHERE
    MAIN_PROBLEM CONTAINS 'S' 
 OR MAIN_PROBLEM CONTAINS 'T0%'
 OR MAIN_PROBLEM CONTAINS 'T11%'
 OR MAIN_PROBLEM CONTAINS 'T12%'
 OR MAIN_PROBLEM CONTAINS 'T13%'
 OR MAIN_PROBLEM CONTAINS 'T14%'
;
/* Version 2 */ 
WHERE
    MAIN_PROBLEM LIKE 'S'
 OR MAIN_PROBLEM LIKE 'T0%'
 OR MAIN_PROBLEM LIKE 'T11%'
 OR MAIN_PROBLEM LIKE 'T12%'
 OR MAIN_PROBLEM LIKE 'T13%'
 OR MAIN_PROBLEM LIKE 'T14%'
; So version 1 is checking whether the variable contains the string anywhere in the variable, plus it is literally testing if the variable contains the percent sign character.
Version 2 is testing if the variable begins with one of those strings.
Since you are looking at ICD10 codes I would assume that Version 2 is more likely what you want. But I don't think there is any ICD10 code that is just the letter S. Is that some local convention your dataset is using?
Here is an easier way to do version 2 in SAS code.
WHERE
    MAIN_PROBLEM = 'S' OR MAIN_PROBLEM in: ('T0' 'T11' 'T12' 'T13' 'T14')
; 
Please share the actual code you ran. Make sure to show what dataset you are reading from. Is it using a libref that is pointing to some foreign database? SAS will frequently push WHERE conditions into the the remote database to handle.
Also show the SAS log. I know that with the WHERE statement SAS will echo the way it might have re-written the where condition. For example see this log:
39    proc print data=sashelp.class ;
40     where name =: 'A';
41    run;
NOTE: There were 2 observations read from the data set SASHELP.CLASS.
      WHERE name=:'A';
NOTE: PROCEDURE PRINT used (Total process time):
      real time           0.14 seconds
      cpu time            0.06 seconds
42    proc print data=sashelp.class ;
43     where name =: 'A' or 1=1;
44    run;
NOTE: There were 19 observations read from the data set SASHELP.CLASS.
      WHERE 1 /* an obviously TRUE WHERE clause */ ;
NOTE: PROCEDURE PRINT used (Total process time):
      real time           0.02 seconds
      cpu time            0.01 seconds
Show some examples of values that are selected by one and not the other. See if you can re-create the program using just those values.
Adding the % into the text definitely changes how SAS interprets the CONTAINS operation.
Example:
66    proc print data=sashelp.class ;
67     where name contains 'A%' ;
68    run;
NOTE: No observations were selected from data set SASHELP.CLASS.
NOTE: There were 0 observations read from the data set SASHELP.CLASS.
      WHERE name contains 'A%';
NOTE: PROCEDURE PRINT used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds
69    proc print data=sashelp.class ;
70     where name contains 'A' ;
71    run;
NOTE: There were 2 observations read from the data set SASHELP.CLASS.
      WHERE name contains 'A';
NOTE: PROCEDURE PRINT used (Total process time):
      real time           0.03 seconds
      cpu time            0.03 seconds
72    proc print data=sashelp.class ;
73     where upcase(name) contains 'A' ;
74    run;
NOTE: There were 11 observations read from the data set SASHELP.CLASS.
      WHERE UPCASE(name) contains 'A';
NOTE: PROCEDURE PRINT used (Total process time):
      real time           0.01 seconds
      cpu time            0.01 seconds
					
				
			
			
				
			
			
			
			
			
			
			
		With CONTAINS, the % is part of the search string, not a wildcard:
data have;
input code $;
datalines;
T1
T2
T119
S1
;
data want1;
set have;
where code contains 'T%';
run;
proc sql;
create table want2 as
  select *
  from have
  where code contains 'T%'
;
quit;Log:
 72         
 73         data have;
 74         input code $;
 75         datalines;
 
 NOTE: The data set WORK.HAVE has 4 observations and 1 variables.
 NOTE:  Verwendet wurde: DATA statement - (Gesamtverarbeitungszeit):
       real time           0.00 seconds
       cpu time            0.00 seconds
       
 
 80         ;
 81         
 82         data want1;
 83         set have;
 84         where code contains 'T%';
 85         run;
 
 NOTE: There were 0 observations read from the data set WORK.HAVE.
       WHERE code contains 'T%';
 NOTE: The data set WORK.WANT1 has 0 observations and 1 variables.
 NOTE:  Verwendet wurde: DATA statement - (Gesamtverarbeitungszeit):
       real time           0.00 seconds
       cpu time            0.01 seconds
The CONTAINS operator does not match patterns, like the LIKE operator does.
When I do not use the %, the condition finds all entries starting with the letter:
 73         data want3;
 74         set have;
 75         where code contains "T";
 76         run;
 
 NOTE: There were 3 observations read from the data set WORK.HAVE.
       WHERE code contains 'T';
 NOTE: The data set WORK.WANT3 has 3 observations and 1 variables.
 NOTE:  Verwendet wurde: DATA statement - (Gesamtverarbeitungszeit):
       real time           0.00 seconds
       cpu time            0.01 seconds
					
				
			
			
				
			
			
			
			
			
			
			
		Thanks @Tom ,
You're right, there are no ICD-10 codes that are just 'S' - I need to identify all records beginning with 'S' or 'T33', so I thought using 'contains' would find all responses beginning with 'S' and include them. The ICD-10 variable is 4 characters long, so I was trying to avoid typing hundreds of ICD-10 codes (for example, 'S445' 'S446' 'T331' 'T332' 'T333', etc).
Would the 'in' command be the best way to do so?
If you are looking how " to identify all records beginning with 'S' or 'T33' " you can use the SUBSTR() function:
where substr(icd10,1,1) = 'S' or substr(icd10,1,3)='T33';Using CONTAINS 'S' will find strings with capital S anywhere. The way you coded LIKE 'S' will only find strings that are just the letter S. It is the same as = 'S'. To find strings that start with the letter S you would use LIKE 'S%'.
It is probably better to not use the CONTAINS operator. If you have any messy values, such as comments, in the variable you might get some false positives. It is better to test that the codes begin with the required characters. For that you can use with the LIKE operator or the IN operator with the colon modifier. The advantage of using the IN: operator is you do not have to re-type the variable name for each string you want to test for.
The LIKE operator accepts the % wildcard, but the CONTAINS does not.
where Main_Problem contains 'T0%' means you are specifically looking for the character string T0% in the value. This would return T0%, AT0%, LT0%2g, etc.
where Main_Program like 'T0%' means you are looking for all values that start with T0 with zero or more characters following. This would return T0A, T032, T0ALL, etc.
Pay attention that both CONTAINS and LIKE are case sensitive.
Check difference behavior as shown in next code:
data have;
  input str $;
cards;
S
s
so
USA
usage
;
run;
data check;
 set have(where=(str like 's'));
run;
data check;
 set have(where=(str contains 's'));
run; 
@Shmuel wrote:
Pay attention that CONTAINS is case sensitive while LIKE is not, as shown in next code:
data have; input str $; cards; S s so USA ; run; data check; set have(where=(str like 's')); run; data check; set have(where=(str contains 's')); run;
Not with SAS code.
Perhaps you are used to using some other database? I know that Teradata has a case insensitive mode. But even there I don't think it changes based on what operator you use. Some database systems do have a LIKE and an ILIKE operator, the later being the case insensitive version.
@Tom I have rechecked my code and results and edited my post.
It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.
