BookmarkSubscribeRSS Feed
Jarvin99
Obsidian | Level 7

I have a list of values to filter for a variable called codeID. Normally, what I will do is to use proc sql and say if codeID= 'a111' or 'a112' or 'a113'. However unlike the example with just 3 values, with 55 values to select, I am not sure if there is a faster way to do. The values are not ordinal, meaning that they can be 'S009' or 'a111' or 'S712'.

 

Thank you for your help.

8 REPLIES 8
AMSAS
SAS Super FREQ

Several ways you could do this

Merge

Hash object 

ballardw
Super User

Without an example of why you are applying that filter (which BTW your example is syntactically incorrect) it is hard to provide a best solution.

 

The IN operator is one way, just list all of the values.

 

if codeid in ('a111' 'a112' 'a113' 's009' 'S712') then <what ever>

Another approach might be to create a data set with the values, one per record, and Merge (data step) by the variable. This would require sorting both sets by the value of Codeid.

data want;
    merge have
               codelist (in=incodes)
   ;
   by codeid;
   if incodes then <do what ever>;
;

The (in = ) data set option creates a variable with the value of one when the current record has contribution from the set and 0 when not so can be used with the IF as SAS treats the 1 as "true".

A similar approach could be used with Proc SQL and a Join on the variable.

Jarvin99
Obsidian | Level 7
Sorry that I am a rookie for SAS. May I know why my example is syntactically incorrect? Why cannot I use 'if' but 'in'? I always cannot distinguish the difference, and the program seems to run okay for 'if' selection. Thank you, and you provide a great solution.
Reeza
Super User

The code would be similar to the following, not what you had posted:

 

if codeID= 'a111' or codeID='a112' or codeID= 'a113' then ....;

Which is equivalent to:

if codeID in ('a111'  'a112'  'a113') then ...;

@Jarvin99 wrote:
Sorry that I am a rookie for SAS. May I know why my example is syntactically incorrect? Why cannot I use 'if' but 'in'? I always cannot distinguish the difference, and the program seems to run okay for 'if' selection. Thank you, and you provide a great solution.

 

ballardw
Super User

@Jarvin99 wrote:
Sorry that I am a rookie for SAS. May I know why my example is syntactically incorrect? Why cannot I use 'if' but 'in'? I always cannot distinguish the difference, and the program seems to run okay for 'if' selection. Thank you, and you provide a great solution.

Let's look at a data step that assigns a value to the variable and then uses the logic you provided in the example. This shows the LOG

 

31   data example;
32      codeid='a112';
33      if codeID= 'a111' or 'a112' or 'a113' then put "Value found";
34      else put "Not found";
35   run;

NOTE: Character values have been converted to numeric values at the places given by:
      (Line):(Column).
      33:25   33:35
NOTE: Invalid numeric data, 'a112' , at line 33 column 25.
NOTE: Invalid numeric data, 'a113' , at line 33 column 35.
Not found
codeid=a112 _ERROR_=1 _N_=1
NOTE: The data set WORK.EXAMPLE has 1 observations and 1 variables.
NOTE: DATA statement used (Total process time):
      real time           0.01 seconds
      cpu time            0.01 seconds

 

I highlighted some text. SAS internally treats all logical comparisons as 1 or 0 for "True" or "False". So the values separated by logical operators such as OR , AND and NOT expect a numeric result. So "a112", which is character is not a numeric 1/0 value but the OR preceding tells SAS to expect such.

 

Note that the result of the logic is "Not found" even though the value of the variable was set to one of the values of interest other than the "codeid=" comparison.

 

You log should show similar issues.

 

Kurt_Bremser
Super User

First of all, you talk of SQL, and IF is not a valid keyword in SQL language. To subset observations in SQL, you need to use WHERE:

data have;
input codeid $4.;
datalines;
a111
a112
a113
a114
;

proc sql;
create table want1 as
  select *
  from have
  where codeid = "a111" or "a112" or "a113"
;
quit;

But then look at the log:

 78         proc sql;
 79         create table want1 as
 80           select *
 81           from have
 82           where codeid = "a111" or "a112" or "a113"
 83         ;
 NOTE: Table WORK.WANT1 created, with 4 rows and 1 columns.

No observation was filtered out, although "a114" is not in your condition.

The reason becomes clear when WHERE is used in a DATA step:

data want2;
set have;
where codeid = "a111" or "a112" or "a113";
run;

because now the log tells you this:

 69         data want2;
 70         set have;
 71         where codeid = "a111" or "a112" or "a113";
 72         run;
 
 NOTE: There were 4 observations read from the data set WORK.HAVE.
       WHERE 1 /* eine offensichtlich WAHRE Where-Bedingung (TRUE) */ ;
 NOTE: The data set WORK.WANT2 has 4 observations and 1 variables.

There's an obvious TRUE condition in your WHERE.

Why? The condition is equivalent to this:

where (codeid = "a111") or ("a112") or ("a113");

OR separates several conditions to form a compound condition, so the second and third value become conditions on their own. By definition, any non-missing character value evaluates to TRUE.

 

While WHERE in a DATA step is handed off to the dataset engine, which accepts SQL syntax, the IF is compiled by the data step compiler as data step code, and there conditions must be SAS Boolean values (numeric, zero or missing is FALSE, everything else is TRUE).

data want3;
set have;
if codeid = "a111" or "a112" or "a113";
run;

Log:

 69         data want3;
 70         set have;
 71         if codeid = "a111" or "a112" or "a113";
 72         run;
 
 NOTE: Character values have been converted to numeric values at the places given by: (Line):(Column).
       71:23   71:33   
 NOTE: Invalid numeric data, 'a112' , at Zeile 71 Spalte 23.
 NOTE: Invalid numeric data, 'a113' , at Zeile 71 Spalte 33.
 codeid=a112 _ERROR_=1 _N_=2
 NOTE: Invalid numeric data, 'a112' , at Zeile 71 Spalte 23.
 NOTE: Invalid numeric data, 'a113' , at Zeile 71 Spalte 33.
 codeid=a113 _ERROR_=1 _N_=3
 NOTE: Invalid numeric data, 'a112' , at Zeile 71 Spalte 23.
 NOTE: Invalid numeric data, 'a113' , at Zeile 71 Spalte 33.
 codeid=a114 _ERROR_=1 _N_=4
 NOTE: There were 4 observations read from the data set WORK.HAVE.
 NOTE: The data set WORK.WANT3 has 1 observations and 1 variables.

The data step compiler compiles an implicit conversion of character to numeric, but at runtime invalid numeric values are encountered, so the conversion results in missing (FALSE) values, and only when the first part of the condition is met, the whole condition is TRUE.

 

So this is the reason why you have to use the IN operator and a list of values in parentheses for your code to work as intended.

 

Now, anytime you need to work with lists, it is a good idea to store them in their own dataset and use that to make your selection, either by JOINing/MERGEing, by using a hash object, a format, or by creating dynamic code through CALL EXECUTE in a DATA step or SELECT INTO in SQL.

 

My personal favorite is the hash object:

data lookup;
input codeid $4.;
datalines;
a111
a112
a113
;

data want4;
set have;
if _n_ = 1
then do;
  declare hash l (dataset:"lookup");
  l.definekey("codeid");
  l.definedone();
end;
if l.check() = 0; /* zero means key was found */
run;

Log:

 77         data want4;
 78         set have;
 79         if _n_ = 1
 80         then do;
 81           declare hash l (dataset:"lookup");
 82           l.definekey("codeid");
 83           l.definedone();
 84         end;
 85         if l.check() = 0; /* zero means key was found */
 86         run;
 
 NOTE: There were 3 observations read from the data set WORK.LOOKUP.
 NOTE: There were 4 observations read from the data set WORK.HAVE.
 NOTE: The data set WORK.WANT4 has 3 observations and 1 variables.

The lookup table is sorted in memory into a b-tree, and no sorting has to be done before this step; the order of the "have" dataset is kept. This is the fastest method which SAS provides for lookup tasks. You can even combine an arbitrary number of lookups in one step, as long as the lookups fit into available memory.

 

japelin
Rhodochrosite | Level 12

 

If you want to get the filter from the data, you could put it in a macro variable or something, but if you want to simply code it would be faster to use the in operator rather than a bunch of if's and ='s and OR's.

 

 

where codeID in ('a111' 'a112' 'a113');

 

 

andreas_lds
Jade | Level 19

I would use a format:

proc format;
   invalue valid_key (upcase)
      'S009',
      'A111',
      'S712',
      'A123',
      'B100' = 1
      other = 0
   ;
run;


data work.filtered;
   set have;
   
   if input(key, valid_key.);
run;

If you need a case-sensitive compare, then remove the upcase option from proc format and write the keys as you expect them. If you want to ignore case of keys (s009 ne S009) insert all keys with upcase-letters only.

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 8 replies
  • 4160 views
  • 2 likes
  • 7 in conversation