Data step where any of the variables has one of a list of values

sas_novice2 · Posted 09-27-2023 11:46 PM

I'd like to write a where statement as follows:

data test;

set data1-data60

where (var1 or var2 or var3 or var4) in (1,2,3,4)

essentially, what I'm trying to accomplish is to stack data1, ...., data60 and keep rows that has at least one of var1, var2, var3, var4 to have values equal to 1, 2, 3 and/or 4. Is that the correct syntax, and if not, what would the correct one be?

Thanks!

andreas_lds · Posted 09-28-2023 02:26 AM

No the syntax is not correct.

You have to compare each variable with the list of values:

where var1 in (1, 2, 3, 4) or var2 in (1, 2, 3, 4) ....

ballardw · Posted 09-28-2023 03:44 AM

Which is longer your list of variable names or your list of values? And how long are they.

If you are doing enough comparisons a where may not be the easiest code to follow or possibly even write. It may be that look for one value, output when found and move on.

An example

data want;
   set data1-data60;
   array vars (*) var1 - var60;
   array vals (10) _temporary_ (1, 3, 5, 6, 7,8,9,10,12,14);
   do i=1 to dim(vals);
       if whichn(vals[i], of vars(*)) > 0 then do;
          output;
          leave;
      end;
  end;
  drop i;
run;

Whichn, and the companion Whichc for character, looks to see if the first parameter occurs in any of the others and returns the position number. This would be an alternate of IN if you have many fewer values than variables and want a WHERE. If there are more values than variables then IN would be a bit quicker but WHICHN will work with variables or literal values.

This uses the LEAVE instruction to stop the execution of the Do loop the first time the condition is true so only outputs one observation when a value matches.

I intentionally picked a number of variables large enough to cause one to think about how many OR conditions you may want on a Where (or If) .

mkeintz · Posted 09-28-2023 10:46 AM

I know your question concerns how to efficiently specify a filtering expression. On that subject, I'd advise using a WHERE expression if you expect to exclude a large share of your original data.

Yes, if you have a lot of values or variable in the filter, it might be a messy looking condition, compared to the IF then DO approach mentioned by @ballardw. But the WHERE expression outsources the filtering to the data engine, while the IF approach waits for the data to be received by the DATA step. If a large portion of the data is excluded, there can be a performance cost to the IF approach.

But there is another aspect of your sample code worth mentioning, given you are reading 60 datasets.

If all of your 60 datasets have the same variables, then you can save resources by using the OPEN=DEFER option on the set statement, as in:

set date1-date60  OPEN=defer;

Ordinarily the SET statement would open all 60 datasets at once, establishing 60 corresponding buffers in memory. But OPEN=DEFER tells SAS to defer opening data2 until data1 is exhausted, and similarly deferring opening the subsequent data sets. So instead of 60 buffers, SAS just re-uses the same buffer for each dataset in succession. It can save a lot of resources, especially with wide datasets.

Actually "all datasets have the same variables" is a more stringent than actually needed. Just be sure the first dataset listed in the SET statement has all the variables of interest. Subsequent datasets, under OPEN=DEFER, will have any variables they have in common with the first dataset kept, and all other variables ignored, with notes in the log.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Data step where any of the variables has one of a list of values

Re: Data step where any of the variables has one of a list of values

Re: Data step where any of the variables has one of a list of values

Re: Data step where any of the variables has one of a list of values

Data step where any of the variables has one of a list of values

Re: Data step where any of the variables has one of a list of values

Re: Data step where any of the variables has one of a list of values

Re: Data step where any of the variables has one of a list of values

Registration is open

SAS Training: Just a Click Away