I'd like to write a where statement as follows:
data test;
set data1-data60
where (var1 or var2 or var3 or var4) in (1,2,3,4)
essentially, what I'm trying to accomplish is to stack data1, ...., data60 and keep rows that has at least one of var1, var2, var3, var4 to have values equal to 1, 2, 3 and/or 4. Is that the correct syntax, and if not, what would the correct one be?
Thanks!
No the syntax is not correct.
You have to compare each variable with the list of values:
where var1 in (1, 2, 3, 4) or var2 in (1, 2, 3, 4) ....
Which is longer your list of variable names or your list of values? And how long are they.
If you are doing enough comparisons a where may not be the easiest code to follow or possibly even write. It may be that look for one value, output when found and move on.
An example
data want; set data1-data60; array vars (*) var1 - var60; array vals (10) _temporary_ (1, 3, 5, 6, 7,8,9,10,12,14); do i=1 to dim(vals); if whichn(vals[i], of vars(*)) > 0 then do; output; leave; end; end; drop i; run;
Whichn, and the companion Whichc for character, looks to see if the first parameter occurs in any of the others and returns the position number. This would be an alternate of IN if you have many fewer values than variables and want a WHERE. If there are more values than variables then IN would be a bit quicker but WHICHN will work with variables or literal values.
This uses the LEAVE instruction to stop the execution of the Do loop the first time the condition is true so only outputs one observation when a value matches.
I intentionally picked a number of variables large enough to cause one to think about how many OR conditions you may want on a Where (or If) .
I know your question concerns how to efficiently specify a filtering expression. On that subject, I'd advise using a WHERE expression if you expect to exclude a large share of your original data.
Yes, if you have a lot of values or variable in the filter, it might be a messy looking condition, compared to the IF then DO approach mentioned by @ballardw. But the WHERE expression outsources the filtering to the data engine, while the IF approach waits for the data to be received by the DATA step. If a large portion of the data is excluded, there can be a performance cost to the IF approach.
But there is another aspect of your sample code worth mentioning, given you are reading 60 datasets.
If all of your 60 datasets have the same variables, then you can save resources by using the OPEN=DEFER option on the set statement, as in:
set date1-date60 OPEN=defer;
Ordinarily the SET statement would open all 60 datasets at once, establishing 60 corresponding buffers in memory. But OPEN=DEFER tells SAS to defer opening data2 until data1 is exhausted, and similarly deferring opening the subsequent data sets. So instead of 60 buffers, SAS just re-uses the same buffer for each dataset in succession. It can save a lot of resources, especially with wide datasets.
Actually "all datasets have the same variables" is a more stringent than actually needed. Just be sure the first dataset listed in the SET statement has all the variables of interest. Subsequent datasets, under OPEN=DEFER, will have any variables they have in common with the first dataset kept, and all other variables ignored, with notes in the log.
Save $250 on SAS Innovate and get a free advance copy of the new SAS For Dummies book! Use the code "SASforDummies" to register. Don't miss out, May 6-9, in Orlando, Florida.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.