I'm trying to select all obs that have a particular value across a multiple variables in PROC SQL. I know it can be done with a do loop in a data step, but in this case I'm using proc sql because I'm trying to avoid doing a data step and then a proc print. The sample code below works, but I'm wondering if there is a more efficient way.
/*create data set*/
data a;
input x 1. y 1. z 1.;
datalines;
000
100
213
;
run;
/*list if any variable has a
value of 1*/
proc sql;
select *
from a
where x=1
or y=1
or z=1
;
quit;
A little less programming, you can do this with the WHICHN function.
proc sql;
select *
from a
where whichn(1,x,y,z);
quit;
Which fails to be less efficient if you have a lot of variables, and then you would probably want to go back to a DATA step.
(What is so terrible about PROC PRINT?)
A little less programming, you can do this with the WHICHN function.
proc sql;
select *
from a
where whichn(1,x,y,z);
quit;
Which fails to be less efficient if you have a lot of variables, and then you would probably want to go back to a DATA step.
(What is so terrible about PROC PRINT?)
Hey @Batman! This is one of those situations where efficiency is probably going to be minsiscule depending on which way you go. You can use whichn as @PaigeMiller showed, or you can add them to gether and only return rows where the sum is > 0. So, let's go ahead and test this. We'll create a decent-sized dataset and compare methods so we can gauge how it scales.
This dataset will have 500k rows with exactly 100 rows where all values of x are 1.
data have;
array x[100];
do i = 1 to 500000;
do j = 1 to dim(x);
if(50000 LE i LE 50100) then x[j]=1;
else x[j] = 0;
end;
output;
end;
drop i j;
run;
/* Helper function so we don't need to type x1, x2, x3, ... etc. */
%macro delimx(d);
%local i;
x1
%do i = 2 %to 100;
&d x&i
%end;
%mend;
%macro timeit;
%do i = 1 %to 100;
/* Test 1: Whichn */
%let start=%sysfunc(datetime());
proc sql noprint;
select *
from have
where whichn(1, %delimx(%str(,)))
;
quit;
%let time1 = %sysevalf(%sysfunc(datetime())-&start);
/* Test 2: Sum function */
%let start=%sysfunc(datetime());
proc sql noprint;
select *
from have
where sum(%delimx(%str(,))) > 0
;
quit;
%let time2 = %sysevalf(%sysfunc(datetime())-&start);
/* Test 3: Addition */
%let start=%sysfunc(datetime());
proc sql noprint;
select *
from have
where %delimx(+) > 0
;
quit;
%let time3 = %sysevalf(%sysfunc(datetime())-&start);
data time;
time1 = &time1;
time2 = &time2;
time3 = &time3;
run;
proc append base=times data=time;
run;
%end;
proc sql;
select mean(time1) as avg_time1 label='Avg: Whichn'
, std(time1) as std_time1 label='Std: Whichn'
, mean(time2) as avg_time2 label='Avg: Sum'
, std(time2) as std_time2 label='Std: Sum'
, mean(time3) as avg_time3 label='Avg: Add'
, std(time3) as std_time3 label='Std: Add'
from times;
quit;
proc datasets lib=work nolist;
delete times;
quit;
%mend;
%timeit;
On my machine, we can see that there is nearly no difference between using whichn and the sum function across 100 trials in a SAS dataset, but whichn had a higher standard deviation. Straight addition was the slowest. So, what's the most efficient way of these three when working with SAS datasets? Either whichn or the sum function - but depending on the size of your dataset, you may be looking at milliseconds of difference that you'll never notice.
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.