I'm trying to select all obs that have a particular value across a multiple variables in PROC SQL. I know it can be done with a do loop in a data step, but in this case I'm using proc sql because I'm trying to avoid doing a data step and then a proc print. The sample code below works, but I'm wondering if there is a more efficient way.
/*create data set*/
data a;
input x 1. y 1. z 1.;
datalines;
000
100
213
;
run;
/*list if any variable has a
value of 1*/
proc sql;
select *
from a
where x=1
or y=1
or z=1
;
quit;
A little less programming, you can do this with the WHICHN function.
proc sql;
select *
from a
where whichn(1,x,y,z);
quit;
Which fails to be less efficient if you have a lot of variables, and then you would probably want to go back to a DATA step.
(What is so terrible about PROC PRINT?)
A little less programming, you can do this with the WHICHN function.
proc sql;
select *
from a
where whichn(1,x,y,z);
quit;
Which fails to be less efficient if you have a lot of variables, and then you would probably want to go back to a DATA step.
(What is so terrible about PROC PRINT?)
Hey @Batman! This is one of those situations where efficiency is probably going to be minsiscule depending on which way you go. You can use whichn as @PaigeMiller showed, or you can add them to gether and only return rows where the sum is > 0. So, let's go ahead and test this. We'll create a decent-sized dataset and compare methods so we can gauge how it scales.
This dataset will have 500k rows with exactly 100 rows where all values of x are 1.
data have;
array x[100];
do i = 1 to 500000;
do j = 1 to dim(x);
if(50000 LE i LE 50100) then x[j]=1;
else x[j] = 0;
end;
output;
end;
drop i j;
run;
/* Helper function so we don't need to type x1, x2, x3, ... etc. */
%macro delimx(d);
%local i;
x1
%do i = 2 %to 100;
&d x&i
%end;
%mend;
%macro timeit;
%do i = 1 %to 100;
/* Test 1: Whichn */
%let start=%sysfunc(datetime());
proc sql noprint;
select *
from have
where whichn(1, %delimx(%str(,)))
;
quit;
%let time1 = %sysevalf(%sysfunc(datetime())-&start);
/* Test 2: Sum function */
%let start=%sysfunc(datetime());
proc sql noprint;
select *
from have
where sum(%delimx(%str(,))) > 0
;
quit;
%let time2 = %sysevalf(%sysfunc(datetime())-&start);
/* Test 3: Addition */
%let start=%sysfunc(datetime());
proc sql noprint;
select *
from have
where %delimx(+) > 0
;
quit;
%let time3 = %sysevalf(%sysfunc(datetime())-&start);
data time;
time1 = &time1;
time2 = &time2;
time3 = &time3;
run;
proc append base=times data=time;
run;
%end;
proc sql;
select mean(time1) as avg_time1 label='Avg: Whichn'
, std(time1) as std_time1 label='Std: Whichn'
, mean(time2) as avg_time2 label='Avg: Sum'
, std(time2) as std_time2 label='Std: Sum'
, mean(time3) as avg_time3 label='Avg: Add'
, std(time3) as std_time3 label='Std: Add'
from times;
quit;
proc datasets lib=work nolist;
delete times;
quit;
%mend;
%timeit;
On my machine, we can see that there is nearly no difference between using whichn and the sum function across 100 trials in a SAS dataset, but whichn had a higher standard deviation. Straight addition was the slowest. So, what's the most efficient way of these three when working with SAS datasets? Either whichn or the sum function - but depending on the size of your dataset, you may be looking at milliseconds of difference that you'll never notice.
Don’t miss the livestream kicking off May 7. It’s free. It’s easy. And it’s the best seat in the house.
Join us virtually with our complimentary SAS Innovate Digital Pass. Watch live or on-demand in multiple languages, with translations available to help you get the most out of every session.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.