Solved: Search for a value across a group of columns in SQL

Batman · Posted 11-26-2024 01:05 PM

I'm trying to select all obs that have a particular value across a multiple variables in PROC SQL. I know it can be done with a do loop in a data step, but in this case I'm using proc sql because I'm trying to avoid doing a data step and then a proc print. The sample code below works, but I'm wondering if there is a more efficient way.

/*create data set*/
data a;
input x 1. y 1. z 1.;
datalines;
000
100
213
;
run;

/*list if any variable has a
value of 1*/
proc sql;
select *
from a
where x=1
or y=1
or z=1
;
quit;

PaigeMiller · Posted 11-26-2024 01:09 PM

A little less programming, you can do this with the WHICHN function.

proc sql;
    select *
    from a
    where whichn(1,x,y,z);
quit;

Which fails to be less efficient if you have a lot of variables, and then you would probably want to go back to a DATA step.

(What is so terrible about PROC PRINT?)

--
Paige Miller

View solution in original post

PaigeMiller · Posted 11-26-2024 01:09 PM

A little less programming, you can do this with the WHICHN function.

proc sql;
    select *
    from a
    where whichn(1,x,y,z);
quit;

Which fails to be less efficient if you have a lot of variables, and then you would probably want to go back to a DATA step.

(What is so terrible about PROC PRINT?)

--
Paige Miller

Stu_SAS · Posted 11-26-2024 02:21 PM

Hey @Batman! This is one of those situations where efficiency is probably going to be minsiscule depending on which way you go. You can use whichn as @PaigeMiller showed, or you can add them to gether and only return rows where the sum is > 0. So, let's go ahead and test this. We'll create a decent-sized dataset and compare methods so we can gauge how it scales.

This dataset will have 500k rows with exactly 100 rows where all values of x are 1.

data have;
    array x[100];
    do i = 1 to 500000;
        do j = 1 to dim(x); 
            if(50000 LE i LE 50100) then x[j]=1;
                else x[j] = 0;
        end;
        output;
    end;
    drop i j;
run;

/* Helper function so we don't need to type x1, x2, x3, ... etc. */
%macro delimx(d);
    %local i;
    x1
    %do i = 2 %to 100;
        &d x&i
    %end;
%mend;

%macro timeit;
    %do i = 1 %to 100;

        /* Test 1: Whichn */
        %let start=%sysfunc(datetime());
            proc sql noprint;
                select *
                from have
                where whichn(1, %delimx(%str(,)))
                ;
            quit;
        %let time1 = %sysevalf(%sysfunc(datetime())-&start);

        /* Test 2: Sum function */
        %let start=%sysfunc(datetime());
            proc sql noprint;
                select *
                from have
                where sum(%delimx(%str(,))) > 0
                ;
            quit;
        %let time2 = %sysevalf(%sysfunc(datetime())-&start);

        /* Test 3: Addition */
        %let start=%sysfunc(datetime());
           proc sql noprint;
                select *
                from have
                where %delimx(+) > 0
                ;
            quit;
        %let time3 = %sysevalf(%sysfunc(datetime())-&start);

        data time;
            time1 = &time1;
            time2 = &time2;
            time3 = &time3;
        run;

        proc append base=times data=time;
        run;
    %end;

    proc sql;
        select mean(time1) as avg_time1 label='Avg: Whichn'
             , std(time1)  as std_time1 label='Std: Whichn'
             , mean(time2) as avg_time2 label='Avg: Sum'
             , std(time2)  as std_time2 label='Std: Sum'
             , mean(time3) as avg_time3 label='Avg: Add'
             , std(time3)  as std_time3 label='Std: Add'
        from times;
    quit;

    proc datasets lib=work nolist;
        delete times;
    quit;

%mend;

%timeit;

On my machine, we can see that there is nearly no difference between using whichn and the sum function across 100 trials in a SAS dataset, but whichn had a higher standard deviation. Straight addition was the slowest. So, what's the most efficient way of these three when working with SAS datasets? Either whichn or the sum function - but depending on the size of your dataset, you may be looking at milliseconds of difference that you'll never notice.

Follow me on LinkedIn for SAS tips with #SASTipTuesday
Check out my GitHub for SAS tips and tricks
Read my blog for cool SAS and open source data science projects

Search for a value across a group of columns in SQL

Re: Search for a value across a group of columns in SQL

Re: Search for a value across a group of columns in SQL

Re: Search for a value across a group of columns in SQL

Registration is open

Search for a value across a group of columns in SQL

Re: Search for a value across a group of columns in SQL

Re: Search for a value across a group of columns in SQL

Re: Search for a value across a group of columns in SQL

Registration is open

SAS Training: Just a Click Away