BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
Batman
Quartz | Level 8

I'm trying to select all obs that have a particular value across a multiple variables in PROC SQL.   I know it can be done with a do loop in a data step, but in this case I'm using proc sql because I'm trying to avoid doing a data step and then a proc print.   The sample code below works, but I'm wondering if there is a more efficient way.

 

/*create data set*/
data a;
input x 1. y 1. z 1.;
datalines;
000
100
213
;
run;

/*list if any variable has a
value of 1*/
proc sql;
select *
from a
where x=1
or y=1
or z=1
;
quit;

1 ACCEPTED SOLUTION

Accepted Solutions
PaigeMiller
Diamond | Level 26

A little less programming, you can do this with the WHICHN function.

 

proc sql;
    select *
    from a
    where whichn(1,x,y,z);
quit;

 

Which fails to be less efficient if you have a lot of variables, and then you would probably want to go back to a DATA step.

 

(What is so terrible about PROC PRINT?)

--
Paige Miller

View solution in original post

2 REPLIES 2
PaigeMiller
Diamond | Level 26

A little less programming, you can do this with the WHICHN function.

 

proc sql;
    select *
    from a
    where whichn(1,x,y,z);
quit;

 

Which fails to be less efficient if you have a lot of variables, and then you would probably want to go back to a DATA step.

 

(What is so terrible about PROC PRINT?)

--
Paige Miller
Stu_SAS
SAS Employee

Hey @Batman! This is one of those situations where efficiency is probably going to be minsiscule depending on which way you go. You can use whichn as @PaigeMiller showed, or you can add them to gether and only return rows where the sum is > 0.  So, let's go ahead and test this. We'll create a decent-sized dataset and compare methods so we can gauge how it scales.

 

This dataset will have 500k rows with exactly 100 rows where all values of x are 1.

data have;
    array x[100];
    do i = 1 to 500000;
        do j = 1 to dim(x); 
            if(50000 LE i LE 50100) then x[j]=1;
                else x[j] = 0;
        end;
        output;
    end;
    drop i j;
run;

/* Helper function so we don't need to type x1, x2, x3, ... etc. */
%macro delimx(d);
    %local i;
    x1
    %do i = 2 %to 100;
        &d x&i
    %end;
%mend;

%macro timeit;
    %do i = 1 %to 100;

        /* Test 1: Whichn */
        %let start=%sysfunc(datetime());
            proc sql noprint;
                select *
                from have
                where whichn(1, %delimx(%str(,)))
                ;
            quit;
        %let time1 = %sysevalf(%sysfunc(datetime())-&start);

        /* Test 2: Sum function */
        %let start=%sysfunc(datetime());
            proc sql noprint;
                select *
                from have
                where sum(%delimx(%str(,))) > 0
                ;
            quit;
        %let time2 = %sysevalf(%sysfunc(datetime())-&start);

        /* Test 3: Addition */
        %let start=%sysfunc(datetime());
           proc sql noprint;
                select *
                from have
                where %delimx(+) > 0
                ;
            quit;
        %let time3 = %sysevalf(%sysfunc(datetime())-&start);

        data time;
            time1 = &time1;
            time2 = &time2;
            time3 = &time3;
        run;

        proc append base=times data=time;
        run;
    %end;

    proc sql;
        select mean(time1) as avg_time1 label='Avg: Whichn'
             , std(time1)  as std_time1 label='Std: Whichn'
             , mean(time2) as avg_time2 label='Avg: Sum'
             , std(time2)  as std_time2 label='Std: Sum'
             , mean(time3) as avg_time3 label='Avg: Add'
             , std(time3)  as std_time3 label='Std: Add'
        from times;
    quit;

    proc datasets lib=work nolist;
        delete times;
    quit;

%mend;

%timeit;

Stu_SAS_1-1732648733266.png

 

On my machine, we can see that there is nearly no difference between using whichn and the sum function across 100 trials in a SAS dataset, but whichn had a higher standard deviation. Straight addition was the slowest. So, what's the most efficient way of these three when working with SAS datasets? Either whichn or the sum function - but depending on the size of your dataset, you may be looking at milliseconds of difference that you'll never notice.

SAS Innovate 2025: Register Now

Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 2 replies
  • 273 views
  • 0 likes
  • 3 in conversation