BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
Batman
Quartz | Level 8

I'm trying to select all obs that have a particular value across a multiple variables in PROC SQL.   I know it can be done with a do loop in a data step, but in this case I'm using proc sql because I'm trying to avoid doing a data step and then a proc print.   The sample code below works, but I'm wondering if there is a more efficient way.

 

/*create data set*/
data a;
input x 1. y 1. z 1.;
datalines;
000
100
213
;
run;

/*list if any variable has a
value of 1*/
proc sql;
select *
from a
where x=1
or y=1
or z=1
;
quit;

1 ACCEPTED SOLUTION

Accepted Solutions
PaigeMiller
Diamond | Level 26

A little less programming, you can do this with the WHICHN function.

 

proc sql;
    select *
    from a
    where whichn(1,x,y,z);
quit;

 

Which fails to be less efficient if you have a lot of variables, and then you would probably want to go back to a DATA step.

 

(What is so terrible about PROC PRINT?)

--
Paige Miller

View solution in original post

2 REPLIES 2
PaigeMiller
Diamond | Level 26

A little less programming, you can do this with the WHICHN function.

 

proc sql;
    select *
    from a
    where whichn(1,x,y,z);
quit;

 

Which fails to be less efficient if you have a lot of variables, and then you would probably want to go back to a DATA step.

 

(What is so terrible about PROC PRINT?)

--
Paige Miller
Stu_SAS
SAS Employee

Hey @Batman! This is one of those situations where efficiency is probably going to be minsiscule depending on which way you go. You can use whichn as @PaigeMiller showed, or you can add them to gether and only return rows where the sum is > 0.  So, let's go ahead and test this. We'll create a decent-sized dataset and compare methods so we can gauge how it scales.

 

This dataset will have 500k rows with exactly 100 rows where all values of x are 1.

data have;
    array x[100];
    do i = 1 to 500000;
        do j = 1 to dim(x); 
            if(50000 LE i LE 50100) then x[j]=1;
                else x[j] = 0;
        end;
        output;
    end;
    drop i j;
run;

/* Helper function so we don't need to type x1, x2, x3, ... etc. */
%macro delimx(d);
    %local i;
    x1
    %do i = 2 %to 100;
        &d x&i
    %end;
%mend;

%macro timeit;
    %do i = 1 %to 100;

        /* Test 1: Whichn */
        %let start=%sysfunc(datetime());
            proc sql noprint;
                select *
                from have
                where whichn(1, %delimx(%str(,)))
                ;
            quit;
        %let time1 = %sysevalf(%sysfunc(datetime())-&start);

        /* Test 2: Sum function */
        %let start=%sysfunc(datetime());
            proc sql noprint;
                select *
                from have
                where sum(%delimx(%str(,))) > 0
                ;
            quit;
        %let time2 = %sysevalf(%sysfunc(datetime())-&start);

        /* Test 3: Addition */
        %let start=%sysfunc(datetime());
           proc sql noprint;
                select *
                from have
                where %delimx(+) > 0
                ;
            quit;
        %let time3 = %sysevalf(%sysfunc(datetime())-&start);

        data time;
            time1 = &time1;
            time2 = &time2;
            time3 = &time3;
        run;

        proc append base=times data=time;
        run;
    %end;

    proc sql;
        select mean(time1) as avg_time1 label='Avg: Whichn'
             , std(time1)  as std_time1 label='Std: Whichn'
             , mean(time2) as avg_time2 label='Avg: Sum'
             , std(time2)  as std_time2 label='Std: Sum'
             , mean(time3) as avg_time3 label='Avg: Add'
             , std(time3)  as std_time3 label='Std: Add'
        from times;
    quit;

    proc datasets lib=work nolist;
        delete times;
    quit;

%mend;

%timeit;

Stu_SAS_1-1732648733266.png

 

On my machine, we can see that there is nearly no difference between using whichn and the sum function across 100 trials in a SAS dataset, but whichn had a higher standard deviation. Straight addition was the slowest. So, what's the most efficient way of these three when working with SAS datasets? Either whichn or the sum function - but depending on the size of your dataset, you may be looking at milliseconds of difference that you'll never notice.

sas-innovate-white.png

Our biggest data and AI event of the year.

Don’t miss the livestream kicking off May 7. It’s free. It’s easy. And it’s the best seat in the house.

Join us virtually with our complimentary SAS Innovate Digital Pass. Watch live or on-demand in multiple languages, with translations available to help you get the most out of every session.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 2 replies
  • 561 views
  • 0 likes
  • 3 in conversation