DATA Step, Macro, Functions and more

Detecting unique values across variables

Reply
Valued Guide
Posts: 632

Detecting unique values across variables

I have a list of numeric variables (always integers) and would like to determine when they have distinct values. In the following code the three variables are I, J, K, but in reality there can be a large number of variables. Thoughts on an easier way to replace the [pre]i ne j & i ne k & j ne k[/pre]

[pre]data alldiff;
do i = 1 to 4;
do j = 1 to 4;
do k = 1 to 4;
if i ne j & i ne k & j ne k then output alldiff;
end;end;end;
run;
proc print data=alldiff;
run;[/pre]
Regular Contributor
Posts: 241

Re: Detecting unique values across variables

Here is one way. In the worst case, you will make 0.5*n*(n-1) comparisons per obs.



   /* test data */


   data one;


     do i = 1 to 4;


       do j = 1 to 4;


         do k = 1 to 4;


           output;


         end;


       end;


     end;


   run;


 


   /* flag if all ijk values are different */


   data two;


     set one;


     array var(*) i j k;


     unique = 1


     drop p q;


     do p = 1 to dim(var)-1 while(unique);


        do q = p + 1 to dim(var) while(unique);


           unique = var(p)^=var(q);


        end;


     end


   run;


 


   /* check */


   proc print data=two(obs=10);


   run;


   /* on lst


   Obs    i    j    k    unique


 


     1    1    1    1       0


     2    1    1    2       0


     3    1    1    3       0


     4    1    1    4       0


     5    1    2    1       0


     6    1    2    2       0


     7    1    2    3       1


     8    1    2    4       1


     9    1    3    1       0


    10    1    3    2       1


   */ 

Respected Advisor
Posts: 3,777

Re: Detecting unique values across variables

I don't think this is really any different from Chang's program just "more statements" :-)

I don't unstand why I had to define a data item for the hash in order to use the ADD method syntax with the KEY argument, h.add(key:v,data:' ')
It works without DEFINEDATA and the DATA argument but you have assign Y and then call the ADD method, y=v; rc=h.add(), but h.add(key:v) does not.

[pre]
data alldiff;
length y 8 d $1;
call missing(y,d);
declare hash h();
h.definekey('y');
h.definedata('d');
h.definedone();
do i = 1 to 4;
do j = 1 to 4;
do k = 1 to 4;
h.clear();
array v
  • i j k;
    do l=1 to dim(v) until(h.add(key:v,data:' ') gt 0);
    end;
    unique = l eq dim(v)+1;
    output;
    end;
    end;
    end;
    drop y d l;
    run;
    [/pre]
  • Valued Guide
    Posts: 632

    Re: Detecting unique values across variables

    Thank you Chang and Data _NULL_. I was afraid (and hoping) that I might have been missing something like a uniquevarvalues function. For either solution there is quite a bit of checking that needs to take place. It will be interesting to see how the hash solution scales for larger data sets and numbers of vars.
    Thanks
    Art
    Regular Contributor
    Posts: 171

    Re: Detecting unique values across variables

    I was also curious to see how the hash solution compares for larger datasets with more variables. I ran a test on a dataset with 1000000 observations and 20 variables. Also, I tried testing a third solution which uses a simple macro to create a single statement in the data step. Here are the results.

    [pre]
    1 data test;
    2 array var{20};
    3 do i = 1 to 1000000;
    4 do j=1 to dim(var);
    5 var{j} = floor(ranuni(0)*300);
    6 end;
    7 output;
    8 end;
    9 drop i j;
    10 run;

    NOTE: The data set WORK.TEST has 1000000 observations and 20 variables.
    NOTE: DATA statement used (Total process time):
    real time 3.57 seconds
    cpu time 3.57 seconds


    11
    12 data chang;
    13 set test;
    14 array var(*) var:;
    NOTE: The array var has the same name as a SAS-supplied or user-defined function. Parentheses
    following this name are treated as array references and not function references.
    15 unique = 1;
    16 drop p q;
    17 do p = 1 to dim(var)-1 while(unique);
    18 do q = p + 1 to dim(var) while(unique);
    19 unique = var(p)^=var(q);
    20 end;
    21 end;
    22 run;

    NOTE: There were 1000000 observations read from the data set WORK.TEST.
    NOTE: The data set WORK.CHANG has 1000000 observations and 21 variables.
    NOTE: DATA statement used (Total process time):
    real time 24.56 seconds
    cpu time 25.00 seconds


    23
    24 data data_null_;
    25 length y 8 d $1;
    26 if _n_ = 1 then do;
    27 call missing(y,d);
    28 declare hash h();
    29 h.definekey('y');
    30 h.definedata('d');
    31 h.definedone();
    32 end;
    33 set test;
    34 h.clear();
    35 array v
  • var:;
    36 do l=1 to dim(v) until(h.add(key:v,data:' ') gt 0);
    37 end;
    38 unique = l eq dim(v)+1;
    39 output;
    40 drop y d l;
    41 run;

    NOTE: There were 1000000 observations read from the data set WORK.TEST.
    NOTE: The data set WORK.DATA_NULL_ has 1000000 observations and 21 variables.
    NOTE: DATA statement used (Total process time):
    real time 36.87 seconds
    cpu time 37.23 seconds


    42
    43 %macro ne(array, elements);
    44 %do i = 1 %to %eval(&elements-2);
    45 %do j = %eval(&i+1) %to &elements;
    46 (&array[&i] ne &array[&j]) &
    47 %end;
    48 %end;
    49 (&array[%eval(&elements-1)] ne &array[&elements])
    50 %mend;
    51
    52 data _null_;
    53 if 0 then set test;
    54 array var
  • var:;
    55 call symputx('num_vars', dim(var));
    56 run;

    NOTE: DATA STEP stopped due to looping.
    NOTE: DATA statement used (Total process time):
    real time 0.01 seconds
    cpu time 0.01 seconds


    57
    58 data polingjw;
    59 set test;
    60 array var
  • var:;
    61 unique = (%ne(var, &num_vars));
    62 run;

    NOTE: There were 1000000 observations read from the data set WORK.TEST.
    NOTE: The data set WORK.POLINGJW has 1000000 observations and 21 variables.
    NOTE: DATA statement used (Total process time):
    real time 22.06 seconds
    cpu time 22.15 seconds
    [/pre]
  • Respected Advisor
    Posts: 3,777

    Re: Detecting unique values across variables

    See how you single statement performs if you use variable name references instead of array reference.
    Regular Contributor
    Posts: 171

    Re: Detecting unique values across variables

    Hi data _null_,

    I don’t get it. Don’t all three programs rely on array references? Please help me to understand.

    Overall, it seems to me that Chang’s program is probably the best answer. His is the simplest program, it is comparable to the macro program that I posted in terms of efficiency (for my test dataset anyway), and I can see how it’s less problematic than using macros. I created the macro only because I was curious to see how using a single statement would compare to the other two excellent suggestions in terms of efficiency. Since I personally like Chang’s solution better than my own, I almost did not post my solution. But, then again, I suspected that if I did post the macro this thread could turn into a good learning experience for myself :-)

    Thanks again.
    Respected Advisor
    Posts: 3,777

    Re: Detecting unique values across variables

    My hypothesis is that

    var1 eq var2
    is faster than
    var[1] eq var[2]

    I will easy to test since your variable names are VARn, a bit harder to generalize.
    Regular Contributor
    Posts: 241

    Re: Detecting unique values across variables

     
    @polingjw: Thanks for the nice words!

    I think _null_ has a point. On my box, polingjw2 seems to run faster than polingjw1 below.

       /* test data */
       data test;
         array var{20};
         do i = 1 to 1000000;
           do j=1 to dim(var);
             var{j} = floor(ranuni(0)*300);
           end;
           output;
         end;
         drop i j;
       run;
     
       sasfile work.test.data open;
     
       /* dummy step to load the data into sasfile.
          see Mark^s sas-l posting 
          http://www.listserv.uga.edu/cgi-bin/wa?A2=ind1102B&L=sas-l&D=1&H=0&O=D&T=1&P=10531 */
       data _null_
          set test;
       run;
     
       %macro ne(array, elements);
         %do i = 1 %to %eval(&elements-2);
           %do j = %eval(&i+1%to &elements;
             (&array[&i] ne &array[&j]) &
           %end;
         %end;
         (&array[%eval(&elements-1)] ne &array[&elements])
       %mend;
     
       data polingjw1;
         set test;
         array var
  •  var:;

  •      unique = (%ne(var, 20));
       run;   
     
       %macro allDiff(root=, dim=);
         %local i j maxi minj;
         %let maxi = %eval(&dim - 2);
         %do i = 1 %to &maxi;
           %let minj = %eval(&i + 1);
           %do j = &minj %to &dim;
             %*; &root.&i ne &root&j &
           %end;
         %end;
         &root.%eval(&dim - 1) ne &root.&dim
       %mend  allDiff;
     
       /* using no arrays but the variable names directly. it also  
            takes advantage of if expression short-circuiting as well.
          see this and the thread mentioned in it:
            http://www.listserv.uga.edu/cgi-bin/wa?A2=ind0406D&L=sas-l&D=1&H=0&O=D&T=1&P=45708 */
       data polingjw2;
          set test;
          if %allDiff(root=var, dim=20then unique = 1;
          else unique = 0;
       run;
     
       sasfile work.test.data close;
     
    Regular Contributor
    Posts: 171

    Re: Detecting unique values across variables

    A quick question about your last posting... Wouldn’t using a LOAD option on the SASFILE statement eliminate the need for the dummy step?

    [pre]
    sasfile work.test.data load;
    [/pre]
    Regular Contributor
    Posts: 241

    Re: Detecting unique values across variables

    @polingjw: Yep. You are absolutely correct. I should have used load instead of open. Thanks.
    Contributor
    Posts: 29

    Re: Detecting unique values across variables

    Wow! Nice to see so many ways to do this. Here you go My contribution.
    Another Method using dictionary tables Smiley Happy.

    /*Creating test data */
    data test;
    array var{5};
    do j=1 to 10;
    do i=1 to dim(var);
    var{i}= floor(ranuni(0)*60);
    end;
    output;
    end;
    drop i j;
    run;

    /* here it checks the uniqueness. Doing a not to get 1 for uniquenes */
    data t;
    set test;
    unique=not(var1=var2=var3=var4=var5);
    run;

    /* Taking the numeric column name from the sashelp.vcolumn(dictionary) and creating a macro */
    proc sql;
    select name INTO :VARS SEPARATED BY '=' from sashelp.Vcolumn where libname='WORK'
    and memname='TEST' and type='num';
    quit;

    %PUT &VARS.;
    DATA T;
    SET TEST;
    UNIQUE=NOT(&VARS);
    RUN;

    I've not done bench mark against the other methods to see the performance though.
    Regular Contributor
    Posts: 241

    Re: Detecting unique values across variables

    @jonam: A good try! Unfortunately, however, your code is incorrect (unless you have only two variables). The reason is simply that the negation of "all same" is not "all different."



    Your SAS expression


        not(var1=var2=var3=var4)


    is a shorthand form of


        not (var1=var2 & var2=var3 & var3=var4)


    And this is equivalent to


        (var1^=var2 | var2^=var3 | var3^=var4)




    This evaluates true, if any of the adjacent pairs happens to be different, even when the others are the same.
    Contributor
    Posts: 29

    Re: Detecting unique values across variables

    @chang.

    Oh! you are absolutely right! Thanks a lot for pointing that out....I've taken a different approach.

    /*Creating test data */


    data test;

    do i = 1 to 4;

    do j = 1 to 4;

    do k = 1 to 4;

    output;

    end;

    end;

    end;

    run;

    /* Taking the numeric column name from the sashelp.vcolumn(dictionary) and creating a macro */
    proc sql;
    select compress(name), count(name) INTO :VARS SEPARATED BY ' ',:max from sashelp.Vcolumn where libname='WORK'
    and memname='TEST' and type='num';
    quit;

    %PUT &VARS.;
    /*
    Samevar: to avoid comparing the same variable
    Next : Once identified equal Move to next observation
    */

    option obs=max;
    DATA T;
    SET TEST;
    array a{*} &vars.;
    end=&max.;
    flag=0;
    do l=1 to end;
    unique=1;
    do l1=2 to end;
    if l=l1 then go to Samevar;
    if a{l}=a{l1} then
    do;
    unique=0;
    flag=1;
    end;
    if flag=1 then go to next;
    end;
    Samevar:

    end;
    next:
    drop l l1 end flag;
    RUN; removed the old method.


    Message was edited by: jonam
    Super User
    Posts: 9,681

    Re: Detecting unique values across variables

    Dear Arthur.Carpenter:
    I do not know whether you like hash table, I use it find another way.


    [pre]
    data temp;
    do i = 1 to 4;
    do j = 1 to 4;
    do k = 1 to 4;
    output;
    end;
    end;
    end;
    run;

    data result(drop=rc count _n id);
    set temp;
    declare hash hh(hashexp: 10);
    declare hiter ff('hh');
    hh.definekey('id');
    hh.definedone();
    array var{*} i--k;
    do _n=1 to dim(var);
    id=var{_n};
    hh.replace();
    end;
    count=0;
    rc=ff.first();
    do while(rc=0);
    count+1;
    rc=ff.next();
    end;
    if count=dim(var) then flag=1;
    else flag=0;
    run;
    [/pre]




    Ksharp
    Ask a Question
    Discussion stats
    • 14 replies
    • 1660 views
    • 0 likes
    • 6 in conversation