BookmarkSubscribeRSS Feed
ArtC
Rhodochrosite | Level 12
I have a list of numeric variables (always integers) and would like to determine when they have distinct values. In the following code the three variables are I, J, K, but in reality there can be a large number of variables. Thoughts on an easier way to replace the [pre]i ne j & i ne k & j ne k[/pre]

[pre]data alldiff;
do i = 1 to 4;
do j = 1 to 4;
do k = 1 to 4;
if i ne j & i ne k & j ne k then output alldiff;
end;end;end;
run;
proc print data=alldiff;
run;[/pre]
14 REPLIES 14
chang_y_chung_hotmail_com
Obsidian | Level 7
Here is one way. In the worst case, you will make 0.5*n*(n-1) comparisons per obs.



   /* test data */


   data one;


     do i = 1 to 4;


       do j = 1 to 4;


         do k = 1 to 4;


           output;


         end;


       end;


     end;


   run;


 


   /* flag if all ijk values are different */


   data two;


     set one;


     array var(*) i j k;


     unique = 1


     drop p q;


     do p = 1 to dim(var)-1 while(unique);


        do q = p + 1 to dim(var) while(unique);


           unique = var(p)^=var(q);


        end;


     end


   run;


 


   /* check */


   proc print data=two(obs=10);


   run;


   /* on lst


   Obs    i    j    k    unique


 


     1    1    1    1       0


     2    1    1    2       0


     3    1    1    3       0


     4    1    1    4       0


     5    1    2    1       0


     6    1    2    2       0


     7    1    2    3       1


     8    1    2    4       1


     9    1    3    1       0


    10    1    3    2       1


   */ 

data_null__
Jade | Level 19
I don't think this is really any different from Chang's program just "more statements" 🙂

I don't unstand why I had to define a data item for the hash in order to use the ADD method syntax with the KEY argument, h.add(key:v,data:' ')
It works without DEFINEDATA and the DATA argument but you have assign Y and then call the ADD method, y=v; rc=h.add(), but h.add(key:v) does not.

[pre]
data alldiff;
length y 8 d $1;
call missing(y,d);
declare hash h();
h.definekey('y');
h.definedata('d');
h.definedone();
do i = 1 to 4;
do j = 1 to 4;
do k = 1 to 4;
h.clear();
array v
  • i j k;
    do l=1 to dim(v) until(h.add(key:v,data:' ') gt 0);
    end;
    unique = l eq dim(v)+1;
    output;
    end;
    end;
    end;
    drop y d l;
    run;
    [/pre]
  • ArtC
    Rhodochrosite | Level 12
    Thank you Chang and Data _NULL_. I was afraid (and hoping) that I might have been missing something like a uniquevarvalues function. For either solution there is quite a bit of checking that needs to take place. It will be interesting to see how the hash solution scales for larger data sets and numbers of vars.
    Thanks
    Art
    polingjw
    Quartz | Level 8
    I was also curious to see how the hash solution compares for larger datasets with more variables. I ran a test on a dataset with 1000000 observations and 20 variables. Also, I tried testing a third solution which uses a simple macro to create a single statement in the data step. Here are the results.

    [pre]
    1 data test;
    2 array var{20};
    3 do i = 1 to 1000000;
    4 do j=1 to dim(var);
    5 var{j} = floor(ranuni(0)*300);
    6 end;
    7 output;
    8 end;
    9 drop i j;
    10 run;

    NOTE: The data set WORK.TEST has 1000000 observations and 20 variables.
    NOTE: DATA statement used (Total process time):
    real time 3.57 seconds
    cpu time 3.57 seconds


    11
    12 data chang;
    13 set test;
    14 array var(*) var:;
    NOTE: The array var has the same name as a SAS-supplied or user-defined function. Parentheses
    following this name are treated as array references and not function references.
    15 unique = 1;
    16 drop p q;
    17 do p = 1 to dim(var)-1 while(unique);
    18 do q = p + 1 to dim(var) while(unique);
    19 unique = var(p)^=var(q);
    20 end;
    21 end;
    22 run;

    NOTE: There were 1000000 observations read from the data set WORK.TEST.
    NOTE: The data set WORK.CHANG has 1000000 observations and 21 variables.
    NOTE: DATA statement used (Total process time):
    real time 24.56 seconds
    cpu time 25.00 seconds


    23
    24 data data_null_;
    25 length y 8 d $1;
    26 if _n_ = 1 then do;
    27 call missing(y,d);
    28 declare hash h();
    29 h.definekey('y');
    30 h.definedata('d');
    31 h.definedone();
    32 end;
    33 set test;
    34 h.clear();
    35 array v
  • var:;
    36 do l=1 to dim(v) until(h.add(key:v,data:' ') gt 0);
    37 end;
    38 unique = l eq dim(v)+1;
    39 output;
    40 drop y d l;
    41 run;

    NOTE: There were 1000000 observations read from the data set WORK.TEST.
    NOTE: The data set WORK.DATA_NULL_ has 1000000 observations and 21 variables.
    NOTE: DATA statement used (Total process time):
    real time 36.87 seconds
    cpu time 37.23 seconds


    42
    43 %macro ne(array, elements);
    44 %do i = 1 %to %eval(&elements-2);
    45 %do j = %eval(&i+1) %to &elements;
    46 (&array[&i] ne &array[&j]) &
    47 %end;
    48 %end;
    49 (&array[%eval(&elements-1)] ne &array[&elements])
    50 %mend;
    51
    52 data _null_;
    53 if 0 then set test;
    54 array var
  • var:;
    55 call symputx('num_vars', dim(var));
    56 run;

    NOTE: DATA STEP stopped due to looping.
    NOTE: DATA statement used (Total process time):
    real time 0.01 seconds
    cpu time 0.01 seconds


    57
    58 data polingjw;
    59 set test;
    60 array var
  • var:;
    61 unique = (%ne(var, &num_vars));
    62 run;

    NOTE: There were 1000000 observations read from the data set WORK.TEST.
    NOTE: The data set WORK.POLINGJW has 1000000 observations and 21 variables.
    NOTE: DATA statement used (Total process time):
    real time 22.06 seconds
    cpu time 22.15 seconds
    [/pre]
  • data_null__
    Jade | Level 19
    See how you single statement performs if you use variable name references instead of array reference.
    polingjw
    Quartz | Level 8
    Hi data _null_,

    I don’t get it. Don’t all three programs rely on array references? Please help me to understand.

    Overall, it seems to me that Chang’s program is probably the best answer. His is the simplest program, it is comparable to the macro program that I posted in terms of efficiency (for my test dataset anyway), and I can see how it’s less problematic than using macros. I created the macro only because I was curious to see how using a single statement would compare to the other two excellent suggestions in terms of efficiency. Since I personally like Chang’s solution better than my own, I almost did not post my solution. But, then again, I suspected that if I did post the macro this thread could turn into a good learning experience for myself 🙂

    Thanks again.
    data_null__
    Jade | Level 19
    My hypothesis is that

    var1 eq var2
    is faster than
    var[1] eq var[2]

    I will easy to test since your variable names are VARn, a bit harder to generalize.
    chang_y_chung_hotmail_com
    Obsidian | Level 7
     
    @polingjw: Thanks for the nice words!

    I think _null_ has a point. On my box, polingjw2 seems to run faster than polingjw1 below.

       /* test data */
       data test;
         array var{20};
         do i = 1 to 1000000;
           do j=1 to dim(var);
             var{j} = floor(ranuni(0)*300);
           end;
           output;
         end;
         drop i j;
       run;
     
       sasfile work.test.data open;
     
       /* dummy step to load the data into sasfile.
          see Mark^s sas-l posting 
          http://www.listserv.uga.edu/cgi-bin/wa?A2=ind1102B&L=sas-l&D=1&H=0&O=D&T=1&P=10531 */
       data _null_
          set test;
       run;
     
       %macro ne(array, elements);
         %do i = 1 %to %eval(&elements-2);
           %do j = %eval(&i+1%to &elements;
             (&array[&i] ne &array[&j]) &
           %end;
         %end;
         (&array[%eval(&elements-1)] ne &array[&elements])
       %mend;
     
       data polingjw1;
         set test;
         array var
  •  var:;

  •      unique = (%ne(var, 20));
       run;   
     
       %macro allDiff(root=, dim=);
         %local i j maxi minj;
         %let maxi = %eval(&dim - 2);
         %do i = 1 %to &maxi;
           %let minj = %eval(&i + 1);
           %do j = &minj %to &dim;
             %*; &root.&i ne &root&j &
           %end;
         %end;
         &root.%eval(&dim - 1) ne &root.&dim
       %mend  allDiff;
     
       /* using no arrays but the variable names directly. it also  
            takes advantage of if expression short-circuiting as well.
          see this and the thread mentioned in it:
            http://www.listserv.uga.edu/cgi-bin/wa?A2=ind0406D&L=sas-l&D=1&H=0&O=D&T=1&P=45708 */
       data polingjw2;
          set test;
          if %allDiff(root=var, dim=20then unique = 1;
          else unique = 0;
       run;
     
       sasfile work.test.data close;
     
    polingjw
    Quartz | Level 8
    A quick question about your last posting... Wouldn’t using a LOAD option on the SASFILE statement eliminate the need for the dummy step?

    [pre]
    sasfile work.test.data load;
    [/pre]
    chang_y_chung_hotmail_com
    Obsidian | Level 7
    @polingjw: Yep. You are absolutely correct. I should have used load instead of open. Thanks.
    jonam
    Calcite | Level 5
    Wow! Nice to see so many ways to do this. Here you go My contribution.
    Another Method using dictionary tables :).

    /*Creating test data */
    data test;
    array var{5};
    do j=1 to 10;
    do i=1 to dim(var);
    var{i}= floor(ranuni(0)*60);
    end;
    output;
    end;
    drop i j;
    run;

    /* here it checks the uniqueness. Doing a not to get 1 for uniquenes */
    data t;
    set test;
    unique=not(var1=var2=var3=var4=var5);
    run;

    /* Taking the numeric column name from the sashelp.vcolumn(dictionary) and creating a macro */
    proc sql;
    select name INTO :VARS SEPARATED BY '=' from sashelp.Vcolumn where libname='WORK'
    and memname='TEST' and type='num';
    quit;

    %PUT &VARS.;
    DATA T;
    SET TEST;
    UNIQUE=NOT(&VARS);
    RUN;

    I've not done bench mark against the other methods to see the performance though.
    chang_y_chung_hotmail_com
    Obsidian | Level 7
    @jonam: A good try! Unfortunately, however, your code is incorrect (unless you have only two variables). The reason is simply that the negation of "all same" is not "all different."



    Your SAS expression


        not(var1=var2=var3=var4)


    is a shorthand form of


        not (var1=var2 & var2=var3 & var3=var4)


    And this is equivalent to


        (var1^=var2 | var2^=var3 | var3^=var4)




    This evaluates true, if any of the adjacent pairs happens to be different, even when the others are the same.
    jonam
    Calcite | Level 5
    @chang.

    Oh! you are absolutely right! Thanks a lot for pointing that out....I've taken a different approach.

    /*Creating test data */


    data test;

    do i = 1 to 4;

    do j = 1 to 4;

    do k = 1 to 4;

    output;

    end;

    end;

    end;

    run;

    /* Taking the numeric column name from the sashelp.vcolumn(dictionary) and creating a macro */
    proc sql;
    select compress(name), count(name) INTO :VARS SEPARATED BY ' ',:max from sashelp.Vcolumn where libname='WORK'
    and memname='TEST' and type='num';
    quit;

    %PUT &VARS.;
    /*
    Samevar: to avoid comparing the same variable
    Next : Once identified equal Move to next observation
    */

    option obs=max;
    DATA T;
    SET TEST;
    array a{*} &vars.;
    end=&max.;
    flag=0;
    do l=1 to end;
    unique=1;
    do l1=2 to end;
    if l=l1 then go to Samevar;
    if a{l}=a{l1} then
    do;
    unique=0;
    flag=1;
    end;
    if flag=1 then go to next;
    end;
    Samevar:

    end;
    next:
    drop l l1 end flag;
    RUN; removed the old method.


    Message was edited by: jonam
    Ksharp
    Super User
    Dear Arthur.Carpenter:
    I do not know whether you like hash table, I use it find another way.


    [pre]
    data temp;
    do i = 1 to 4;
    do j = 1 to 4;
    do k = 1 to 4;
    output;
    end;
    end;
    end;
    run;

    data result(drop=rc count _n id);
    set temp;
    declare hash hh(hashexp: 10);
    declare hiter ff('hh');
    hh.definekey('id');
    hh.definedone();
    array var{*} i--k;
    do _n=1 to dim(var);
    id=var{_n};
    hh.replace();
    end;
    count=0;
    rc=ff.first();
    do while(rc=0);
    count+1;
    rc=ff.next();
    end;
    if count=dim(var) then flag=1;
    else flag=0;
    run;
    [/pre]




    Ksharp

    sas-innovate-2024.png

    Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

    Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

     

    Register now!

    How to Concatenate Values

    Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

    Find more tutorials on the SAS Users YouTube channel.

    Click image to register for webinarClick image to register for webinar

    Classroom Training Available!

    Select SAS Training centers are offering in-person courses. View upcoming courses for:

    View all other training opportunities.

    Discussion stats
    • 14 replies
    • 4523 views
    • 0 likes
    • 6 in conversation