art297
Opal | Level 21

Linlin: This also would have worked:

"graycode function" sas

MikeZdeb
Rhodochrosite | Level 12

hi ... never having seen the GRAYCODE function, I was interested in your code versus that posted by Pierre, since they differ in the initialization of the 0/1 array and in the initial value of the variable K. I wanted to know whether changing those values makes any difference.

I added a few lines of code to your data step and created a data set (HAIKUO) of the 63 different combinations of 0/1 in the array _Y that get used to find the minimum difference

I compared the 63 combinations to those generated by Pierre's posted code (data set PIERRE) ... I think I got this correct ...

* HAIKUO posted code ... make a new data set with ID and _Y1 - _Y6;

data want (drop=_:) haikuo (keep=id _y1-_y6);

set before;

   array x x1-x6;

   array _y(6) (1 1 1 1 1 1);

   _k=0;

   do _i=1 to 2**dim(_y)-1;

      _sum=0;

      _rc=graycode(_k, of _y[*]);

      * create variable ID from current values of _Y1 - _Y6 and write to data set;

       id = cats(of _y(*));

       if _n_ eq 1 then output haikuo;

          do _j=1 to dim(_y);

               _sum+x(_j)*_y(_j);

            end;

            _diff=abs(_sum-v);

            if _diff < min_diff then pos=cats(of _y(*));

            min_diff=min(min_diff,_diff);

       end;

    output want;

    run;

    * Pierre's code ... create 0/1 variables;

    data pierre (keep=id x1-x6);

    array x{6}(1 0 0 0 0 0);

    k = 1;

    do i = 2 to 2**6;

       id = cats(of x1-x6);

       output;

       rc = graycode(k, of x1-x6);

       end;

    run;

    * merge data sets, look for differences;

    proc sort data=haikuo;

    by id;

    run;

    proc sort data=pierre;

    by id;

    run;

    data not_the_same;

    merge haikuo (in=h) pierre (in=p);

    by id;

    if not (h and p);

    run;


    there are differences ...

    _y1    _y2    _y3    _y4    _y5    _y6      id      x1    x2    x3    x4    x5    x6

    0      0      0      0      0      0     000000     .     .     .     .     .     .

    .      .      .      .      .      .     000001     0     0     0     0     0     1

    .      .      .      .      .      .     100001     1     0     0     0     0     1

    HAIKUO's code evaluates the combination '000000', which Pierre's code does not evaluate

    it never uses two of the possible combinations, '000001' and '100001' (both are used by Pierre)

    there is also one combination that gets evaluated twice (a duplicate ... no dups in Pierre's data) ...

    proc sort data=haikuo out=h dupout=dp nodupkey;

    by id;

    run;

    data set DP ...

    _y1    _y2    _y3    _y4    _y5    _y6      id

    0      1      1      1      1      1     011111

    my conclusion is that I'm not sure that I understand GRAYCODE ... it remains another of my SAS "gray areas"

    ps disclaimer ... I did say that "I think I got this correct"

    art297
    Opal | Level 21

    Mike: I don't think (again, like you, I'm only thinking on this thread) that Haikuo's code is wrong.  If you run PGStats' code over multiple iterations of the same data, sometimes it accepts zeros as part of the needed calculations and sometimes it doesn't.

    The graycode function is simply building a set of all combinations.  How it is applied, on the other hand, is what I (think) you are questioning.

    MikeZdeb
    Rhodochrosite | Level 12

    hi ... if you are not sure that all the possible combinations are tested, can you rely on the answer

    art297
    Opal | Level 21

    Mike,

    Upon further review, the following appears to produce all possible combinations:

    data want;

       array _y(6) (1 1 1 1 1 1);

       _k=-1;

       do _i=1 to 2**dim(_y);

          _rc=graycode(_k, of _y[*]);

          pos=cats(of _y(*));

           if _rc ne 0 then output;

       end;

    run;

    /*check*/

    proc sort data=want nodupkey;

      by pos;

    run;

    Haikuo
    Onyx | Level 15

    Thanks, Mike, for the in-depth following up. It is rather puzzling, at least for me right now. Here is the quote from SAS doc:

    "To generate all subsets of n items, you can initialize k to a negative value and execute GRAYCODE in a loop that iterates 2**n times. If you want to start with a non-empty subset, then initialize k to be the number of items in the subset, initialize the other arguments to specify the desired initial subset, and execute GRAYCODE in a loop that iterates 2**n-1 times. The sequence of subsets that are generated by GRAYCODE is cyclical, so you can begin with any subset that you want."


    So it seems to me that if I initialize k=0, it should exhaust all of the combinations except the null set (000000) in 2**6-1 iterations. In practice, however, it not only generates the null set (000000) but is also not really "cyclical", so it misses some combinations; and it appears that if I set k>0, it misses even more.
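A minimal sketch of the pattern described in the documentation quote above: k starts negative and GRAYCODE is called 2**n times. The three-element array `y` and the variable names are illustrative assumptions, not taken from the posted code, and no expected output is shown.

```sas
data _null_;
   array y[3];                 /* hypothetical 3-item example, so 2**3 = 8 subsets */
   k = -1;                     /* negative k, as the documentation suggests */
   do i = 1 to 2**dim(y);
      rc = graycode(k, of y[*]);
      subset = cats(of y[*]);  /* current 0/1 pattern as a string */
      put i= subset= k= rc=;   /* write each step to the log for inspection */
   end;
run;
```

Writing each pattern to the log is a simple way to check whether every one of the 2**n patterns appears exactly once before relying on the loop in a search.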

    Given the condition, the fix should be simple:

    data want;

    set before;

       array x x1-x6;

       array _y(6) (1 1 1 1 1 1);

       _k=-1;

       do _i=1 to 2**dim(_y);

          _sum=0;

          _rc=graycode(_k, of _y[*]);

          do _j=1 to dim(_y);

              if _rc >0 then _sum+x(_j)*_y(_j);

            end;

            _diff=abs(_sum-v);

            if _diff < min_diff then pos=cats(of _y(*));

            min_diff=min(min_diff,_diff);

       end;

       drop _:;

    run;

    Haikuo, in gray zone, too

    MikeZdeb
    Rhodochrosite | Level 12

    hi ... OK, maybe not so gray anymore

    another idea, a tiny bit slower, but no gray area involved

    also, I think you need to initialize MIN_DIFF since if the first combination of 0/1 generates the minimum difference, no value is ever assigned to POS (yes/no?) ...

    data want;

    set before;

       array x(6);

       min_diff = 1e6;   * initialize once per observation, before the loop;

       do _i=1 to 2**dim(x)-1;

       _sum=0;

       do _j=1 to dim(x);

          _sum+x(_j)*input(char(put(_i,binary6.),_j),1.);

       end;

         _diff=abs(_sum-v);

         if _diff < min_diff then pos=put(_i,binary6.);

         min_diff=min(min_diff,_diff);

       end;

    drop _: ;

    run;
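A small sketch of how the BINARYw. format used above turns the loop counter into a 0/1 pattern. It assumes three positions instead of six to keep the log short; the variable names are illustrative.

```sas
data _null_;
   do i = 1 to 2**3 - 1;            /* 1..7: every non-empty subset of 3 items */
      seq = put(i, binary3.);       /* e.g. 5 is written as '101' */
      put i= seq=;
   end;
run;
```

Because the counter runs through every integer from 1 to 2**n-1, each non-empty pattern is visited exactly once and in a known order.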

    Haikuo
    Onyx | Level 15

    Quote "I think you need to initialize MIN_DIFF since if the first combination of 0/1 generates the minimum difference, no value is ever assigned to POS"

    Absolutely agreed. I want you to be my SAS instructor, but I need someone else to score my exam :smileysilly:

    That being said, the code can be made more robust by avoiding the artificially large number '1e6'.  The modification would include:

    1. initialize "min_diff=abs(x1-v)" ;

    2. change "if _diff<min_diff" to "if _diff<=min_diff".

    Yes/no?

    Thanks, Mike, again for your sharp insight.

    Haikuo

    Update: abs() applied.

    art297
    Opal | Level 21

    Haikuo,

    I would initialize with:  min_diff=CONSTANT('BIG')

    At least, that way, you would know that SAS couldn't accurately represent a larger number.
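A quick sketch of what that initialization gives you; the variable names here are illustrative only.

```sas
data _null_;
   big = constant('big');      /* largest double-precision value SAS can represent */
   put big=;
   /* any finite difference compares smaller than BIG, so the first
      comparison in the search loop always succeeds */
   _diff = 1e6;
   if _diff < big then put 'first comparison succeeds';
run;
```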

    Haikuo
    Onyx | Level 15

    Thanks, Art. It is not even 9am central time, and I have already learned one new function: constant(). It is going to be a good day.

    Haikuo

    art297
    Opal | Level 21

    Mike,

    Below are slightly modified versions of both your and Haikuo's code and a dataset containing all 63 combinations.  Both appear to end up with the same results, but Haikuo's code runs almost three times faster.

    On my machine, given 63,000 records, Haikuo's code ran in 5.5 seconds.

    data before (drop=i sequence id);

      informat sequence $11.;

      input x1-x6 v sequence id;

      do i=1 to 1000;output;end;

      cards;

    2 2 2 2 2 1 1 6           1

    2 2 2 2 1 2 1 5           2

    3 3 3 3 1 1 2 5-6         3

    2 2 2 1 2 2 1 4           4

    3 3 3 1 3 1 2 4-6         5

    3 3 3 1 1 3 2 4-5         6

    4 4 4 1 1 1 3 4-5-6       7

    2 2 1 2 2 2 1 3           8

    3 3 1 3 3 1 2 3-6         9

    3 3 1 3 1 3 2 3-5         10

    4 4 1 4 1 1 3 3-5-6       11

    3 3 1 1 3 3 2 3-4         12

    4 4 1 1 4 1 3 3-4-6       13

    4 4 1 1 1 4 3 3-4-5       14

    5 5 1 1 1 1 4 3-4-5-6     15

    2 1 2 2 2 2 1 2           16

    3 1 3 3 3 1 2 2-6         17

    3 1 3 3 1 3 2 2-5         18

    4 1 4 4 1 1 3 2-5-6       19

    3 1 3 1 3 3 2 2-4         20

    4 1 4 1 4 1 3 2-4-6       21

    4 1 4 1 1 4 3 2-4-5       22

    5 1 5 1 1 1 4 2-4-5-6     23

    3 1 1 3 3 3 2 2-3         24

    4 1 1 4 4 1 3 2-3-6       25

    4 1 1 4 1 4 3 2-3-5       26

    5 1 1 5 1 1 4 2-3-5-6     27

    4 1 1 1 4 4 3 2-3-4       28

    5 1 1 1 5 1 4 2-3-4-6     29

    5 1 1 1 1 5 4 2-3-4-5     30

    6 1 1 1 1 1 5 2-3-4-5-6   31

    1 2 2 2 2 2 1 1           32

    1 3 3 3 3 1 2 1-6         33

    1 3 3 3 1 3 2 1-5         34

    1 4 4 4 1 1 3 1-5-6       35

    1 3 3 1 3 3 2 1-4         36

    1 4 4 1 4 1 3 1-4-6       37

    1 4 4 1 1 4 3 1-4-5       38

    1 5 5 1 1 1 4 1-4-5-6     39

    1 3 1 3 3 3 2 1-3         40

    1 4 1 4 4 1 3 1-3-6       41

    1 4 1 4 1 4 3 1-3-5       42

    1 5 1 5 1 1 4 1-3-5-6     43

    1 4 1 1 4 4 3 1-3-4       44

    1 5 1 1 5 1 4 1-3-4-6     45

    1 5 1 1 1 5 4 1-3-4-5     46

    1 6 1 1 1 1 5 1-3-4-5-6   47

    1 1 3 3 3 3 2 1-2         48

    1 1 4 4 4 1 3 1-2-6       49

    1 1 4 4 1 4 3 1-2-5       50

    1 1 5 5 1 1 4 1-2-5-6     51

    1 1 4 1 4 4 3 1-2-4       52

    1 1 5 1 5 1 4 1-2-4-6     53

    1 1 5 1 1 5 4 1-2-4-5     54

    1 1 6 1 1 1 5 1-2-4-5-6   55

    1 1 1 4 4 4 3 1-2-3       56

    1 1 1 5 5 1 4 1-2-3-6     57

    1 1 1 5 1 5 4 1-2-3-5     58

    1 1 1 6 1 1 5 1-2-3-5-6   59

    1 1 1 1 5 5 4 1-2-3-4     60

    1 1 1 1 6 1 5 1-2-3-4-6   61

    1 1 1 1 1 6 5 1-2-3-4-5   62

    1 1 1 1 1 1 6 1-2-3-4-5-6 63

    ;

    data after_mike;

       set before;

       array x(6);

       min_diff = constant('big');

       do _i=1 to 2**dim(x)-1;

         _sum=0;

        /*   min_diff = 1e6;*/

         do _j=1 to dim(x);

            _sum+x(_j)*input(char(put(_i,binary6.),_j),1.);

         end;

         _diff=abs(_sum-v);

         if _diff < min_diff then pos=put(_i,binary6.);

         min_diff=min(min_diff,_diff);

       end;

    drop _: ;

    run;

    data after_haikuo;

       set before;

       array x x1-x6;

       array _y(6) (1 1 1 1 1 1);

       min_diff = constant('big');

       _k=-1;

       do _i=1 to 2**dim(_y);

          _sum=0;

          _rc=graycode(_k, of _y[*]);

          if _rc gt 0 then do _j=1 to dim(_y);

            _sum+x(_j)*_y(_j);

          end;

          _diff=abs(_sum-v);

          if _diff < min_diff then do;

            pos=cats(of _y(*));

            min_diff=_diff;

         end;

          if _diff eq 0 then leave;

       end;

       drop _:;

    run;

    MikeZdeb
    Rhodochrosite | Level 12

    hi Art ... small changes (generate binary sequence once, use logical value versus the INPUT + PUT stuff)..

    data after_mike;

       set before;

       array x(6);

       min_diff = constant('big');

       do _i=1 to 2**dim(x)-1;

         _sum=0;

         _seq = put(_i, binary6.);

         do _j=1 to dim(x);

            _sum+x(_j)*(char(_seq,_j) eq '1');

         end;

         _diff=abs(_sum-v);

         if _diff < min_diff then pos=put(_i,binary6.);

         min_diff=min(min_diff,_diff);

       end;

    drop _: ;

    run;
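A minimal sketch of the "logical value" idea used in the step above: a comparison in SAS evaluates to 1 or 0, so it can serve directly as the multiplier. The string '101' and the variable names are made up for illustration.

```sas
data _null_;
   seq = '101';
   do j = 1 to 3;
      bit_logical = (char(seq, j) eq '1');    /* comparison yields 1 or 0 */
      bit_input   = input(char(seq, j), 1.);  /* same value via INPUT */
      put j= bit_logical= bit_input=;
   end;
run;
```

Both columns should agree; the logical form simply skips the character-to-numeric conversion.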

    test with 400,000 observations (using original posted data just ramped up a bit) ...

    NOTE: There were 400000 observations read from the data set WORK.BEFORE.

    NOTE: The data set WORK.AFTER_MIKE has 400000 observations and 9 variables.

    NOTE: DATA statement used (Total process time):

          real time           20.56 seconds

          cpu time            20.54 seconds

    versus ...

    NOTE: There were 400000 observations read from the data set WORK.BEFORE.

    NOTE: The data set WORK.AFTER_HAIKUO has 400000 observations and 9 variables.

    NOTE: DATA statement used (Total process time):

          real time           17.06 seconds

          cpu time            16.82 seconds

    not that much slower now and no "gray area" in that I know exactly what sequence of 0s and 1s is used each time (but ... still liked learning there's a GRAYCODE function)

    ps ... with only 60,000, not much difference ...

    NOTE: There were 60000 observations read from the data set WORK.BEFORE.

    NOTE: The data set WORK.AFTER_MIKE has 60000 observations and 9 variables.

    NOTE: DATA statement used (Total process time):

          real time           3.09 seconds

          cpu time            3.09 seconds

    NOTE: There were 60000 observations read from the data set WORK.BEFORE.

    NOTE: The data set WORK.AFTER_HAIKUO has 60000 observations and 9 variables.

    NOTE: DATA statement used (Total process time):

          real time           2.54 seconds

          cpu time            2.51 seconds


    AND, EVEN MORE INTERESTING, I changed this ... min_diff = constant('big'); back to this ... min_diff = 1e6;

    and the jobs run a LOT faster ...

    after_mike

    60,000 obs now runs in 1.84 seconds CPU (vs 3.09)

    400,000 obs now runs in 11.93 seconds CPU (vs 20.54)

    after_haikuo

    60,000 obs now runs in 1.29 seconds CPU (vs 2.51)

    400,000 obs now runs in 8.54 seconds CPU (vs 16.82)

    so, you pay a price using the CONSTANT function when all you really need is a very large number (yes/no?)





    art297
    Opal | Level 21

    Mike,  This gets more interesting with each discovery.  Yes, I too was quite surprised at the effect of including the CONSTANT function.  However, that can be alleviated by simply assigning the value to a macro variable just before the DATA step.

    I did that in the code below and, for a reason I can't explain, Haikuo's code experienced a much greater benefit.  Regarding your comment about differences based on number of records, in both cases the difference was about 22 to 23% (on your machine).  On my machine, the difference was around 35%.

    However, with the new change, the difference is now 167%.  Using your code, 4.75 seconds, using Haikuo's code, 1.78 seconds.

    %let min_diff=%sysfunc(constant(BIG));

    data after_mike;

       set before;

       array x(6);

       min_diff = &min_diff.;

       do _i=1 to 2**dim(x)-1;

         _sum=0;

         _seq = put(_i, binary6.);

         do _j=1 to dim(x);

            _sum+x(_j)*(char(_seq,_j) eq '1');

         end;

         _diff=abs(_sum-v);

         if _diff < min_diff then pos=put(_i,binary6.);

         min_diff=min(min_diff,_diff);

       end;

    drop _: ;

    run;

    data after_haikuo;

       set before;

       array x x1-x6;

       array _y(6) (1 1 1 1 1 1);

       min_diff = &min_diff.;

       _k=-1;

       do _i=1 to 2**dim(_y);

          _sum=0;

          _rc=graycode(_k, of _y[*]);

          if _rc gt 0 then do _j=1 to dim(_y);

            _sum+x(_j)*_y(_j);

          end;

          _diff=abs(_sum-v);

          if _diff < min_diff then do;

            pos=cats(of _y(*));

            min_diff=_diff;

         end;

          if _diff eq 0 then leave;

       end;

       drop _:;

    run;

    Ksharp
    Super User

    It looks like I am late.

    data have;
    input var1-var6 c;
    cards;
    0.279567742  0.978487097 0.978487097 0.978487097 0.978487097 0.139783871 3
    0.928564286 1.083325 1.083325 1.083325 0.154760714 0 1.2
    0.838703226 0.978487097 0.978487097 0.978487097 0.559135484 0 2.8
    0.43333 1.011103333 1.011103333 1.011103333 0.86666 0 4.333
    ;
    run;
    
    
    
    
    
    
    data x;
    input (var1-var6) (: $40.);
    cards;
    var1 var2 var3 var4 var5 var6
    ;
    run;
    proc summary data=x ;
    class var1-var6;
    output out=temp(drop=_:);
    run;
    data exp(keep=expression _expression);
    set temp end=last;
    length expression _expression $ 32767;
    retain expression;
    expression=catx(' ',expression, catx('+',of var1-var6));
    if last then do; _expression=expression;output;end;
    run;
    data have;
     set have;
    if _n_ eq 1 then set exp;
    run;
    
    data want;
     set have;
     min=999999;
     array _v{*} var1-var6 ;
     flag=repeat('0',dim(_v)-1);
     do i=1 to dim(_v);
      expression=tranwrd(expression,strip(vname(_v{i})),strip(_v{i}));
     end;
    
     j=1;
     temp=scan(expression,j,' ');
     do while(not missing(temp));
      sum=resolve(cats('%sysevalf(',temp,')'));
      diff=abs(sum-c); 
      if diff lt min then do; want=scan(_expression,j,' ');min=diff; 
    end;
      j+1;
      temp=scan(expression,j,' ');
     end;
    
     k=1;
     _temp=scan(want,k,'var+');
     do while(not missing(_temp));
     substr(flag,input(_temp,best8.),1)='1';
     k+1;
     _temp=scan(want,k,'var+');
     end;
    drop i j k temp _temp sum  expression _expression min;
    run;
    
    
    
    

    Ksharp

    PGStats
    Opal | Level 21

    What I understand from the GRAYCODE function documentation is 1) If you initialize it with k=-1 then it zeros the variable vector on the first iteration 2) if you want to start the sequence elsewhere than zero then you must provide a set of starting values in the variable vector and k must be equal to the number of ones in your starting set. 3) the Gray code is cyclical, so as long as you start it properly, it will cover the whole set of 2**n subsets.
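Point 2 can be sketched as follows, assuming a three-item array that starts with only the first item selected. The array size and names are illustrative, and no expected output is shown.

```sas
data _null_;
   array x[3] (1 0 0);         /* hypothetical starting subset: item 1 only */
   k = 1;                      /* must equal the number of ones in the starting subset */
   put 'start: ' x[*];
   do i = 1 to 2**dim(x) - 1;  /* 2**n-1 further calls, per the documentation */
      rc = graycode(k, of x[*]);
      put i= k= ' x=' x[*];
   end;
run;
```

If k does not match the number of ones in the starting array (for example, all ones with k=0), the starting state is inconsistent with what the documentation asks for, which may explain skipped or repeated combinations.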

    I agree with Art that the code I provided before was somewhat inefficient in the last SQL part. Here is a more efficient version (note the inclusion of variable v in the scoring process) :

    data before;
    input x1-x6 v;
    cards;
    0.279567742  0.978487097 0.978487097 0.978487097 0.978487097 0.139783871 3
    0.928564286 1.083325 1.083325 1.083325 0.154760714 0 1.2
    0.838703226 0.978487097 0.978487097 0.978487097 0.559135484 0 2.8
    0.43333 1.011103333 1.011103333 1.011103333 0.86666 0 4.333
    ;

    data x(keep=id x1-x6 v);
    length id $6;
    array x{6}(6*0);
    v = -1;
    /* if id=000000 is an acceptable answer, use this */
    *k = -1;
    *do i = 1 to 2**6;
    /* otherwise use this */
    x{1} = 1;
    k = 1;
    do i = 2 to 2**6;
         id = cats(of x1-x6);
         output;
         rc = graycode(k, of x1-x6);
         end;
    run;

    data scoring /view=scoring;
    set before;
    obs + 1; _TYPE_="PARMS"; _NAME_="SUMX";
    run;

    proc score data=x score=scoring type=PARMS out=scored;
    by obs;
    var v x1-x6;
    id id;
    run;

    data postScore(keep=obs id score) / view=postScore;
    set scored;
    score = abs(SUMX);
    run;

    proc means data=postScore noprint;
    by obs;
    output out=minScores(drop=_:) idgroup(MIN(score) OUT(id)=);
    run;

    data after;
    merge minScores scoring;
    by obs;
    drop obs _:;
    run;

    PG

