DATA Step, Macro, Functions and more

find variables whose sum minimizes distance with another variable

Accepted Solution Solved
Reply
Contributor
Posts: 43
Accepted Solution

find variables whose sum minimizes distance with another variable

hi everybody,

I need to find the variables in an array (x1-x6) whose sum has a minimum distance with a variable v. for instance to the following dataset

data before;

input x1-x6 v;

cards;

0.279567742  0.978487097 0.978487097 0.978487097 0.978487097 0.139783871 3

0.928564286 1.083325 1.083325 1.083325 0.154760714 0 1.2

0.838703226 0.978487097 0.978487097 0.978487097 0.559135484 0 2.8

0.43333 1.011103333 1.011103333 1.011103333 0.86666 0 4.333

;

run;

I would like to add a string that indicates which variables provide the most similar value to variable v such as:

data after;

input x1-x6 v r $;

cards;

0.279567742 0.978487097 0.978487097 0.978487097 0.978487097 0.139783871 3 011100

0.928564286 1.083325 1.083325 1.083325 0.154760714 0 1.2 100010

0.838703226 0.978487097 0.978487097 0.978487097 0.559135484 0 2.8 111000

0.43333 1.011103333 1.011103333 1.011103333 0.86666 0 4.333 111110

;

run;

in case of tie a random choice would be fine.

please consider that the dataset is quite big, so in case of different options I would like suggestions on which is less resource intensive.

any input is very appreciated

thank you very much in advance


Accepted Solutions
Solution
‎09-21-2012 05:51 PM
Respected Advisor
Posts: 4,640

Re: find variables whose sum minimizes distance with another variable

I assumed that you meant L1 distance but you could change it for L2 easily :

data before;
input x1-x6 v;
cards;
0.279567742  0.978487097 0.978487097 0.978487097 0.978487097 0.139783871 3
0.928564286 1.083325 1.083325 1.083325 0.154760714 0 1.2
0.838703226 0.978487097 0.978487097 0.978487097 0.559135484 0 2.8
0.43333 1.011103333 1.011103333 1.011103333 0.86666 0 4.333
;
run;

data x(keep=id x1-x6);
array x{6}(1 0 0 0 0 0);
k = 1;
do i = 2 to 2**6;
     id = cats(of x1-x6);
     output;
     rc = graycode(k, of x1-x6);
     end;
run;

data scoring;
set before;
obs + 1; _TYPE_="PARMS"; _NAME_="SUMX";
run;

proc score data=x score=scoring type=PARMS out=scored;
by obs v notsorted;
var x1-x6;
id id;
run;

proc sql;
create table a as
select obs, id
from scored
group by obs
having abs(v-sumx) = min(abs(v-sumx)); /* L1 distance */


create table after(drop=_type_ _name_) as
select s.*, id
from scoring as s, a
where s.obs=a.obs;
quit;

proc sort data=after nodupkey; by obs; run;

PG

PG

View solution in original post


All Replies
Respected Advisor
Posts: 3,124

Re: find variables whose sum minimizes distance with another variable

I have no idea how you come up with variable 'r', if you just want to find out ' (x1-x6) whose sum has a minimum distance with a variable v', then we could have the following approaches, my guess is that data step one maybe faster:

data before;

input x1-x6 v;

cards;

  1. 0.279567742  0.978487097 0.978487097 0.978487097 0.978487097 0.139783871 3
  2. 0.928564286 1.083325 1.083325 1.083325 0.154760714 0 1.2
  3. 0.838703226 0.978487097 0.978487097 0.978487097 0.559135484 0 2.8
  4. 0.43333 1.011103333 1.011103333 1.011103333 0.86666 0 4.333

;

/*SQL*/

proc sql;

select *, abs(sum(x1,x2,x3,x4,x5,x6)-v) as dis from before having dis=min(dis);

quit;

/*data step*/

data want;

retain m_dis;

array t(6) _temporary_;

set before end=last;

array x x1-x6;

   dis=abs(sum(of x1-x6)-v);

   if dis < m_dis then do;

      do i=1 to dim(t);

           t(i)=x(i);

        end;

      end;

   m_dis=min(m_dis,dis);

  if last then do;

     do i=1 to dim(t);

         x(i)=t(i);

       end;

       output;

  end;

  drop m_dis i;

  run;

Haikuo

Contributor
Posts: 43

Re: find variables whose sum minimizes distance with another variable

thank you for the suggestions, guys. but sorry, probably my explanation was not very clear.

I need to find for each record the combination of variables (x1-x6) whose sum is closer to variable v.

so for instance for the first record this is obtained by variables (x2,x3,x4) or (x3,x4,x5) an in that case I can accept one of them randomly, say (x2,x3,x4). In this case variable r should be equal to 011100 signaling which are the variables.

for the second record the distance with variable v is minimized by sum(x1,x5) and in this case r should be 100010 and so on

hope this is clearer

PROC Star
Posts: 7,356

Re: find variables whose sum minimizes distance with another variable

Ciro,

I thought your explanation was quite clear.  Here is one way to do it, but I didn't do anything to optimize the code.

The macro to create the various combinations was stolen from: http://support.sas.com/techsup/technote/ts498.html

I didn't bother to optimize it but, rather, just extended it so that it would get all of your combinations in one call, and then used a datastep to convert the resulting file to represent numbers rather than the characters it was designed to work with.

Possibly others can come up with some suggestions regarding how to optimize the code.  I also didn't include any comments in my own code, as I just to Chang's code and modified it to fit what I had been thinking.

data before;

  input x1-x6 v;

  cards;

0.279567742  0.978487097 0.978487097 0.978487097 0.978487097 0.139783871 3

0.928564286 1.083325 1.083325 1.083325 0.154760714 0 1.2

0.838703226 0.978487097 0.978487097 0.978487097 0.559135484 0 2.8

.2 0.2 0.2 0.2 0.1 0.1 1

0.43333 1.011103333 1.011103333 1.011103333 0.86666 0 4.333

;

run;

%macro combo(r)/parmbuff;

  %let i=2;

  %let things=;

  %do %while (%Qscan(&syspbuff,&i,%STR(,%))) ne );

    %let p&i="%Qscan(&syspbuff,&i,%STR(,%)))";

    %if &i=2 %then %let things=&&p&i;

    %else %let things=&things,&&p&i;

    %let i=%eval(&i+1);

  %end;

  %let n=%eval(&i-2);

  data combo;

    keep v1-v&r;

    array word $8  w1-w&n (&things);

    array rr (*) r1-r&r;

    array v $8  v1-v&r;

    %do r=1 %to 6;

      %do i=1 %to &r;                    /* create the DO LOOPs */

        %if &i=1 %then %do;

          do r&i=1 to &n-(&r-&i);

        %end;

        %else %do;

          do r&i=r%eval(&i-1)+1 to &n-(&r-&i);

        %end;

      %end;

      do k=1 to &r;              /* select subscripted items */

        v(k)=word (rr(k));               /* for a SAS data set */

      end;

      output;                    /* writes to a SAS data set */

      %do i=1 %to &r;

        end;                     /* create ENDs for the DO LOOPs */

      %end;

    %end;

  run;

%mend combo;

%combo (6,1,2,3,4,5,6)

data combo (keep=vSmiley Happy;

  array t[6] $ t1-t6;

  array v[6];

  set combo (rename=(v1-v6=t1-t6));

  do i=1 to 6;

    v=input(t,8.);

  end;

run;

data after (keep=x1-x6 v r);

  array combos_in(*) v1-v6;

  array combos[6,63];

  array diff[63];

  retain combos diff;

  j=0;

  do until (lastcombo);

    set combo end=lastcombo;

    j+1;

    do i=1 to 6;

      combos[i,j]=combos_in;

    end;

  end;

  do until (lastrow);

    set before end=lastrow;

    array x[1:6] x1-x6;

    array xx[6];

    retain x xx;

    *drop i mindist;

    do i=1 to 63;

      call missing(of xx(*));

      do j=1 to 6;

        if not missing(combos[j,i]) then xx=x[combos[j,i]];

      end;

      diff=abs(sum(of xx

  • )-v);
  •     end;

        mindist = min(of diff

  • );
  •     do i=1 to 63;

          if mindist eq diff then do;

            r = repeat("0", 6-1);

            do j=1 to 6;

              if not missing(combos[j,i]) then substr(r, combos[j,i], 1) = "1";

            end;

            leave;

          end;

        end;

        output;

      end;

    run;

    Solution
    ‎09-21-2012 05:51 PM
    Respected Advisor
    Posts: 4,640

    Re: find variables whose sum minimizes distance with another variable

    I assumed that you meant L1 distance but you could change it for L2 easily :

    data before;
    input x1-x6 v;
    cards;
    0.279567742  0.978487097 0.978487097 0.978487097 0.978487097 0.139783871 3
    0.928564286 1.083325 1.083325 1.083325 0.154760714 0 1.2
    0.838703226 0.978487097 0.978487097 0.978487097 0.559135484 0 2.8
    0.43333 1.011103333 1.011103333 1.011103333 0.86666 0 4.333
    ;
    run;

    data x(keep=id x1-x6);
    array x{6}(1 0 0 0 0 0);
    k = 1;
    do i = 2 to 2**6;
         id = cats(of x1-x6);
         output;
         rc = graycode(k, of x1-x6);
         end;
    run;

    data scoring;
    set before;
    obs + 1; _TYPE_="PARMS"; _NAME_="SUMX";
    run;

    proc score data=x score=scoring type=PARMS out=scored;
    by obs v notsorted;
    var x1-x6;
    id id;
    run;

    proc sql;
    create table a as
    select obs, id
    from scored
    group by obs
    having abs(v-sumx) = min(abs(v-sumx)); /* L1 distance */


    create table after(drop=_type_ _name_) as
    select s.*, id
    from scoring as s, a
    where s.obs=a.obs;
    quit;

    proc sort data=after nodupkey; by obs; run;

    PG

    PG
    Respected Advisor
    Posts: 3,124

    Re: find variables whose sum minimizes distance with another variable

    Ok, a strenuous approach:

    data before;

    input x1-x6 v;

    cards;

    0.279567742  0.978487097 0.978487097 0.978487097 0.978487097 0.139783871 3

    0.928564286 1.083325 1.083325 1.083325 0.154760714 0 1.2

    0.838703226 0.978487097 0.978487097 0.978487097 0.559135484 0 2.8

    0.43333 1.011103333 1.011103333 1.011103333 0.86666 0 4.333

    ;

    data want;

    set before;

      array x x1-x6;

      /*one element*/

        do _0=1 to dim(x);

              diff=abs(x(_0)-v);

              if diff<min_diff then pos=cats(_0);

              min_diff=min(min_diff,diff);

            end;

            /*two elements*/

            do _0=1 to dim(x)-1;

              do _1=_0+1 to dim(x);

                 diff=abs(sum(x(_0),x(_1))-v);

                  if diff<min_diff then pos=cats(_0,_1);

                       min_diff=min(min_diff,diff);

              end;

            end;

       /*three elements*/

            do _0=1 to dim(x)-2;

               do _1=_0+1 to dim(x)-1;

                  do _2=_1+1 to dim(x);

                        diff=abs(sum(x(_0),x(_1),x(_2))-v);

                if diff<min_diff then pos=cats(_0,_1,_2);

                         min_diff=min(min_diff,diff);

                      end;

                end;

            end;

        /*four elements*/

                    do _0=1 to dim(x)-3;

               do _1=_0+1 to dim(x)-2;

                  do _2=_1+1 to dim(x)-1;

                         do _3=_2+1 to dim(x);

                        diff=abs(sum(x(_0),x(_1),x(_2),x(_3))-v);

                if diff<min_diff then pos=cats(_0,_1,_2,_3);

                         min_diff=min(min_diff,diff);

                            end;

                      end;

                end;

            end;

                   

       /*Five elements*/

                     do _0=1 to dim(x)-4;

               do _1=_0+1 to dim(x)-3;

                  do _2=_1+1 to dim(x)-2;

                         do _3=_2+1 to dim(x)-1;

                               do _4=_3+1 to dim(x);

                        diff=abs(sum(x(_0),x(_1),x(_2),x(_3),x(_4))-v);

                if diff<min_diff then pos=cats(_0,_1,_2,_3,_4);

                         min_diff=min(min_diff,diff);

                            end;

                            end;

                      end;

                end;

            end;

            /*Six elements*/

            diff=abs(sum(of x(*)));

                if diff<min_diff then pos=cats(1,2,3,4,5,6);

                        min_diff=min(min_diff,diff);

            position='000000';

            do i=1 to lengthn(pos);

            substr(position,substr(pos,i,1),1)='1';

            end;

            drop _: i diff;

    run;

    Haikuo

    Contributor
    Posts: 43

    Re: find variables whose sum minimizes distance with another variable

    hi, thank you very much for introducing me all these tools.

    I am learning a lot studying your examples. As a statistician (and not very good data step programmer) I liked very much the function graycode and the scoring approach, even if It might be not the more performance wise. I might probably mix some of your ideas.

    Respected Advisor
    Posts: 3,124

    Re: find variables whose sum minimizes distance with another variable

    Thank you, PG for introducing graycode() to us and thank you, OP for initiating such an interesting question. IMHO, the use of graycode() wins hands-down, it saves tremendous brain power and makes the  flow of  programming like a happy spring. 

    Here is to use it in data step, just compare the length of my last post, and I am sure it can be better.

    data before;

    input x1-x6 v;

    cards;

    1. 0.279567742  0.978487097 0.978487097 0.978487097 0.978487097 0.139783871 3
    2. 0.928564286 1.083325 1.083325 1.083325 0.154760714 0 1.2
    3. 0.838703226 0.978487097 0.978487097 0.978487097 0.559135484 0 2.8
    4. 0.43333 1.011103333 1.011103333 1.011103333 0.86666 0 4.333

    ;

    data want;

    set before;

       array x x1-x6;

       array _y(6) (1 1 1 1 1 1);

       _k=0;

       do _i=1 to 2**dim(_y)-1;

          _sum=0;

          _rc=graycode(_k, of _y

  • );
  •       do _j=1 to dim(_y);

               _sum+x(_j)*_y(_j);

            end;

            _diff=abs(_sum-v);

            if _diff < min_diff then pos=cats(of _y(*));

            min_diff=min(min_diff,_diff);

       end;

       drop _:;

    run;

    proc print;run;

    Haikuo

    PROC Star
    Posts: 7,356

    Re: find variables whose sum minimizes distance with another variable

    If it were me I would definitely adopt Haikuo's latest suggested code (i.e., the one using the graycode function).  It is parsimonious, easy to read, and runs at least as fast (or faster) than any of the other suggested solutions.

    While we all got to learn some new things from PGStat's and Ksharp's solutions, the performance of their respective solutions don't even begin to compare with Haikuo's code.

    Super Contributor
    Posts: 1,636

    Re: find variables whose sum minimizes distance with another variable

    Art,

    What is graycode function? I googled but nothing relevant came up.  Thank you!

    Respected Advisor
    Posts: 3,124

    Re: find variables whose sum minimizes distance with another variable

    LinLin,

    http://support.sas.com/documentation/cdl/en/lefunctionsref/63354/HTML/default/viewer.htm#p1myrepjl95...

    Art,

    I had to look up 'parsimonious'. Luckily, I probably don't have to take another GRE any more.

    Haikuo

    Super Contributor
    Posts: 1,636

    Re: find variables whose sum minimizes distance with another variable

    Thank you Haikuo! - Linlin

    PROC Star
    Posts: 7,356

    Re: find variables whose sum minimizes distance with another variable

    Haikuo, I first had 'simplistic' in the sentence but was concerned that the term might be taken the wrong way.  I was very impressed with your suggested code.

    PROC Star
    Posts: 7,356

    Re: find variables whose sum minimizes distance with another variable

    Linlin:  take a look at:

    http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a003121538.htm

    Just FYI, you should have been able to find it with the following search string on Google:

    graycode site:support.sas.com

    Super Contributor
    Posts: 1,636

    Re: find variables whose sum minimizes distance with another variable

    Thank you Art! I searched "graycode function,sas" .

    ☑ This topic is SOLVED.

    Need further help from the community? Please ask a new question.

    Discussion stats
    • 29 replies
    • 1174 views
    • 3 likes
    • 7 in conversation