BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
ciro
Quartz | Level 8

hi everybody,

I need to find the variables in an array (x1-x6) whose sum has a minimum distance with a variable v. for instance to the following dataset

data before;

input x1-x6 v;

cards;

0.279567742  0.978487097 0.978487097 0.978487097 0.978487097 0.139783871 3

0.928564286 1.083325 1.083325 1.083325 0.154760714 0 1.2

0.838703226 0.978487097 0.978487097 0.978487097 0.559135484 0 2.8

0.43333 1.011103333 1.011103333 1.011103333 0.86666 0 4.333

;

run;

I would like to add a string that indicates which variables provide the most similar value to variable v such as:

data after;

input x1-x6 v r $;

cards;

0.279567742 0.978487097 0.978487097 0.978487097 0.978487097 0.139783871 3 011100

0.928564286 1.083325 1.083325 1.083325 0.154760714 0 1.2 100010

0.838703226 0.978487097 0.978487097 0.978487097 0.559135484 0 2.8 111000

0.43333 1.011103333 1.011103333 1.011103333 0.86666 0 4.333 111110

;

run;

in case of tie a random choice would be fine.

please consider that the dataset is quite big, so in case of different options I would like suggestions on which is less resource intensive.

any input is very appreciated

thank you very much in advance

1 ACCEPTED SOLUTION

Accepted Solutions
PGStats
Opal | Level 21

I assumed that you meant L1 distance but you could change it for L2 easily :

data before;
input x1-x6 v;
cards;
0.279567742  0.978487097 0.978487097 0.978487097 0.978487097 0.139783871 3
0.928564286 1.083325 1.083325 1.083325 0.154760714 0 1.2
0.838703226 0.978487097 0.978487097 0.978487097 0.559135484 0 2.8
0.43333 1.011103333 1.011103333 1.011103333 0.86666 0 4.333
;
run;

data x(keep=id x1-x6);
array x{6}(1 0 0 0 0 0);
k = 1;
do i = 2 to 2**6;
     id = cats(of x1-x6);
     output;
     rc = graycode(k, of x1-x6);
     end;
run;

data scoring;
set before;
obs + 1; _TYPE_="PARMS"; _NAME_="SUMX";
run;

proc score data=x score=scoring type=PARMS out=scored;
by obs v notsorted;
var x1-x6;
id id;
run;

proc sql;
create table a as
select obs, id
from scored
group by obs
having abs(v-sumx) = min(abs(v-sumx)); /* L1 distance */


create table after(drop=_type_ _name_) as
select s.*, id
from scoring as s, a
where s.obs=a.obs;
quit;

proc sort data=after nodupkey; by obs; run;

PG

PG

View solution in original post

29 REPLIES 29
Haikuo
Onyx | Level 15

I have no idea how you come up with variable 'r', if you just want to find out ' (x1-x6) whose sum has a minimum distance with a variable v', then we could have the following approaches, my guess is that data step one maybe faster:

data before;

input x1-x6 v;

cards;

  1. 0.279567742  0.978487097 0.978487097 0.978487097 0.978487097 0.139783871 3
  2. 0.928564286 1.083325 1.083325 1.083325 0.154760714 0 1.2
  3. 0.838703226 0.978487097 0.978487097 0.978487097 0.559135484 0 2.8
  4. 0.43333 1.011103333 1.011103333 1.011103333 0.86666 0 4.333

;

/*SQL*/

proc sql;

select *, abs(sum(x1,x2,x3,x4,x5,x6)-v) as dis from before having dis=min(dis);

quit;

/*data step*/

data want;

retain m_dis;

array t(6) _temporary_;

set before end=last;

array x x1-x6;

   dis=abs(sum(of x1-x6)-v);

   if dis < m_dis then do;

      do i=1 to dim(t);

           t(i)=x(i);

        end;

      end;

   m_dis=min(m_dis,dis);

  if last then do;

     do i=1 to dim(t);

         x(i)=t(i);

       end;

       output;

  end;

  drop m_dis i;

  run;

Haikuo

ciro
Quartz | Level 8

thank you for the suggestions, guys. but sorry, probably my explanation was not very clear.

I need to find for each record the combination of variables (x1-x6) whose sum is closer to variable v.

so for instance for the first record this is obtained by variables (x2,x3,x4) or (x3,x4,x5) an in that case I can accept one of them randomly, say (x2,x3,x4). In this case variable r should be equal to 011100 signaling which are the variables.

for the second record the distance with variable v is minimized by sum(x1,x5) and in this case r should be 100010 and so on

hope this is clearer

art297
Opal | Level 21

Ciro,

I thought your explanation was quite clear.  Here is one way to do it, but I didn't do anything to optimize the code.

The macro to create the various combinations was stolen from: http://support.sas.com/techsup/technote/ts498.html

I didn't bother to optimize it but, rather, just extended it so that it would get all of your combinations in one call, and then used a datastep to convert the resulting file to represent numbers rather than the characters it was designed to work with.

Possibly others can come up with some suggestions regarding how to optimize the code.  I also didn't include any comments in my own code, as I just to Chang's code and modified it to fit what I had been thinking.

data before;

  input x1-x6 v;

  cards;

0.279567742  0.978487097 0.978487097 0.978487097 0.978487097 0.139783871 3

0.928564286 1.083325 1.083325 1.083325 0.154760714 0 1.2

0.838703226 0.978487097 0.978487097 0.978487097 0.559135484 0 2.8

.2 0.2 0.2 0.2 0.1 0.1 1

0.43333 1.011103333 1.011103333 1.011103333 0.86666 0 4.333

;

run;

%macro combo(r)/parmbuff;

  %let i=2;

  %let things=;

  %do %while (%Qscan(&syspbuff,&i,%STR(,%))) ne );

    %let p&i="%Qscan(&syspbuff,&i,%STR(,%)))";

    %if &i=2 %then %let things=&&p&i;

    %else %let things=&things,&&p&i;

    %let i=%eval(&i+1);

  %end;

  %let n=%eval(&i-2);

  data combo;

    keep v1-v&r;

    array word $8  w1-w&n (&things);

    array rr (*) r1-r&r;

    array v $8  v1-v&r;

    %do r=1 %to 6;

      %do i=1 %to &r;                    /* create the DO LOOPs */

        %if &i=1 %then %do;

          do r&i=1 to &n-(&r-&i);

        %end;

        %else %do;

          do r&i=r%eval(&i-1)+1 to &n-(&r-&i);

        %end;

      %end;

      do k=1 to &r;              /* select subscripted items */

        v(k)=word (rr(k));               /* for a SAS data set */

      end;

      output;                    /* writes to a SAS data set */

      %do i=1 %to &r;

        end;                     /* create ENDs for the DO LOOPs */

      %end;

    %end;

  run;

%mend combo;

%combo (6,1,2,3,4,5,6)

data combo (keep=v:);

  array t[6] $ t1-t6;

  array v[6];

  set combo (rename=(v1-v6=t1-t6));

  do i=1 to 6;

    v=input(t,8.);

  end;

run;

data after (keep=x1-x6 v r);

  array combos_in(*) v1-v6;

  array combos[6,63];

  array diff[63];

  retain combos diff;

  j=0;

  do until (lastcombo);

    set combo end=lastcombo;

    j+1;

    do i=1 to 6;

      combos[i,j]=combos_in;

    end;

  end;

  do until (lastrow);

    set before end=lastrow;

    array x[1:6] x1-x6;

    array xx[6];

    retain x xx;

    *drop i mindist;

    do i=1 to 63;

      call missing(of xx(*));

      do j=1 to 6;

        if not missing(combos[j,i]) then xx=x[combos[j,i]];

      end;

      diff=abs(sum(of xx

  • )-v);
  •     end;

        mindist = min(of diff

  • );
  •     do i=1 to 63;

          if mindist eq diff then do;

            r = repeat("0", 6-1);

            do j=1 to 6;

              if not missing(combos[j,i]) then substr(r, combos[j,i], 1) = "1";

            end;

            leave;

          end;

        end;

        output;

      end;

    run;

    PGStats
    Opal | Level 21

    I assumed that you meant L1 distance but you could change it for L2 easily :

    data before;
    input x1-x6 v;
    cards;
    0.279567742  0.978487097 0.978487097 0.978487097 0.978487097 0.139783871 3
    0.928564286 1.083325 1.083325 1.083325 0.154760714 0 1.2
    0.838703226 0.978487097 0.978487097 0.978487097 0.559135484 0 2.8
    0.43333 1.011103333 1.011103333 1.011103333 0.86666 0 4.333
    ;
    run;

    data x(keep=id x1-x6);
    array x{6}(1 0 0 0 0 0);
    k = 1;
    do i = 2 to 2**6;
         id = cats(of x1-x6);
         output;
         rc = graycode(k, of x1-x6);
         end;
    run;

    data scoring;
    set before;
    obs + 1; _TYPE_="PARMS"; _NAME_="SUMX";
    run;

    proc score data=x score=scoring type=PARMS out=scored;
    by obs v notsorted;
    var x1-x6;
    id id;
    run;

    proc sql;
    create table a as
    select obs, id
    from scored
    group by obs
    having abs(v-sumx) = min(abs(v-sumx)); /* L1 distance */


    create table after(drop=_type_ _name_) as
    select s.*, id
    from scoring as s, a
    where s.obs=a.obs;
    quit;

    proc sort data=after nodupkey; by obs; run;

    PG

    PG
    Haikuo
    Onyx | Level 15

    Ok, a strenuous approach:

    data before;

    input x1-x6 v;

    cards;

    0.279567742  0.978487097 0.978487097 0.978487097 0.978487097 0.139783871 3

    0.928564286 1.083325 1.083325 1.083325 0.154760714 0 1.2

    0.838703226 0.978487097 0.978487097 0.978487097 0.559135484 0 2.8

    0.43333 1.011103333 1.011103333 1.011103333 0.86666 0 4.333

    ;

    data want;

    set before;

      array x x1-x6;

      /*one element*/

        do _0=1 to dim(x);

              diff=abs(x(_0)-v);

              if diff<min_diff then pos=cats(_0);

              min_diff=min(min_diff,diff);

            end;

            /*two elements*/

            do _0=1 to dim(x)-1;

              do _1=_0+1 to dim(x);

                 diff=abs(sum(x(_0),x(_1))-v);

                  if diff<min_diff then pos=cats(_0,_1);

                       min_diff=min(min_diff,diff);

              end;

            end;

       /*three elements*/

            do _0=1 to dim(x)-2;

               do _1=_0+1 to dim(x)-1;

                  do _2=_1+1 to dim(x);

                        diff=abs(sum(x(_0),x(_1),x(_2))-v);

                if diff<min_diff then pos=cats(_0,_1,_2);

                         min_diff=min(min_diff,diff);

                      end;

                end;

            end;

        /*four elements*/

                    do _0=1 to dim(x)-3;

               do _1=_0+1 to dim(x)-2;

                  do _2=_1+1 to dim(x)-1;

                         do _3=_2+1 to dim(x);

                        diff=abs(sum(x(_0),x(_1),x(_2),x(_3))-v);

                if diff<min_diff then pos=cats(_0,_1,_2,_3);

                         min_diff=min(min_diff,diff);

                            end;

                      end;

                end;

            end;

                   

       /*Five elements*/

                     do _0=1 to dim(x)-4;

               do _1=_0+1 to dim(x)-3;

                  do _2=_1+1 to dim(x)-2;

                         do _3=_2+1 to dim(x)-1;

                               do _4=_3+1 to dim(x);

                        diff=abs(sum(x(_0),x(_1),x(_2),x(_3),x(_4))-v);

                if diff<min_diff then pos=cats(_0,_1,_2,_3,_4);

                         min_diff=min(min_diff,diff);

                            end;

                            end;

                      end;

                end;

            end;

            /*Six elements*/

            diff=abs(sum(of x(*)));

                if diff<min_diff then pos=cats(1,2,3,4,5,6);

                        min_diff=min(min_diff,diff);

            position='000000';

            do i=1 to lengthn(pos);

            substr(position,substr(pos,i,1),1)='1';

            end;

            drop _: i diff;

    run;

    Haikuo

    ciro
    Quartz | Level 8

    hi, thank you very much for introducing me all these tools.

    I am learning a lot studying your examples. As a statistician (and not very good data step programmer) I liked very much the function graycode and the scoring approach, even if It might be not the more performance wise. I might probably mix some of your ideas.

    Haikuo
    Onyx | Level 15

    Thank you, PG for introducing graycode() to us and thank you, OP for initiating such an interesting question. IMHO, the use of graycode() wins hands-down, it saves tremendous brain power and makes the  flow of  programming like a happy spring. 

    Here is to use it in data step, just compare the length of my last post, and I am sure it can be better.

    data before;

    input x1-x6 v;

    cards;

    1. 0.279567742  0.978487097 0.978487097 0.978487097 0.978487097 0.139783871 3
    2. 0.928564286 1.083325 1.083325 1.083325 0.154760714 0 1.2
    3. 0.838703226 0.978487097 0.978487097 0.978487097 0.559135484 0 2.8
    4. 0.43333 1.011103333 1.011103333 1.011103333 0.86666 0 4.333

    ;

    data want;

    set before;

       array x x1-x6;

       array _y(6) (1 1 1 1 1 1);

       _k=0;

       do _i=1 to 2**dim(_y)-1;

          _sum=0;

          _rc=graycode(_k, of _y

  • );
  •       do _j=1 to dim(_y);

               _sum+x(_j)*_y(_j);

            end;

            _diff=abs(_sum-v);

            if _diff < min_diff then pos=cats(of _y(*));

            min_diff=min(min_diff,_diff);

       end;

       drop _:;

    run;

    proc print;run;

    Haikuo

    art297
    Opal | Level 21

    If it were me I would definitely adopt Haikuo's latest suggested code (i.e., the one using the graycode function).  It is parsimonious, easy to read, and runs at least as fast (or faster) than any of the other suggested solutions.

    While we all got to learn some new things from PGStat's and Ksharp's solutions, the performance of their respective solutions don't even begin to compare with Haikuo's code.

    Linlin
    Lapis Lazuli | Level 10

    Art,

    What is graycode function? I googled but nothing relevant came up.  Thank you!

    Haikuo
    Onyx | Level 15

    LinLin,

    http://support.sas.com/documentation/cdl/en/lefunctionsref/63354/HTML/default/viewer.htm#p1myrepjl95...

    Art,

    I had to look up 'parsimonious'. Luckily, I probably don't have to take another GRE any more.

    Haikuo

    Linlin
    Lapis Lazuli | Level 10

    Thank you Haikuo! - Linlin

    art297
    Opal | Level 21

    Haikuo, I first had 'simplistic' in the sentence but was concerned that the term might be taken the wrong way.  I was very impressed with your suggested code.

    art297
    Opal | Level 21

    Linlin:  take a look at:

    http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a003121538.htm

    Just FYI, you should have been able to find it with the following search string on Google:

    graycode site:support.sas.com

    Linlin
    Lapis Lazuli | Level 10

    Thank you Art! I searched "graycode function,sas" .

    sas-innovate-2024.png

    Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

    Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

     

    Register now!

    How to Concatenate Values

    Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

    Find more tutorials on the SAS Users YouTube channel.

    Click image to register for webinarClick image to register for webinar

    Classroom Training Available!

    Select SAS Training centers are offering in-person courses. View upcoming courses for:

    View all other training opportunities.

    Discussion stats
    • 29 replies
    • 3063 views
    • 3 likes
    • 7 in conversation