Solved: Re: find variables whose sum minimizes distance with another variable - Page 2

art297 · Posted 09-24-2012 04:22 PM

Linlin: This also would have worked:

"graycode function" sas

MikeZdeb · Posted 09-24-2012 06:32 PM

hi ... never having seen the GRAYCODE function, I was interested in your code versus that posted by Pierre since they differed in the initialization of the 0/1 array and in the initial value of the variable K and I'd wanted to know what happens when those values change (did it make any difference)

I added a few lines of code to your data step and created a data set (HAIKUO) of the 63 different combinations of 0/1 in the array _Y that get used to find the minimum difference

I compared the 63 combinations to those generated by Pierre's posted code (data set PIERRE) ... I think I got this correct ...

* HAIKUO posted code ... make a new data set with ID and _Y1 - _Y6;

data want (drop=_:) haikuo (keep=id _y1-_y6);

set before;

array x x1-x6;

array _y(6) (1 1 1 1 1 1);

_k=0;

do _i=1 to 2**dim(_y)-1;

_sum=0;

_rc=graycode(_k, of _y(*));

* create variable ID from current values of _Y1 - _Y6 and write to data set;

id = cats(of _y(*));

if _n_ eq 1 then output haikuo;

do _j=1 to dim(_y);

_sum+x(_j)*_y(_j);

end;

_diff=abs(_sum-v);

if _diff < min_diff then pos=cats(of _y(*));

min_diff=min(min_diff,_diff);

end;

output want;

run;

* Pierre's code ... create 0/1 variables;

data pierre (keep=id x1-x6);

array x{6}(1 0 0 0 0 0);

k = 1;

do i = 2 to 2**6;

id = cats(of x1-x6);

output;

rc = graycode(k, of x1-x6);

end;

run;

* merge data sets, look for differences;

proc sort data=haikuo;

by id;

run;

proc sort data=pierre;

by id;

run;

data not_the_same;

merge haikuo (in=h) pierre (in=p);

by id;

if not (h and p);

run;

there are differences ...

_y1 _y2 _y3 _y4 _y5 _y6   id      x1 x2 x3 x4 x5 x6
 0   0   0   0   0   0    000000   .  .  .  .  .  .
 .   .   .   .   .   .    000001   0  0  0  0  0  1
 .   .   .   .   .   .    100001   1  0  0  0  0  1

so the HAIKUO code evaluates the combination '000000', which is not evaluated by Pierre's code

and it never uses two of the possible combinations, '000001' and '100001' (they are used by Pierre's code)

there is also one combination that gets evaluated twice (a duplicate ... no dups in Pierre's data) ...

proc sort data=haikuo out=h dupout=dp nodupkey;

by id;

run;

data set DP ...

_y1 _y2 _y3 _y4 _y5 _y6 id

0 1 1 1 1 1 011111
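For anyone who wants to run the same kind of check outside SAS, here is a minimal Python sketch of the comparison done above with MERGE and NODUPKEY/DUPOUT: which ids appear in only one enumeration, and which ids are repeated. The function name `compare_enumerations` and the two-bit ids are made up for illustration; they are not output from the GRAYCODE runs above.

```python
from collections import Counter


def compare_enumerations(a, b):
    """Compare two lists of 0/1 id strings.

    Returns (ids only in a, ids only in b, ids that occur more than once in a).
    """
    only_a = sorted(set(a) - set(b))
    only_b = sorted(set(b) - set(a))
    dups_a = sorted(k for k, count in Counter(a).items() if count > 1)
    return only_a, only_b, dups_a


# Hypothetical two-bit example, not real GRAYCODE output.
a = ["00", "01", "11", "11"]
b = ["01", "11", "10"]
print(compare_enumerations(a, b))  # (['00'], ['10'], ['11'])
```

The set differences play the role of the `if not (h and p)` filter, and the counter plays the role of the DUPOUT= data set.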

my conclusion is that I'm not sure that I understand GRAYCODE ... it remains another of my SAS "gray areas"

ps disclaimer ... I did say that "I think I got this correct"

art297 · Posted 09-24-2012 07:00 PM

Mike: I don't think (again, like you, I'm only thinking on this thread) that Haikuo's code is wrong. If you run PGStats' code over multiple iterations of the same data, sometimes it accepts zeros as part of the needed calculations and sometimes it doesn't.

The graycode function is simply building a set of all combinations. How it is applied, on the other hand, is what I (think) you are questioning.

MikeZdeb · Posted 09-24-2012 07:54 PM

hi ... if you are not sure that all the possible combinations are tested, can you rely on the answer?

art297 · Posted 09-24-2012 08:56 PM

Mike,

Upon further review, the following appears to produce all possible combinations:

data want;

array _y(6) (1 1 1 1 1 1);

_k=-1;

do _i=1 to 2**dim(_y);

_rc=graycode(_k, of _y(*));

pos=cats(of _y(*));

if _rc ne 0 then output;

end;

run;

/*check*/

proc sort data=want nodupkey;

by pos;

run;

Haikuo · Posted 09-24-2012 07:56 PM

Thanks, Mike, for the in-depth following up. It is rather puzzling, at least for me right now. Here is the quote from SAS doc:

"To generate all subsets of n items, you can initialize k to a negative value and execute GRAYCODE in a loop that iterates 2**n times. If you want to start with a non-empty subset, then initialize k to be the number of items in the subset, initialize the other arguments to specify the desired initial subset, and execute GRAYCODE in a loop that iterates 2**n-1 times. The sequence of subsets that are generated by GRAYCODE is cyclical, so you can begin with any subset that you want."

So it seems to me that if I initialize k=0, it should exhaust all of the combinations except the null set (000000) in 2**6-1 iterations. In practice, however, it not only generates the null set (000000) but is also not really "cyclical", so it misses some combinations, and it appears that if I set k>0, it misses even more.

Given the condition, the fix should be simple:

data want;

set before;

array x x1-x6;

array _y(6) (1 1 1 1 1 1);

_k=-1;

do _i=1 to 2**dim(_y);

_sum=0;

_rc=graycode(_k, of _y(*));

do _j=1 to dim(_y);

if _rc >0 then _sum+x(_j)*_y(_j);

end;

_diff=abs(_sum-v);

if _diff < min_diff then pos=cats(of _y(*));

min_diff=min(min_diff,_diff);

end;

drop _:;

run;

Haikuo, in gray zone, too

MikeZdeb · Posted 09-24-2012 09:44 PM

hi ... OK, maybe not so gray anymore

another idea, a tiny bit slower, but no gray area involved

also, I think you need to initialize MIN_DIFF since if the first combination of 0/1 generates the minimum difference, no value is ever assigned to POS (yes/no?) ...

data want;

set before;

array x(6);

* initialize once per observation, before the loop, so it is not reset on every combination;

min_diff = 1e6;

do _i=1 to 2**dim(x)-1;

_sum=0;

do _j=1 to dim(x);

_sum+x(_j)*input(char(put(_i,binary6.),_j),1.);

end;

_diff=abs(_sum-v);

if _diff < min_diff then pos=put(_i,binary6.);

min_diff=min(min_diff,_diff);

end;

drop _: ;

run;
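As a cross-check on the logic of the binary-counter approach, here is a minimal Python sketch: count from 1 to 2**n-1, treat the zero-padded binary string as the 0/1 selection, and keep the first selection with the smallest absolute difference. The function name `closest_subset` and the sample values are assumptions for illustration. The running minimum is set once, before the loop, which is the initialization point raised above.

```python
def closest_subset(x, v):
    """Return (pos, min_diff) for the non-empty subset of x whose sum is closest to v.

    pos is a 0/1 string; character j is '1' when x[j] is included.
    Ties go to the first subset met in counting order.
    """
    n = len(x)
    min_diff = float("inf")  # initialized once, before the loop
    pos = None
    for i in range(1, 2 ** n):
        bits = format(i, f"0{n}b")  # same role as put(_i, binary6.)
        total = sum(xj for xj, b in zip(x, bits) if b == "1")
        diff = abs(total - v)
        if diff < min_diff:
            min_diff, pos = diff, bits
    return pos, min_diff


print(closest_subset([3, 3, 3, 3, 1, 1], 2))  # ('000011', 0)
```

Because the comparison is strict, a later subset with the same difference does not replace the first one found.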

Haikuo · Posted 09-24-2012 10:35 PM

Quote "I think you need to initialize MIN_DIFF since if the first combination of 0/1 generates the minimum difference, no value is ever assigned to POS"

Absolutely agreed. I want you to be my SAS instructor, but I need someone else to score my exam :smileysilly:

That being said, the code can be made more robust by avoiding the artificial large number '1e6'. The modification would include:

1. initialize "min_diff=abs(x1-v)" ;

2. change "if _diff<min_diff" to "if _diff<=min_diff".

Yes/no?

Thanks, Mike, again for your sharp insight.

Haikuo

Update: abs() applied.

art297 · Posted 09-25-2012 08:59 AM

Haikuo,

I would initialize with: min_diff=CONSTANT('BIG')

At least, that way, you would know that SAS couldn't accurately represent a larger number.

Haikuo · Posted 09-25-2012 09:50 AM

Thanks, Art. It is not even 9am central time, and I have already learned one new function: constant(). It is going to be a good day.

Haikuo

art297 · Posted 09-25-2012 09:59 AM

Mike,

Below are slightly modified versions of both your and Haikuo's code and a dataset containing all 63 combinations. Both appear to end up with the same results, but Haikuo's code runs almost three times faster.

On my machine, given 63,000 records, Haikuo's code ran in 5.5 seconds.

data before (drop=i sequence id);

informat sequence $11.;

input x1-x6 v sequence id;

do i=1 to 1000;output;end;

cards;

2 2 2 2 2 1 1 6 1

2 2 2 2 1 2 1 5 2

3 3 3 3 1 1 2 5-6 3

2 2 2 1 2 2 1 4 4

3 3 3 1 3 1 2 4-6 5

3 3 3 1 1 3 2 4-5 6

4 4 4 1 1 1 3 4-5-6 7

2 2 1 2 2 2 1 3 8

3 3 1 3 3 1 2 3-6 9

3 3 1 3 1 3 2 3-5 10

4 4 1 4 1 1 3 3-5-6 11

3 3 1 1 3 3 2 3-4 12

4 4 1 1 4 1 3 3-4-6 13

4 4 1 1 1 4 3 3-4-5 14

5 5 1 1 1 1 4 3-4-5-6 15

2 1 2 2 2 2 1 2 16

3 1 3 3 3 1 2 2-6 17

3 1 3 3 1 3 2 2-5 18

4 1 4 4 1 1 3 2-5-6 19

3 1 3 1 3 3 2 2-4 20

4 1 4 1 4 1 3 2-4-6 21

4 1 4 1 1 4 3 2-4-5 22

5 1 5 1 1 1 4 2-4-5-6 23

3 1 1 3 3 3 2 2-3 24

4 1 1 4 4 1 3 2-3-6 25

4 1 1 4 1 4 3 2-3-5 26

5 1 1 5 1 1 4 2-3-5-6 27

4 1 1 1 4 4 3 2-3-4 28

5 1 1 1 5 1 4 2-3-4-6 29

5 1 1 1 1 5 4 2-3-4-5 30

6 1 1 1 1 1 5 2-3-4-5-6 31

1 2 2 2 2 2 1 1 32

1 3 3 3 3 1 2 1-6 33

1 3 3 3 1 3 2 1-5 34

1 4 4 4 1 1 3 1-5-6 35

1 3 3 1 3 3 2 1-4 36

1 4 4 1 4 1 3 1-4-6 37

1 4 4 1 1 4 3 1-4-5 38

1 5 5 1 1 1 4 1-4-5-6 39

1 3 1 3 3 3 2 1-3 40

1 4 1 4 4 1 3 1-3-6 41

1 4 1 4 1 4 3 1-3-5 42

1 5 1 5 1 1 4 1-3-5-6 43

1 4 1 1 4 4 3 1-3-4 44

1 5 1 1 5 1 4 1-3-4-6 45

1 5 1 1 1 5 4 1-3-4-5 46

1 6 1 1 1 1 5 1-3-4-5-6 47

1 1 3 3 3 3 2 1-2 48

1 1 4 4 4 1 3 1-2-6 49

1 1 4 4 1 4 3 1-2-5 50

1 1 5 5 1 1 4 1-2-5-6 51

1 1 4 1 4 4 3 1-2-4 52

1 1 5 1 5 1 4 1-2-4-6 53

1 1 5 1 1 5 4 1-2-4-5 54

1 1 6 1 1 1 5 1-2-4-5-6 55

1 1 1 4 4 4 3 1-2-3 56

1 1 1 5 5 1 4 1-2-3-6 57

1 1 1 5 1 5 4 1-2-3-5 58

1 1 1 6 1 1 5 1-2-3-5-6 59

1 1 1 1 5 5 4 1-2-3-4 60

1 1 1 1 6 1 5 1-2-3-4-6 61

1 1 1 1 1 6 5 1-2-3-4-5 62

1 1 1 1 1 1 6 1-2-3-4-5-6 63

;

data after_mike;

set before;

array x(6);

min_diff = constant('big');

do _i=1 to 2**dim(x)-1;

_sum=0;

/* min_diff = 1e6;*/

do _j=1 to dim(x);

_sum+x(_j)*input(char(put(_i,binary6.),_j),1.);

end;

_diff=abs(_sum-v);

if _diff < min_diff then pos=put(_i,binary6.);

min_diff=min(min_diff,_diff);

end;

drop _: ;

run;

data after_haikuo;

set before;

array x x1-x6;

array _y(6) (1 1 1 1 1 1);

min_diff = constant('big');

_k=-1;

do _i=1 to 2**dim(_y);

_sum=0;

_rc=graycode(_k, of _y(*));

if _rc gt 0 then do _j=1 to dim(_y);

_sum+x(_j)*_y(_j);

end;

_diff=abs(_sum-v);

if _diff < min_diff then do;

pos=cats(of _y(*));

min_diff=_diff;

end;

if _diff eq 0 then leave;

end;

drop _:;

run;

MikeZdeb · Posted 09-25-2012 09:43 PM

hi Art ... small changes (generate binary sequence once, use logical value versus the INPUT + PUT stuff)..

data after_mike;

set before;

array x(6);

min_diff = constant('big');

do _i=1 to 2**dim(x)-1;

_sum=0;

_seq = put(_i, binary6.);

do _j=1 to dim(x);

_sum+x(_j)*(char(_seq,_j) eq '1');

end;

_diff=abs(_sum-v);

if _diff < min_diff then pos=put(_i,binary6.);

min_diff=min(min_diff,_diff);

end;

drop _: ;

run;

test with 400,000 observations (using original posted data just ramped up a bit) ...

NOTE: There were 400000 observations read from the data set WORK.BEFORE.

NOTE: The data set WORK.AFTER_MIKE has 400000 observations and 9 variables.

NOTE: DATA statement used (Total process time):

real time 20.56 seconds

cpu time 20.54 seconds

versus ...

NOTE: There were 400000 observations read from the data set WORK.BEFORE.

NOTE: The data set WORK.AFTER_HAIKUO has 400000 observations and 9 variables.

NOTE: DATA statement used (Total process time):

real time 17.06 seconds

cpu time 16.82 seconds

not that much slower now and no "gray area" in that I know exactly what sequence of 0s and 1s is used each time (but ... still liked learning there's a GRAYCODE function)

ps ... with only 60,000, not much difference ...

NOTE: There were 60000 observations read from the data set WORK.BEFORE.

NOTE: The data set WORK.AFTER_MIKE has 60000 observations and 9 variables.

NOTE: DATA statement used (Total process time):

real time 3.09 seconds

cpu time 3.09 seconds

NOTE: There were 60000 observations read from the data set WORK.BEFORE.

NOTE: The data set WORK.AFTER_HAIKUO has 60000 observations and 9 variables.

NOTE: DATA statement used (Total process time):

real time 2.54 seconds

cpu time 2.51 seconds

AND, EVEN MORE INTERESTING, I changed this ... min_diff = constant('big'); back to this ... min_diff = 1e6;

and the jobs run a LOT faster ...

after_mike

60,000 obs now runs in 1.84 seconds CPU (vs 3.09)

400,000 obs now runs in 11.93 seconds CPU (vs 20.54)

after_haikuo

60,000 obs now runs in 1.29 seconds CPU (vs 2.51)

400,000 obs now runs in 8.54 seconds CPU (vs 16.82)

so, you pay a price using the CONSTANT function when all you really need is a very large number (yes/no?)

art297 · Posted 09-26-2012 11:02 AM

Mike, this gets more interesting with each discovery. Yes, I too was quite surprised at the effect of including the CONSTANT function. However, that can be alleviated by simply assigning the value to a macro variable just before the data step.

I did that in the code below, and (for a reason I can't explain) Haikuo's code experienced a much greater benefit. Regarding your comment about differences based on number of records, in both cases the difference was about 22 to 23% (on your machine). On my machine, the difference was around 35%.

However, with the new change, the difference is now 167%. Using your code, 4.75 seconds, using Haikuo's code, 1.78 seconds.

%let min_diff=%sysfunc(constant(BIG));

data after_mike;

set before;

array x(6);

min_diff = &min_diff.;

do _i=1 to 2**dim(x)-1;

_sum=0;

_seq = put(_i, binary6.);

do _j=1 to dim(x);

_sum+x(_j)*(char(_seq,_j) eq '1');

end;

_diff=abs(_sum-v);

if _diff < min_diff then pos=put(_i,binary6.);

min_diff=min(min_diff,_diff);

end;

drop _: ;

run;

data after_haikuo;

set before;

array x x1-x6;

array _y(6) (1 1 1 1 1 1);

min_diff = &min_diff.;

_k=-1;

do _i=1 to 2**dim(_y);

_sum=0;

_rc=graycode(_k, of _y(*));

if _rc gt 0 then do _j=1 to dim(_y);

_sum+x(_j)*_y(_j);

end;

_diff=abs(_sum-v);

if _diff < min_diff then do;

pos=cats(of _y(*));

min_diff=_diff;

end;

if _diff eq 0 then leave;

end;

drop _:;

run;

Ksharp · Posted 09-24-2012 10:20 AM

It looks like I am late.

data have;
input var1-var6 c;
cards;
0.279567742  0.978487097 0.978487097 0.978487097 0.978487097 0.139783871 3
0.928564286 1.083325 1.083325 1.083325 0.154760714 0 1.2
0.838703226 0.978487097 0.978487097 0.978487097 0.559135484 0 2.8
0.43333 1.011103333 1.011103333 1.011103333 0.86666 0 4.333
;
run;






data x;
input (var1-var6) (: $40.);
cards;
var1 var2 var3 var4 var5 var6
;
run;
proc summary data=x ;
class var1-var6;
output out=temp(drop=_:);
run;
data exp(keep=expression _expression);
set temp end=last;
length expression _expression $ 32767;
retain expression;
expression=catx(' ',expression, catx('+',of var1-var6));
if last then do; _expression=expression;output;end;
run;
data have;
 set have;
if _n_ eq 1 then set exp;
run;

data want;
 set have;
 min=999999;
 array _v{*} var1-var6 ;
 flag=repeat('0',dim(_v)-1);
 do i=1 to dim(_v);
  expression=tranwrd(expression,strip(vname(_v{i})),strip(_v{i}));
 end;

 j=1;
 temp=scan(expression,j,' ');
 do while(not missing(temp));
  sum=resolve(cats('%sysevalf(',temp,')'));
  diff=abs(sum-c); 
  if diff lt min then do; want=scan(_expression,j,' ');min=diff; 
end;
  j+1;
  temp=scan(expression,j,' ');
 end;

 k=1;
 _temp=scan(want,k,'var+');
 do while(not missing(_temp));
 substr(flag,input(_temp,best8.),1)='1';
 k+1;
 _temp=scan(want,k,'var+');
 end;
drop i j k temp _temp sum  expression _expression min;
run;

Ksharp

PGStats · Posted 09-24-2012 09:29 PM

What I understand from the GRAYCODE function documentation is:

1) If you initialize it with k=-1, it zeros the variable vector on the first iteration.

2) If you want to start the sequence somewhere other than zero, you must provide a set of starting values in the variable vector, and k must equal the number of ones in your starting set.

3) The Gray code is cyclical, so as long as you start it properly, it will cover the whole set of 2**n subsets.
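To make the "cyclical, covers all 2**n subsets" property concrete, here is a minimal Python sketch of the standard reflected binary Gray code. This is an assumption-level illustration of the general idea, not a reproduction of the exact order that the SAS GRAYCODE function produces; the helper names `gray_ids` and `hamming` are made up for the example.

```python
def gray_ids(n):
    """Return all 2**n 0/1 strings of length n in reflected binary Gray code order."""
    return [format(i ^ (i >> 1), f"0{n}b") for i in range(2 ** n)]


def hamming(a, b):
    """Count the positions where two equal-length strings differ."""
    return sum(ca != cb for ca, cb in zip(a, b))


ids = gray_ids(6)
print(len(ids), len(set(ids)))  # 64 64
# Each step flips exactly one position, including the wrap from last back to first.
print(all(hamming(ids[i], ids[(i + 1) % len(ids)]) == 1 for i in range(len(ids))))  # True
```

Since the sequence wraps around with a single flip, starting anywhere in the cycle and taking 2**n-1 further steps still visits every subset once, provided the starting state is consistent.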

I agree with Art that the code I provided before was somewhat inefficient in the last SQL part. Here is a more efficient version (note the inclusion of variable v in the scoring process) :

data before;
input x1-x6 v;
cards;
0.279567742 0.978487097 0.978487097 0.978487097 0.978487097 0.139783871 3
0.928564286 1.083325 1.083325 1.083325 0.154760714 0 1.2
0.838703226 0.978487097 0.978487097 0.978487097 0.559135484 0 2.8
0.43333 1.011103333 1.011103333 1.011103333 0.86666 0 4.333
;

data x(keep=id x1-x6 v);
length id $6;
array x{6}(6*0);
v = -1;
/* if id=000000 is an acceptable answer, use this */
*k = -1;
*do i = 1 to 2**6;
/* otherwise use this */
x{1} = 1;
k = 1;
do i = 2 to 2**6;
     id = cats(of x1-x6);
     output;
     rc = graycode(k, of x1-x6);
     end;
run;

data scoring /view=scoring;
set before;
obs + 1; _TYPE_="PARMS"; _NAME_="SUMX";
run;

proc score data=x score=scoring type=PARMS out=scored;
by obs;
var v x1-x6;
id id;
run;

data postScore(keep=obs id score) / view=postScore;
set scored;
score = abs(SUMX);
run;

proc means data=postScore noprint;
by obs;
output out=minScores(drop=_:) idgroup(MIN(score) OUT(id)=);
run;

data after;
merge minScores scoring;
by obs;
drop obs _:;
run;

PG
