Solved: Finding Closest Value to a List of Values

Anirudh9 · Posted 09-24-2018 03:35 PM

I have two tables

table 1

31

53

25

125

129

table 2

31

62

93

124

155

I want values on table 1 to be matched with the closest 'next' higher value from table 2.

Desired table

input output

31 31

53 62

25 31

125 155

129 155

Anirudh9 · Posted 09-26-2018 02:27 PM

Hey guys! I used your ideas and came up with this simple answer. Thank you so much for all the suggestions. This community is the best!!

proc sql;

create table test as

select x,

(select max(y) from t2) as y

from t1

where x > (select max(y) from t2)

union

select x,

(select min(y) from t2 where y>=x) as y

from t1

where t1;

quit;

View solution in original post

Reeza · Posted 09-24-2018 03:40 PM

How big are these tables? For the size you have there, I'd be doing a temporary array but that loads it into memory so if it's a larger data set it may not work. Or if you need GROUP BY processing that approach won't work either.

@Anirudh9 wrote:

I have two tables

table 1

31

53

25

125

129

table 2

31

62

93

124

155

I want values on table 1 to be matched with the closest 'next' higher value from table 2.

Desired table

input output

31 31

53 62

25 31

125 155

129 155

Anirudh9 · Posted 09-24-2018 03:42 PM

Maybe 30-40 rows max.

Reeza · Posted 09-24-2018 03:57 PM

Here's one way then:

data t1;
input value1;
cards;
31
53
25
125
129
;

data t2;
input value2;
cards;
31
62
93
124
155
;
run;

data find_nearest;
*load data into temporary array;
array _nearest(5) _temporary_;

if _n_=1 then do j=1 to dim(_nearest);
    set T2;*dataset with values to load into array;
    _nearest(j) = value2;
end;

call sortn(of _nearest(*));

set T1;

do i=1 to dim(_nearest);
    if _nearest(i) > value1 then leave;
end;
nearest = _nearest(i);
Difference = _nearest(i) - value1;

keep value1 nearest difference;

run;

mkeintz · Posted 09-24-2018 05:06 PM

If you original data sets were both sorted, you could skip the proc sorts here:

data t1;
input value1;
cards;
31
53
25
125
129
;

data t2;
input value2;
cards;
31
62
93
124
155
;
run;

proc sort data=t1; by value1;run;
proc sort data=t2; by value2;run;

data want;
  set t1 (keep=value1 rename=(value1=value2)  in=in1)
      t2 (in=in2);
  by value2;
  if in1 then set t1;
  if in2;
run;

The SAS behavior taken advantage of here is how and when the PDV (program data vector) gets assigned new values. In the case of variables from t1 or t2, that only happens when there is a corresponding SET statement. So VALUE1 (and all other T1 variables) is only updated when the "if in1 then set t1;" statement is executed. That means that VALUE1 is NOT changed when a T2 record is read in. The first SET statement [set t1 ..... t2 (in=in2)] doesn't encounter value1 since it is renamed (for order purposes) to value2.

This also means that if you have a T1 followed by multiple T2's, all those T2's would get the same (most recent) value1. And if you have several consecutive T1 records, only the final one would propagate to subsequent T2's.

As per comments elsewhere on this topic, the above does not accommodate keeping any T1 records for which value1 is higher than any available value2 in T2. Here is a modification to include such records:

data wantnew (drop=n2);
  set t1 (keep=value1 rename=(value1=value2)  in=in1)
      t2 (in=in2)
      nobs=nobs_both;
  by value2;
  if in1 then set t1 nobs=nobs1;
  else n2+1;
  if in2 or n2 >= nobs_both-nobs1;
  if in1 then call missing(value2);
run;

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

ChrisNZ · Posted 09-24-2018 11:47 PM

@mkeintz So clever!

High-Performance SAS Coding - Third Edition

novinosrin · Posted 09-24-2018 04:03 PM


data one;
input var1 ;
cards;
31
53
25
125
129
;

data two;
input var2;
cards;
31
62
93
124
155
;
 
proc sql;
create table want(drop=a) as
select *, ifn(var2-var1<0,.,var2-var1) as a
from one, two
group by var1
having a=min(a);
quit;

FreelanceReinh · Posted 09-25-2018 06:24 AM

@mkeintz: A creative solution indeed, but does it solve the original question?

@Anirudh9: Here are two other suggestions. Please note that I have inserted value 180 into the first dataset, just to demonstrate what happens if there is no higher (or equal) value contained in the second dataset.

data t1;
input x @@;
cards;
31 53 25 125 129 180
;

data t2;
input y @@;
cards;
31 62 93 124 155
;

/* Another PROC SQL solution */

proc sql;
create table want as
select x, (select min(y) from t2 where y>=x) as y
from t1;
quit;


/* Another DATA step solution */

proc sort data=t2;
by y;
run;

data want;
set t1;
do p=1 to n until(y>=x);
  set t2 point=p nobs=n;
end;
if y<x then y=.;
run;

mkeintz · Posted 09-25-2018 02:47 PM

@FreelanceReinh wrote:

@mkeintz: A creative solution indeed, but does it solve the original question?

The OP didn't specify what was wanted when there are value1's (variable X here) for which there are no next high value2's (variable Y). So yes, while my solution worked for the sample data presented, it simply ignored case for X greater than the max Y value. That may not be would the OP would want.

However, I don't think this mean one has to resort to the "do p=1 to n until(y>=x);" recommendation, which can be very expensive when data sets are large, since it requires starting at the beginning of data set 2 for every incoming record of data set 1 (although it eliminates the need to presort data set T1). If avoiding sorts really was crucial, I'd go with a hash object technique, using data set T2 as the hash object. I won't describe that approach here.

Instead, assuming proc sort's are affordable, one could tweak my original suggestion (data wantold;) as per the below (see the data wantnew step). I've already presorted T1 and T2 for this example.

data t1;
input x @@;
cards;
25 31 53 125 129 180
;

data t2;
input y @@;
cards;
31 62 93 124 155
;


data wantold;
  set t1 (keep=x rename=(x=y)  in=in1)
      t2 (in=in2);
  by y;
  if in1 then set t1 nobs=nobs1;
  if in2;
run;


data wantnew (drop=n2);
  set t1 (keep=x rename=(x=y)  in=in1)
      t2 (in=in2)
      nobs=nobs_both;
  by y;
  if in1 then set t1 nobs=nobs1;
  else n2+1;
  if in2 or n2 >= nobs_both-nobs1;
  if in1 then call missing(y);
run;

The main difference above is to track how many T2 records have been read (variable N2) and compare that to the total number of T2 records available (NOBS_BOTH-NOBS1). The data want_new step includes all records following the exhaustion T2.

My main point in these suggestions is for sas programmers to realize that the PDV is their friend. It's very useful to be aware of how and when (and what part of) the PDV is being modified throughout the data step.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

FreelanceReinh · Posted 09-25-2018 05:43 PM

Thanks @mkeintz for your detailed reply. Sorry for being unspecific and putting two answers into the same post. I wasn't referring to the special case ("180") which I introduced, but to the OP's own example:

@Anirudh9 wrote:

I want values on table 1 to be matched with the closest 'next' higher value from table 2.

Desired table

input output

31 31

53 62

25 31

125 155

129 155

So, my understanding is that, e.g., value1=125 from (the OP's) "table 1" is to be matched with value2=155 from "table 2" -- the next higher value.

When I run your code (original version), I obtain this dataset WANT:

value2    value1

   31        31
   62        53
   93        53
  124        53
  155       129

This differs significantly (i.e. not only in terms of row or column sort order) from the OP's "desired table". In particular, value1=125 doesn't occur at all. (Same with version WANTOLD. In WANTNEW only the correct assignment 180 --> . is added.)

Did I overlook or misunderstand something? (I suppose so because the above discrepancy is too obvious.)

I fully agree that the DO-loop approach I suggested would not be advisable for large tables. But I thought for datasets with "[m]aybe 30-40 rows max." (Anirudh9) it should be acceptable in terms of run time.

@mkeintz wrote:

My main point in these suggestions is for sas programmers to realize that the PDV is their friend. It's very useful to be aware of how and when (and what part of) the PDV is being modified throughout the data step.

Very true! Much appreciated.

mkeintz · Posted 09-26-2018 09:42 AM

Well, it took two commens from @FreelanceReinh but I now see what he was pointing out. I was ignoring instance of T1 records that were not immediately followed by a T2, yielding an interesting solution to a problem not posed by the OP.

Here's a correct version. Like the others it also depends on asynchronous updating of the program data vector., reading from two sorted data sets:

data t1;
input x @@;
cards;
25 31 53 125 129 180
;

data t2;
input y @@;
cards;
31 62 93 124 155
;


data want_correct;
  set t1;
  do while (y<x  and end_of_t2=0);
    set t2 end=end_of_t2;
  end;
  if y<x then call missing(y);
run;

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Anirudh9 · Posted 09-26-2018 10:18 AM

I wrote the same code on Python. Maybe you could help me translate the same to SAS.

The match rate is what my target value is. actual_rate is table 1, matrix_rate is table 2.

If the value In actual_rate is higher than max of matrix_rate, I want the highest value of matrix_rate to be used. I think this code will help understand my problem a lot better.

@mkeintz @FreelanceReinh @Ksharp @ChrisNZ

match_rate = []
actual_rate = [31,53,25,61,44,165,311,233] #These are samples, actual data is bigger, 30-40 rows
matrix_rate = [31,62,93,124,155,186,211,252,288]

for i in range(np.size(actual_rate)):

if actual_rate[i] > max(matrix_rate):
match_rate[i] = max(matrix_rate)

else:
  temp1 = []
  for j in range(np.size(matrix_rate)):
   p = matrix_rate[j] - actual_rate[i]
   if (p >= 0):
    temp1.append(p)

match_rate[i] = min(temp1) + actual_rate[I]

Reeza · Posted 09-26-2018 10:43 AM

@Anirudh9 there are several working solutions posted. None work for you?

@Anirudh9 wrote:

I wrote the same code on Python. Maybe you could help me translate the same to SAS.

The match rate is what my target value is. actual_rate is table 1, matrix_rate is table 2.

If the value In actual_rate is higher than max of matrix_rate, I want the highest value of matrix_rate to be used. I think this code will help understand my problem a lot better.

@mkeintz @FreelanceReinh @Ksharp @ChrisNZ

match_rate = []
actual_rate = [31,53,25,61,44,165,311,233]  #These are samples, actual data is bigger, 30-40 rows
matrix_rate = [31,62,93,124,155,186,211,252,288]

for i in range(np.size(actual_rate)):

if actual_rate[i] > max(matrix_rate):
  match_rate[i] = max(matrix_rate)

else:
  temp1 = []
  for j in range(np.size(matrix_rate)):
   p = matrix_rate[j] - actual_rate[i]
   if (p >= 0):
    temp1.append(p)

  match_rate[i] = min(temp1) + actual_rate[I]

ChrisNZ · Posted 09-26-2018 06:42 PM

This is a totally different question as the data is now horizontal.

I don't know python by this is probably pretty close.


data _null_;

  array MATCH_RATE  [8] ;
  array ACTUAL_RATE [8] (31 53 25  61  44 165 311 233    ) ;  
  array MATRIX_RATE [9] (31 62 93 124 155 186 211 252 288) ;

  do I= 1 to dim(ACTUAL_RATE);
   
    if ACTUAL_RATE[I] > max(of MATRIX_RATE[*]) then 
      MATCH_RATE[I] = max(of MATRIX_RATE[*]); 
     
    else do;
      TEMP1=.;
      do J= 1 to dim(MATRIX_RATE);
         P = MATRIX_RATE[J] - ACTUAL_RATE[I];
         if P >= 0 then  TEMP1=min(TEMP1,P);
      end;
  
      MATCH_RATE[I] = TEMP1 + ACTUAL_RATE[I]; 
      putlog I=   MATCH_RATE[I]=;
   end;
 end;
run;

I=1 MATCH_RATE1=31
I=2 MATCH_RATE2=62
I=3 MATCH_RATE3=31
I=4 MATCH_RATE4=62
I=5 MATCH_RATE5=62
I=6 MATCH_RATE6=186
I=8 MATCH_RATE8=252

High-Performance SAS Coding - Third Edition

Ksharp · Posted 09-25-2018 09:16 AM

data t1;
input x @@;
cards;
31 53 25 125 129 180
;

data t2;
input y @@;
cards;
31 62 93 124 155
;

proc sql;
create table want as
 select distinct *
  from t1,t2
   where x<=y
    group by x
	 having y=min(y);
quit;

Finding Closest Value to a List of Values

Re: Finding Closest Value to a List of Values

Re: Finding Closest Value to a List of Values

Re: Finding Closest Value to a List of Values

Re: Finding Closest Value to a List of Values

Re: Finding Closest Value to a List of Values

Re: Finding Closest Value to a List of Values

Re: Finding Closest Value to a List of Values

Re: Finding Closest Value to a List of Values

Re: Finding Closest Value to a List of Values

Re: Finding Closest Value to a List of Values

Re: Finding Closest Value to a List of Values

Re: Finding Closest Value to a List of Values

Re: Finding Closest Value to a List of Values

Re: Finding Closest Value to a List of Values

Re: Finding Closest Value to a List of Values

Registration is open

SAS Training: Just a Click Away