BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Anirudh9
Obsidian | Level 7

I have two tables 

 

table 1

31

53

25

125

129

 

table 2

31

62

93

124

155

 

 

I want values on table 1 to be matched with the closest 'next' higher value from table 2. 

 

Desired table

 

input output

31      31

53      62

25      31

125    155

129    155

 

1 ACCEPTED SOLUTION

Accepted Solutions
Anirudh9
Obsidian | Level 7

 

Hey guys! I used your ideas and came up with this simple answer. Thank you so much for all the suggestions. This community is the best!!

 

 

proc sql;

create table test as

select x,

(select max(y) from t2) as y

from t1

where x > (select max(y) from t2)

 

union

 

select x,

(select min(y) from t2 where y>=x) as y

from t1

where t1;

quit;

 

View solution in original post

16 REPLIES 16
Reeza
Super User

How big are these tables? For the size you have there, I'd be doing a temporary array but that loads it into memory so if it's a larger data set it may not work. Or if you need GROUP BY processing that approach won't work either.

 


@Anirudh9 wrote:

I have two tables 

 

table 1

31

53

25

125

129

 

table 2

31

62

93

124

155

 

 

I want values on table 1 to be matched with the closest 'next' higher value from table 2. 

 

Desired table

 

input output

31      31

53      62

25      31

125    155

129    155

 


 

Anirudh9
Obsidian | Level 7

Maybe 30-40 rows max.

Reeza
Super User

Here's one way then:

 

data t1;
input value1;
cards;
31
53
25
125
129
;

data t2;
input value2;
cards;
31
62
93
124
155
;
run;

data find_nearest;
*load data into temporary array;
array _nearest(5) _temporary_;

if _n_=1 then do j=1 to dim(_nearest);
    set T2;*dataset with values to load into array;
    _nearest(j) = value2;
end;

call sortn(of _nearest(*));

set T1;

do i=1 to dim(_nearest);
    if _nearest(i) > value1 then leave;
end;
nearest = _nearest(i);
Difference = _nearest(i) - value1;

keep value1 nearest difference;

run;


mkeintz
PROC Star

If you original data sets were both sorted, you could skip the proc sorts here:

 

data t1;
input value1;
cards;
31
53
25
125
129
;

data t2;
input value2;
cards;
31
62
93
124
155
;
run;

proc sort data=t1; by value1;run;
proc sort data=t2; by value2;run;

data want;
  set t1 (keep=value1 rename=(value1=value2)  in=in1)
      t2 (in=in2);
  by value2;
  if in1 then set t1;
  if in2;
run;

The SAS behavior taken advantage of here is how and when the PDV (program data vector) gets assigned new values.  In the case of variables from t1 or t2, that only happens when there is a corresponding SET statement.  So VALUE1 (and all other T1 variables) is only updated when the "if in1 then set t1;" statement is executed.  That means that VALUE1 is NOT changed when a T2 record is read in.  The first SET statement  [set t1 ..... t2 (in=in2)] doesn't encounter value1 since it is renamed (for order purposes) to value2.

 

This also means that if you have a T1 followed by multiple T2's, all those T2's would get the same (most recent) value1.  And if you have several consecutive T1 records, only the final one would propagate to subsequent T2's.

 

 

 

As per comments elsewhere on this topic, the above does not accommodate keeping any T1 records for which value1 is higher than any available value2 in T2.  Here is a modification to include such records:

 

data wantnew (drop=n2);
  set t1 (keep=value1 rename=(value1=value2)  in=in1)
      t2 (in=in2)
      nobs=nobs_both;
  by value2;
  if in1 then set t1 nobs=nobs1;
  else n2+1;
  if in2 or n2 >= nobs_both-nobs1;
  if in1 then call missing(value2);
run;

 

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
novinosrin
Tourmaline | Level 20

data one;
input var1 ;
cards;
31
53
25
125
129
;

data two;
input var2;
cards;
31
62
93
124
155
;
 
proc sql;
create table want(drop=a) as
select *, ifn(var2-var1<0,.,var2-var1) as a
from one, two
group by var1
having a=min(a);
quit;
FreelanceReinh
Jade | Level 19

@mkeintz: A creative solution indeed, but does it solve the original question?

 

@Anirudh9: Here are two other suggestions. Please note that I have inserted value 180 into the first dataset, just to demonstrate what happens if there is no higher (or equal) value contained in the second dataset.

 

data t1;
input x @@;
cards;
31 53 25 125 129 180
;

data t2;
input y @@;
cards;
31 62 93 124 155
;

/* Another PROC SQL solution */

proc sql;
create table want as
select x, (select min(y) from t2 where y>=x) as y
from t1;
quit;


/* Another DATA step solution */

proc sort data=t2;
by y;
run;

data want;
set t1;
do p=1 to n until(y>=x);
  set t2 point=p nobs=n;
end;
if y<x then y=.;
run;
mkeintz
PROC Star

@FreelanceReinh wrote:

@mkeintz: A creative solution indeed, but does it solve the original question?

The OP didn't specify what was wanted when there are value1's (variable X here) for which there are no next high value2's (variable Y).  So yes, while my solution worked for the sample data presented, it simply ignored case for X greater than the max Y value.  That may not be would the OP would want. 

 

However, I don't think this mean one has to resort to the "do p=1 to n until(y>=x);" recommendation, which can be very expensive when data sets are large, since it requires starting at the beginning of data set 2 for every incoming record of data set 1 (although it eliminates the need to presort data set T1).  If avoiding sorts really was crucial, I'd go with a hash object technique, using data set T2 as the hash object.  I won't describe that approach here.

 

Instead, assuming proc sort's are affordable, one could tweak my original suggestion (data wantold;) as per the below (see the data wantnew step).  I've already presorted T1 and T2 for this example.

 

data t1;
input x @@;
cards;
25 31 53 125 129 180
;

data t2;
input y @@;
cards;
31 62 93 124 155
;


data wantold;
  set t1 (keep=x rename=(x=y)  in=in1)
      t2 (in=in2);
  by y;
  if in1 then set t1 nobs=nobs1;
  if in2;
run;


data wantnew (drop=n2);
  set t1 (keep=x rename=(x=y)  in=in1)
      t2 (in=in2)
      nobs=nobs_both;
  by y;
  if in1 then set t1 nobs=nobs1;
  else n2+1;
  if in2 or n2 >= nobs_both-nobs1;
  if in1 then call missing(y);
run;

 

The main difference above is to track how many T2 records have been read (variable N2) and compare that to the total number of T2 records available (NOBS_BOTH-NOBS1).  The data want_new step includes all records following the exhaustion T2.

 

My main point in these suggestions is for sas programmers to realize that the PDV is their friend.  It's very useful to be aware of how and when (and what part of) the PDV is being modified throughout the data step.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
FreelanceReinh
Jade | Level 19

Thanks @mkeintz for your detailed reply. Sorry for being unspecific and putting two answers into the same post. I wasn't referring to the special case ("180") which I introduced, but to the OP's own example:

@Anirudh9 wrote:


I want values on table 1 to be matched with the closest 'next' higher value from table 2. 

 

Desired table

 

input output

31      31

53      62

25      31

125    155

129    155 


So, my understanding is that, e.g., value1=125 from (the OP's) "table 1" is to be matched with value2=155 from "table 2" -- the next higher value.

 

When I run your code (original version), I obtain this dataset WANT:

value2    value1

   31        31
   62        53
   93        53
  124        53
  155       129

This differs significantly (i.e. not only in terms of row or column sort order) from the OP's "desired table". In particular, value1=125 doesn't occur at all. (Same with version WANTOLD. In WANTNEW only the correct assignment 180 --> . is added.)

 

Did I overlook or misunderstand something? (I suppose so because the above discrepancy is too obvious.)

 

I fully agree that the DO-loop approach I suggested would not be advisable for large tables. But I thought for datasets with "[m]aybe 30-40 rows max." (Anirudh9) it should be acceptable in terms of run time.

 


@mkeintz wrote:

My main point in these suggestions is for sas programmers to realize that the PDV is their friend.  It's very useful to be aware of how and when (and what part of) the PDV is being modified throughout the data step.


Very true! Much appreciated.

mkeintz
PROC Star

Well, it took two commens from @FreelanceReinh but I now see what he was pointing out.  I was ignoring instance of T1 records that were not immediately followed by a T2, yielding an interesting solution to a problem not posed by the OP.

 

Here's a correct version.  Like the others it also depends on asynchronous updating of the program data vector., reading from two sorted data sets:

 

data t1;
input x @@;
cards;
25 31 53 125 129 180
;

data t2;
input y @@;
cards;
31 62 93 124 155
;


data want_correct;
  set t1;
  do while (y<x  and end_of_t2=0);
    set t2 end=end_of_t2;
  end;
  if y<x then call missing(y);
run;

 

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
Anirudh9
Obsidian | Level 7

 

 

 

 

I wrote the same code on Python. Maybe you could help me translate the same to SAS.

 

The match rate is what my target value is. actual_rate is table 1, matrix_rate is table 2.

 

If the value In actual_rate is higher than max of matrix_rate, I want the highest value of matrix_rate to be used. I think this code will help understand my problem a lot better.

 

@mkeintz @FreelanceReinh @Ksharp @ChrisNZ 

 

 

match_rate = []
actual_rate = [31,53,25,61,44,165,311,233]  #These are samples, actual data is bigger, 30-40 rows
matrix_rate = [31,62,93,124,155,186,211,252,288]


for i in range(np.size(actual_rate)):
 
 if actual_rate[i] > max(matrix_rate):
  match_rate[i] = max(matrix_rate)


 else:
  temp1 = []
  for j in range(np.size(matrix_rate)):
   p = matrix_rate[j] - actual_rate[i]
   if (p >= 0):
    temp1.append(p)


  match_rate[i] = min(temp1) + actual_rate[I]

Reeza
Super User

@Anirudh9 there are several working solutions posted. None work for you?

 


@Anirudh9 wrote:

 

 

 

 

I wrote the same code on Python. Maybe you could help me translate the same to SAS.

 

The match rate is what my target value is. actual_rate is table 1, matrix_rate is table 2.

 

If the value In actual_rate is higher than max of matrix_rate, I want the highest value of matrix_rate to be used. I think this code will help understand my problem a lot better.

 

@mkeintz @FreelanceReinh @Ksharp @ChrisNZ 

 

 

match_rate = []
actual_rate = [31,53,25,61,44,165,311,233]  #These are samples, actual data is bigger, 30-40 rows
matrix_rate = [31,62,93,124,155,186,211,252,288]


for i in range(np.size(actual_rate)):
 
 if actual_rate[i] > max(matrix_rate):
  match_rate[i] = max(matrix_rate)


 else:
  temp1 = []
  for j in range(np.size(matrix_rate)):
   p = matrix_rate[j] - actual_rate[i]
   if (p >= 0):
    temp1.append(p)


  match_rate[i] = min(temp1) + actual_rate[I]


 

ChrisNZ
Tourmaline | Level 20

This is a totally different question as the data is now horizontal.

I don't know python by this is probably pretty close.


data _null_;

  array MATCH_RATE  [8] ;
  array ACTUAL_RATE [8] (31 53 25  61  44 165 311 233    ) ;  
  array MATRIX_RATE [9] (31 62 93 124 155 186 211 252 288) ;

  do I= 1 to dim(ACTUAL_RATE);
   
    if ACTUAL_RATE[I] > max(of MATRIX_RATE[*]) then 
      MATCH_RATE[I] = max(of MATRIX_RATE[*]); 
     
    else do;
      TEMP1=.;
      do J= 1 to dim(MATRIX_RATE);
         P = MATRIX_RATE[J] - ACTUAL_RATE[I];
         if P >= 0 then  TEMP1=min(TEMP1,P);
      end;
  
      MATCH_RATE[I] = TEMP1 + ACTUAL_RATE[I]; 
      putlog I=   MATCH_RATE[I]=;
   end;
 end;
run;

I=1 MATCH_RATE1=31
I=2 MATCH_RATE2=62
I=3 MATCH_RATE3=31
I=4 MATCH_RATE4=62
I=5 MATCH_RATE5=62
I=6 MATCH_RATE6=186
I=8 MATCH_RATE8=252

 

Ksharp
Super User
data t1;
input x @@;
cards;
31 53 25 125 129 180
;

data t2;
input y @@;
cards;
31 62 93 124 155
;

proc sql;
create table want as
 select distinct *
  from t1,t2
   where x<=y
    group by x
	 having y=min(y);
quit;

SAS Innovate 2025: Call for Content

Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!

Submit your idea!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 16 replies
  • 7750 views
  • 13 likes
  • 8 in conversation