I have two tables
table 1
31
53
25
125
129
table 2
31
62
93
124
155
I want values on table 1 to be matched with the closest 'next' higher value from table 2.
Desired table
input output
31 31
53 62
25 31
125 155
129 155
Hey guys! I used your ideas and came up with this simple answer. Thank you so much for all the suggestions. This community is the best!!
proc sql;
create table test as
select x,
(select max(y) from t2) as y
from t1
where x > (select max(y) from t2)
union
select x,
(select min(y) from t2 where y>=x) as y
from t1
where t1;
quit;
How big are these tables? For the size you have there, I'd be doing a temporary array but that loads it into memory so if it's a larger data set it may not work. Or if you need GROUP BY processing that approach won't work either.
@Anirudh9 wrote:
I have two tables
table 1
31
53
25
125
129
table 2
31
62
93
124
155
I want values on table 1 to be matched with the closest 'next' higher value from table 2.
Desired table
input output
31 31
53 62
25 31
125 155
129 155
Maybe 30-40 rows max.
Here's one way then:
data t1;
input value1;
cards;
31
53
25
125
129
;
data t2;
input value2;
cards;
31
62
93
124
155
;
run;
data find_nearest;
*load data into temporary array;
array _nearest(5) _temporary_;
if _n_=1 then do j=1 to dim(_nearest);
set T2;*dataset with values to load into array;
_nearest(j) = value2;
end;
call sortn(of _nearest(*));
set T1;
do i=1 to dim(_nearest);
if _nearest(i) > value1 then leave;
end;
nearest = _nearest(i);
Difference = _nearest(i) - value1;
keep value1 nearest difference;
run;
If you original data sets were both sorted, you could skip the proc sorts here:
data t1;
input value1;
cards;
31
53
25
125
129
;
data t2;
input value2;
cards;
31
62
93
124
155
;
run;
proc sort data=t1; by value1;run;
proc sort data=t2; by value2;run;
data want;
set t1 (keep=value1 rename=(value1=value2) in=in1)
t2 (in=in2);
by value2;
if in1 then set t1;
if in2;
run;
The SAS behavior taken advantage of here is how and when the PDV (program data vector) gets assigned new values. In the case of variables from t1 or t2, that only happens when there is a corresponding SET statement. So VALUE1 (and all other T1 variables) is only updated when the "if in1 then set t1;" statement is executed. That means that VALUE1 is NOT changed when a T2 record is read in. The first SET statement [set t1 ..... t2 (in=in2)] doesn't encounter value1 since it is renamed (for order purposes) to value2.
This also means that if you have a T1 followed by multiple T2's, all those T2's would get the same (most recent) value1. And if you have several consecutive T1 records, only the final one would propagate to subsequent T2's.
As per comments elsewhere on this topic, the above does not accommodate keeping any T1 records for which value1 is higher than any available value2 in T2. Here is a modification to include such records:
data wantnew (drop=n2);
set t1 (keep=value1 rename=(value1=value2) in=in1)
t2 (in=in2)
nobs=nobs_both;
by value2;
if in1 then set t1 nobs=nobs1;
else n2+1;
if in2 or n2 >= nobs_both-nobs1;
if in1 then call missing(value2);
run;
data one;
input var1 ;
cards;
31
53
25
125
129
;
data two;
input var2;
cards;
31
62
93
124
155
;
proc sql;
create table want(drop=a) as
select *, ifn(var2-var1<0,.,var2-var1) as a
from one, two
group by var1
having a=min(a);
quit;
@mkeintz: A creative solution indeed, but does it solve the original question?
@Anirudh9: Here are two other suggestions. Please note that I have inserted value 180 into the first dataset, just to demonstrate what happens if there is no higher (or equal) value contained in the second dataset.
data t1;
input x @@;
cards;
31 53 25 125 129 180
;
data t2;
input y @@;
cards;
31 62 93 124 155
;
/* Another PROC SQL solution */
proc sql;
create table want as
select x, (select min(y) from t2 where y>=x) as y
from t1;
quit;
/* Another DATA step solution */
proc sort data=t2;
by y;
run;
data want;
set t1;
do p=1 to n until(y>=x);
set t2 point=p nobs=n;
end;
if y<x then y=.;
run;
@FreelanceReinh wrote:
@mkeintz: A creative solution indeed, but does it solve the original question?
The OP didn't specify what was wanted when there are value1's (variable X here) for which there are no next high value2's (variable Y). So yes, while my solution worked for the sample data presented, it simply ignored case for X greater than the max Y value. That may not be would the OP would want.
However, I don't think this mean one has to resort to the "do p=1 to n until(y>=x);" recommendation, which can be very expensive when data sets are large, since it requires starting at the beginning of data set 2 for every incoming record of data set 1 (although it eliminates the need to presort data set T1). If avoiding sorts really was crucial, I'd go with a hash object technique, using data set T2 as the hash object. I won't describe that approach here.
Instead, assuming proc sort's are affordable, one could tweak my original suggestion (data wantold;) as per the below (see the data wantnew step). I've already presorted T1 and T2 for this example.
data t1;
input x @@;
cards;
25 31 53 125 129 180
;
data t2;
input y @@;
cards;
31 62 93 124 155
;
data wantold;
set t1 (keep=x rename=(x=y) in=in1)
t2 (in=in2);
by y;
if in1 then set t1 nobs=nobs1;
if in2;
run;
data wantnew (drop=n2);
set t1 (keep=x rename=(x=y) in=in1)
t2 (in=in2)
nobs=nobs_both;
by y;
if in1 then set t1 nobs=nobs1;
else n2+1;
if in2 or n2 >= nobs_both-nobs1;
if in1 then call missing(y);
run;
The main difference above is to track how many T2 records have been read (variable N2) and compare that to the total number of T2 records available (NOBS_BOTH-NOBS1). The data want_new step includes all records following the exhaustion T2.
My main point in these suggestions is for sas programmers to realize that the PDV is their friend. It's very useful to be aware of how and when (and what part of) the PDV is being modified throughout the data step.
Thanks @mkeintz for your detailed reply. Sorry for being unspecific and putting two answers into the same post. I wasn't referring to the special case ("180") which I introduced, but to the OP's own example:
@Anirudh9 wrote:
I want values on table 1 to be matched with the closest 'next' higher value from table 2.
Desired table
input output
31 31
53 62
25 31
125 155
129 155
So, my understanding is that, e.g., value1=125 from (the OP's) "table 1" is to be matched with value2=155 from "table 2" -- the next higher value.
When I run your code (original version), I obtain this dataset WANT:
value2 value1 31 31 62 53 93 53 124 53 155 129
This differs significantly (i.e. not only in terms of row or column sort order) from the OP's "desired table". In particular, value1=125 doesn't occur at all. (Same with version WANTOLD. In WANTNEW only the correct assignment 180 --> . is added.)
Did I overlook or misunderstand something? (I suppose so because the above discrepancy is too obvious.)
I fully agree that the DO-loop approach I suggested would not be advisable for large tables. But I thought for datasets with "[m]aybe 30-40 rows max." (Anirudh9) it should be acceptable in terms of run time.
@mkeintz wrote:
My main point in these suggestions is for sas programmers to realize that the PDV is their friend. It's very useful to be aware of how and when (and what part of) the PDV is being modified throughout the data step.
Very true! Much appreciated.
Well, it took two commens from @FreelanceReinh but I now see what he was pointing out. I was ignoring instance of T1 records that were not immediately followed by a T2, yielding an interesting solution to a problem not posed by the OP.
Here's a correct version. Like the others it also depends on asynchronous updating of the program data vector., reading from two sorted data sets:
data t1;
input x @@;
cards;
25 31 53 125 129 180
;
data t2;
input y @@;
cards;
31 62 93 124 155
;
data want_correct;
set t1;
do while (y<x and end_of_t2=0);
set t2 end=end_of_t2;
end;
if y<x then call missing(y);
run;
I wrote the same code on Python. Maybe you could help me translate the same to SAS.
The match rate is what my target value is. actual_rate is table 1, matrix_rate is table 2.
If the value In actual_rate is higher than max of matrix_rate, I want the highest value of matrix_rate to be used. I think this code will help understand my problem a lot better.
@mkeintz @FreelanceReinh @Ksharp @ChrisNZ
match_rate = []
actual_rate = [31,53,25,61,44,165,311,233] #These are samples, actual data is bigger, 30-40 rows
matrix_rate = [31,62,93,124,155,186,211,252,288]
for i in range(np.size(actual_rate)):
if actual_rate[i] > max(matrix_rate):
match_rate[i] = max(matrix_rate)
else:
temp1 = []
for j in range(np.size(matrix_rate)):
p = matrix_rate[j] - actual_rate[i]
if (p >= 0):
temp1.append(p)
match_rate[i] = min(temp1) + actual_rate[I]
@Anirudh9 there are several working solutions posted. None work for you?
@Anirudh9 wrote:
I wrote the same code on Python. Maybe you could help me translate the same to SAS.
The match rate is what my target value is. actual_rate is table 1, matrix_rate is table 2.
If the value In actual_rate is higher than max of matrix_rate, I want the highest value of matrix_rate to be used. I think this code will help understand my problem a lot better.
@mkeintz @FreelanceReinh @Ksharp @ChrisNZ
match_rate = []
actual_rate = [31,53,25,61,44,165,311,233] #These are samples, actual data is bigger, 30-40 rows
matrix_rate = [31,62,93,124,155,186,211,252,288]
for i in range(np.size(actual_rate)):
if actual_rate[i] > max(matrix_rate):
match_rate[i] = max(matrix_rate)
else:
temp1 = []
for j in range(np.size(matrix_rate)):
p = matrix_rate[j] - actual_rate[i]
if (p >= 0):
temp1.append(p)
match_rate[i] = min(temp1) + actual_rate[I]
This is a totally different question as the data is now horizontal.
I don't know python by this is probably pretty close.
data _null_;
array MATCH_RATE [8] ;
array ACTUAL_RATE [8] (31 53 25 61 44 165 311 233 ) ;
array MATRIX_RATE [9] (31 62 93 124 155 186 211 252 288) ;
do I= 1 to dim(ACTUAL_RATE);
if ACTUAL_RATE[I] > max(of MATRIX_RATE[*]) then
MATCH_RATE[I] = max(of MATRIX_RATE[*]);
else do;
TEMP1=.;
do J= 1 to dim(MATRIX_RATE);
P = MATRIX_RATE[J] - ACTUAL_RATE[I];
if P >= 0 then TEMP1=min(TEMP1,P);
end;
MATCH_RATE[I] = TEMP1 + ACTUAL_RATE[I];
putlog I= MATCH_RATE[I]=;
end;
end;
run;
I=1 MATCH_RATE1=31
I=2 MATCH_RATE2=62
I=3 MATCH_RATE3=31
I=4 MATCH_RATE4=62
I=5 MATCH_RATE5=62
I=6 MATCH_RATE6=186
I=8 MATCH_RATE8=252
data t1;
input x @@;
cards;
31 53 25 125 129 180
;
data t2;
input y @@;
cards;
31 62 93 124 155
;
proc sql;
create table want as
select distinct *
from t1,t2
where x<=y
group by x
having y=min(y);
quit;
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.