Do the ranges in table1 overlap or not, and do the ranges cover all possible values in table2?
No overlaps means that you can't have multiple observations out of one source observation.
Consider creating a value format with ranges out of table1, and applying that to table2 in a sequential data step.
Depending on the real range of your numbers and the size of the descriptions, using a temporary array or a hash might also be possible.
We will need more details about these particulars to make suggestions with code.
Example of creating a format out of your table:
data myfmt;
set table1 (
end=done
rename=(
bottom_range = start
top_range = end
description = label
)
);
;
fmtname = "myrange";
type = "N";
output;
if done
then do;
start = "other";
hlo = "O";
label = "";
output;
end;
run;
proc format cntlin=myfmt;
run;
data want;
set table2;
description = put(number,myrange.);
run;
Untested; for tested code, provide usable example data in data steps with datalines.
Given your on condition not using any additional key variable for the join, are you eventually after something like below?
Select [columns]
From Table2
left join
(
select min(bottom_range) as min_val, max(top_range) as top_val
from table1
) t1
on Table2.Number between t1.min_val and t1.top_val
Your current join condition leads likely to a many:many join which could explain the long runtime.
If you need to look at the individual ranges then some sequential logic using a SAS datastep is likely more efficient.
How many rows do you have in table1?
Do you need a row in the result set for every single matching ranges (=rows from table 2 duplicated) or just select the row if any of the ranges matches?
You can do it with array probing, it requires some "pre-requisites" but executes with no sorting, and with only one data reading for bigger table.
Test data:
data Table1;
input bottom_range top_range description $ ;
cards;
1 4 A
5 9 B
15 19 D
20 24 E
30 34 F
;
run;
data Table2;
call streaminit(1234);
do _N_ = 1 to 100;
do c1 = "X","Y","Z";
c2 = catx("_", "ABC", _N_);
number = rand('integer', 0, 30);
output;
end;
end;
run;
Pre-requisites:
/* sort table with ranges (they do not overlap that's IMPORTANT) */
proc sort data = Table1;
by bottom_range top_range;
run;
/* create temporary table expanding ranges and collect metadata */
data table1A;
set table1 end=end;
do Number = bottom_range to top_range;
output;
end;
if _N_ = 1 then call symputX("nobsB", bottom_range, "G");
if end then call symputX("nobsT", top_range, "G");
run;
Execution:
data want;
/* create and populate an arrays for data */
/* if you need only description delete T and B */
do until (eof1);
set Table1A end = eof1;
array t [&nobsB.:&nobsT.] _temporary_;
array b [&nobsB.:&nobsT.] _temporary_;
array d [&nobsB.:&nobsT.] $ _temporary_;
b [number] = bottom_range;
t [number] = top_range;
d [number] = description;
end;
do until (eof2);
set Table2 end = eof2;
/* do the array probbing */
/* if "hit" then point to exact data cell */
if &nobsB. <= number <= &nobsT. then
do;
bottom_range = b [number];
top_range = t [number];
description = d [number];
end;
else call missing(bottom_range, top_range, description);
output;
end;
stop;
run;
Bart
I case the ranges would overlap (I know I'm "driving away" form the main topic) similar approach with array probing could be used. it requires some modifications to "pre-requisites" and 2 additional arrays, but does the job without any modification of the "big" data set.
Data:
data Table1;
input bottom_range top_range description $ ;
cards;
1 4 A
5 9 B
15 23 D
20 24 E
30 34 F
30 40 G
;
run;
data Table2;
call streaminit(1234);
do _N_ = 1 to 10;
do c1 = "X","Y","Z";
c2 = catx("_", "ABC", _N_);
number = rand('integer', 0, 42);
output;
end;
end;
run;
Pre-requisites:
/* create temporary table expanding ranges */
data table1A;
set table1 end=end;
do Number = bottom_range to top_range;
output;
end;
run;
/* sort table with ranges */
proc sort data = Table1A;
by Number bottom_range top_range;
run;
/* ... and collect metadata */
data _null_;
do point = 1,nobs;
set Table1A nobs=nobs point=point;
if point = 1 then call symputX("nobsB", bottom_range, "G");
if point = nobs then call symputX("nobsT", top_range, "G");
end;
call symputX("nobs", nobs, "G");
stop;
run;
Execution:
data want;
/* create and populate an arrays for data */
/* if you need only "description" variable delete T and B */
do until (eof1);
set Table1A end = eof1 curobs=curobs;
by number;
/* tables for ranges of repetitions, see Table1A */
array n [&nobsB.:&nobsT.] _temporary_;
array o [&nobsB.:&nobsT.] _temporary_;
if first.number then n[number] = curobs;
if last.number then o[number] = curobs;
/* populate data */
array t [&nobs.] _temporary_;
array b [&nobs.] _temporary_;
array d [&nobs.] $ _temporary_;
b [curobs] = bottom_range;
t [curobs] = top_range;
d [curobs] = description;
end;
do until (eof2);
set Table2 end = eof2;
/* do the array probbing */
/* if "hit" AND not missing (not existing range) then ... */
if &nobsB. <= number <= &nobsT. and o[number] then
/* ... point to exact data cells */
do _N_ = n[number] to o[number];
bottom_range = b [_N_];
top_range = t [_N_];
description = d [_N_];
output;
end;
else
do;
/* otherwise keep missing */
call missing(bottom_range, top_range, description);
output;
end;
end;
stop;
run;
For 3 millions obs and ranges between 0 and 40, log was:
NOTE: There were 39 observations read from the data set WORK.TABLE1A. NOTE: There were 3000000 observations read from the data set WORK.TABLE2. NOTE: The data set WORK.WANT has 3627859 observations and 6 variables. NOTE: DATA statement used (Total process time): real time 0.81 seconds user cpu time 0.34 seconds system cpu time 0.46 seconds memory 896.06k OS Memory 19448.00k
All the best
Bart
Hello @Micha
As per your post one of your datasets is over 20 million rows.
In case this is a table in a RDBMS database, then consider executing the query in the database itself.
You can pass the smaller table to the database or rather have it as a database table updating the data as needed.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.