BookmarkSubscribeRSS Feed
Micha
Calcite | Level 5
Hi. I have 2 datasets, one is quite large with more than 20mil rows. I am trying to join the two datasets using proc sql. This is my query, but it takes a really long time.

"
Select [columns]

From Table2

left join Table1

on Table2.Number between Table1.bottom_range and Table1.top_range"

Is there a more efficient solution to doing this?

I have checked and the format of the fields in the join are all numeric.

Table1: has 3 columns , a bottom_range, top_range and a description. For example, the bottom_range might have values 10000, 20000 etc. The top_range might have values 19999, 29999 etc.

Table 2 has multiple columns, one of which, as an example, is called Number, with values such as 17999, 21887 etc. ) this table has 20 million+ observations.
8 REPLIES 8
Micha
Calcite | Level 5
1. There is no overlap - does it make a different if it overlaps or not?
2. No, the ranges don't cover all possible values in table2.
Kurt_Bremser
Super User

No overlaps means that you can't have multiple observations out of one source observation.

Consider creating a value format with ranges out of table1, and applying that to table2 in a sequential data step.

Depending on the real range of your numbers and the size of the descriptions, using a temporary array or a hash might also be possible.

We will need more details about these particulars to make suggestions with code.

Kurt_Bremser
Super User

Example of  creating a format out of your table:

data myfmt;
set table1 (
  end=done
  rename=(
    bottom_range = start
    top_range = end
    description = label
  )
);
;
fmtname = "myrange";
type = "N";
output;
if done
then do;
  start = "other";
  hlo = "O";
  label = "";
  output;
end;
run;

proc format cntlin=myfmt;
run;

data want;
set table2;
description = put(number,myrange.);
run;

Untested; for tested code, provide usable example data in data steps with datalines.

Patrick
Opal | Level 21

Given your on condition not using any additional key variable for the join, are you eventually after something like below?

Select [columns]
From Table2
left join 
(
select min(bottom_range) as min_val, max(top_range) as top_val
from table1
) t1
on Table2.Number between t1.min_val and t1.top_val

Your current join condition leads likely to a many:many join which could explain the long runtime.

 

If you need to look at the individual ranges then some sequential logic using a SAS datastep is likely more efficient.

How many rows do you have in table1?

Do you need a row in the result set for every single matching ranges (=rows from table 2 duplicated) or just select the row if any of the ranges matches?

yabwon
Onyx | Level 15

You can do it with array probing, it requires some "pre-requisites" but executes with no sorting, and with only one data reading for bigger table. 

Test data:

data Table1;
input bottom_range top_range description $ ;
cards;
1 4 A
5 9 B
15 19 D
20 24 E
30 34 F
;
run;

data Table2;
  call streaminit(1234);
  do _N_ = 1 to 100;
    do c1 = "X","Y","Z";
      c2 = catx("_", "ABC", _N_);
      number = rand('integer', 0, 30);
      output;
    end;
  end;
run;

Pre-requisites:

/* sort table with ranges (they do not overlap that's IMPORTANT) */
proc sort data = Table1;
  by bottom_range top_range;
run;

/* create temporary table expanding ranges and collect metadata */
data table1A;
  set table1 end=end;
  do Number = bottom_range to top_range;
    output;
  end;
  if _N_ = 1 then call symputX("nobsB", bottom_range, "G");
  if end then call symputX("nobsT", top_range, "G");
run;

Execution:

data want;
  /* create and populate an arrays for data */
  /* if you need only description delete T and B */
  do until (eof1);
    set Table1A end = eof1;
    array t [&nobsB.:&nobsT.]  _temporary_;
    array b [&nobsB.:&nobsT.]  _temporary_;
    array d [&nobsB.:&nobsT.] $ _temporary_;
    b [number] = bottom_range;
    t [number] = top_range;
    d [number] = description;
  end;

  do until (eof2);
    set Table2 end = eof2;

    /* do the array probbing */
    /* if "hit" then point to exact data cell */
    if &nobsB. <= number <= &nobsT. then
      do;
        bottom_range = b [number];
        top_range    = t [number];
        description  = d [number];
      end;
    else call missing(bottom_range, top_range, description);

    output;
  end;

stop;
run;

 

Bart

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation



yabwon
Onyx | Level 15

I case the ranges would overlap (I know I'm "driving away" form the main topic) similar approach with array probing could be used. it requires some modifications to "pre-requisites" and 2 additional arrays, but does the job without any modification of the "big" data set.

 

Data:

 

data Table1;
input bottom_range top_range description $ ;
cards;
1 4 A
5 9 B
15 23 D
20 24 E
30 34 F
30 40 G
;
run;

data Table2;
  call streaminit(1234);
  do _N_ = 1 to 10;
    do c1 = "X","Y","Z";
      c2 = catx("_", "ABC", _N_);
      number = rand('integer', 0, 42);
      output;
    end;
  end;
run;

Pre-requisites:

 

 

/* create temporary table expanding ranges */
data table1A;
  set table1 end=end;
  do Number = bottom_range to top_range;
    output;
  end;
run;

/* sort table with ranges */
proc sort data = Table1A;
  by Number bottom_range top_range;
run;

/* ... and collect metadata */
data _null_;
  do point = 1,nobs;
    set Table1A nobs=nobs point=point;

    if point = 1 then call symputX("nobsB", bottom_range, "G");
    if point = nobs then call symputX("nobsT", top_range, "G");
  end;

  call symputX("nobs", nobs, "G");
  stop;
run;

Execution:

 

 

data want;
  /* create and populate an arrays for data */
  /* if you need only "description" variable delete T and B */
  do until (eof1);
    set Table1A end = eof1 curobs=curobs;
    by number;

    /* tables for ranges of repetitions, see Table1A */
    array n [&nobsB.:&nobsT.]  _temporary_;
    array o [&nobsB.:&nobsT.]  _temporary_;
    if first.number then n[number] = curobs;
    if last.number then o[number] = curobs;

    /* populate data */
    array t [&nobs.]  _temporary_;
    array b [&nobs.]  _temporary_;
    array d [&nobs.] $ _temporary_;
    b [curobs] = bottom_range;
    t [curobs] = top_range;
    d [curobs] = description;
  end;

  do until (eof2);
    set Table2 end = eof2;

    /* do the array probbing */
    /* if "hit" AND not missing (not existing range) then ... */ 
    if &nobsB. <= number <= &nobsT. and o[number] then
      /* ... point to exact data cells */
      do _N_ = n[number] to o[number];
        bottom_range = b [_N_];
        top_range    = t [_N_];
        description  = d [_N_];
        output;
      end;
    else 
      do;
        /* otherwise keep missing */
        call missing(bottom_range, top_range, description);
        output;
      end;
  end;

stop;
run;

 

 

For 3 millions obs and ranges between 0 and 40, log was:

NOTE: There were 39 observations read from the data set WORK.TABLE1A.
NOTE: There were 3000000 observations read from the data set WORK.TABLE2.
NOTE: The data set WORK.WANT has 3627859 observations and 6 variables.
NOTE: DATA statement used (Total process time):
      real time           0.81 seconds
      user cpu time       0.34 seconds
      system cpu time     0.46 seconds
      memory              896.06k
      OS Memory           19448.00k

All the best

Bart

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation



Sajid01
Meteorite | Level 14

Hello @Micha 
As per your post one of your datasets is over 20 million rows.
In case this is a table in a RDBMS database, then consider executing the query in the database itself.
You can pass the smaller table to the database or rather have it as a database table updating the data as needed.

sas-innovate-wordmark-2025-midnight.png

Register Today!

Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9. Sign up by March 14 for just $795.


Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 8 replies
  • 1299 views
  • 0 likes
  • 5 in conversation