Re: Joining based on 2 ranges

Micha · Posted 04-10-2022 01:41 AM

Hi. I have 2 datasets, one is quite large with more than 20mil rows. I am trying to join the two datasets using proc sql. This is my query, but it takes a really long time.

"
Select [columns]

From Table2

left join Table1

on Table2.Number between Table1.bottom_range and Table1.top_range"

Is there a more efficient solution to doing this?

I have checked and the format of the fields in the join are all numeric.

Table1: has 3 columns , a bottom_range, top_range and a description. For example, the bottom_range might have values 10000, 20000 etc. The top_range might have values 19999, 29999 etc.

Table 2 has multiple columns, one of which, as an example, is called Number, with values such as 17999, 21887 etc. ) this table has 20 million+ observations.

Kurt_Bremser · Posted 04-10-2022 01:46 AM

Do the ranges in table1 overlap or not, and do the ranges cover all possible values in table2?

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Micha · Posted 04-10-2022 02:17 AM

1. There is no overlap - does it make a different if it overlaps or not?
2. No, the ranges don't cover all possible values in table2.

Kurt_Bremser · Posted 04-10-2022 02:51 AM

No overlaps means that you can't have multiple observations out of one source observation.

Consider creating a value format with ranges out of table1, and applying that to table2 in a sequential data step.

Depending on the real range of your numbers and the size of the descriptions, using a temporary array or a hash might also be possible.

We will need more details about these particulars to make suggestions with code.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Kurt_Bremser · Posted 04-10-2022 05:10 AM

Example of creating a format out of your table:

data myfmt;
set table1 (
  end=done
  rename=(
    bottom_range = start
    top_range = end
    description = label
  )
);
;
fmtname = "myrange";
type = "N";
output;
if done
then do;
  start = "other";
  hlo = "O";
  label = "";
  output;
end;
run;

proc format cntlin=myfmt;
run;

data want;
set table2;
description = put(number,myrange.);
run;

Untested; for tested code, provide usable example data in data steps with datalines.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Patrick · Posted 04-10-2022 02:43 AM

Given your on condition not using any additional key variable for the join, are you eventually after something like below?

Select [columns]
From Table2
left join 
(
select min(bottom_range) as min_val, max(top_range) as top_val
from table1
) t1
on Table2.Number between t1.min_val and t1.top_val

Your current join condition leads likely to a many:many join which could explain the long runtime.

If you need to look at the individual ranges then some sequential logic using a SAS datastep is likely more efficient.

How many rows do you have in table1?

Do you need a row in the result set for every single matching ranges (=rows from table 2 duplicated) or just select the row if any of the ranges matches?

yabwon · Posted 04-10-2022 03:16 AM

You can do it with array probing, it requires some "pre-requisites" but executes with no sorting, and with only one data reading for bigger table.

Test data:

data Table1;
input bottom_range top_range description $ ;
cards;
1 4 A
5 9 B
15 19 D
20 24 E
30 34 F
;
run;

data Table2;
  call streaminit(1234);
  do _N_ = 1 to 100;
    do c1 = "X","Y","Z";
      c2 = catx("_", "ABC", _N_);
      number = rand('integer', 0, 30);
      output;
    end;
  end;
run;

Pre-requisites:

/* sort table with ranges (they do not overlap that's IMPORTANT) */
proc sort data = Table1;
  by bottom_range top_range;
run;

/* create temporary table expanding ranges and collect metadata */
data table1A;
  set table1 end=end;
  do Number = bottom_range to top_range;
    output;
  end;
  if _N_ = 1 then call symputX("nobsB", bottom_range, "G");
  if end then call symputX("nobsT", top_range, "G");
run;

Execution:

data want;
  /* create and populate an arrays for data */
  /* if you need only description delete T and B */
  do until (eof1);
    set Table1A end = eof1;
    array t [&nobsB.:&nobsT.]  _temporary_;
    array b [&nobsB.:&nobsT.]  _temporary_;
    array d [&nobsB.:&nobsT.] $ _temporary_;
    b [number] = bottom_range;
    t [number] = top_range;
    d [number] = description;
  end;

  do until (eof2);
    set Table2 end = eof2;

    /* do the array probbing */
    /* if "hit" then point to exact data cell */
    if &nobsB. <= number <= &nobsT. then
      do;
        bottom_range = b [number];
        top_range    = t [number];
        description  = d [number];
      end;
    else call missing(bottom_range, top_range, description);

    output;
  end;

stop;
run;

Bart

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation

yabwon · Posted 04-10-2022 04:25 AM

I case the ranges would overlap (I know I'm "driving away" form the main topic) similar approach with array probing could be used. it requires some modifications to "pre-requisites" and 2 additional arrays, but does the job without any modification of the "big" data set.

Data:

data Table1;
input bottom_range top_range description $ ;
cards;
1 4 A
5 9 B
15 23 D
20 24 E
30 34 F
30 40 G
;
run;

data Table2;
  call streaminit(1234);
  do _N_ = 1 to 10;
    do c1 = "X","Y","Z";
      c2 = catx("_", "ABC", _N_);
      number = rand('integer', 0, 42);
      output;
    end;
  end;
run;

Pre-requisites:

/* create temporary table expanding ranges */
data table1A;
  set table1 end=end;
  do Number = bottom_range to top_range;
    output;
  end;
run;

/* sort table with ranges */
proc sort data = Table1A;
  by Number bottom_range top_range;
run;

/* ... and collect metadata */
data _null_;
  do point = 1,nobs;
    set Table1A nobs=nobs point=point;

    if point = 1 then call symputX("nobsB", bottom_range, "G");
    if point = nobs then call symputX("nobsT", top_range, "G");
  end;

  call symputX("nobs", nobs, "G");
  stop;
run;

Execution:

data want;
  /* create and populate an arrays for data */
  /* if you need only "description" variable delete T and B */
  do until (eof1);
    set Table1A end = eof1 curobs=curobs;
    by number;

    /* tables for ranges of repetitions, see Table1A */
    array n [&nobsB.:&nobsT.]  _temporary_;
    array o [&nobsB.:&nobsT.]  _temporary_;
    if first.number then n[number] = curobs;
    if last.number then o[number] = curobs;

    /* populate data */
    array t [&nobs.]  _temporary_;
    array b [&nobs.]  _temporary_;
    array d [&nobs.] $ _temporary_;
    b [curobs] = bottom_range;
    t [curobs] = top_range;
    d [curobs] = description;
  end;

  do until (eof2);
    set Table2 end = eof2;

    /* do the array probbing */
    /* if "hit" AND not missing (not existing range) then ... */ 
    if &nobsB. <= number <= &nobsT. and o[number] then
      /* ... point to exact data cells */
      do _N_ = n[number] to o[number];
        bottom_range = b [_N_];
        top_range    = t [_N_];
        description  = d [_N_];
        output;
      end;
    else 
      do;
        /* otherwise keep missing */
        call missing(bottom_range, top_range, description);
        output;
      end;
  end;

stop;
run;

For 3 millions obs and ranges between 0 and 40, log was:

NOTE: There were 39 observations read from the data set WORK.TABLE1A.
NOTE: There were 3000000 observations read from the data set WORK.TABLE2.
NOTE: The data set WORK.WANT has 3627859 observations and 6 variables.
NOTE: DATA statement used (Total process time):
      real time           0.81 seconds
      user cpu time       0.34 seconds
      system cpu time     0.46 seconds
      memory              896.06k
      OS Memory           19448.00k

All the best

Bart

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation

Sajid01 · Posted 04-10-2022 09:11 AM

Hello @Micha
As per your post one of your datasets is over 20 million rows.
In case this is a table in a RDBMS database, then consider executing the query in the database itself.
You can pass the smaller table to the database or rather have it as a database table updating the data as needed.

Register Today!

SAS Training: Just a Click Away