ChuckC
Calcite | Level 5

 

One-to-many match of a very small input file (&sfile._&dbase) to a very large file (sdCumV.VIN_VEH_OPTNS_&dbase)

 

In one particular example the small input has 2 unique records, and the very large file has 419M records (where each input record will typically find ~200 matches). 

 

This particular example ran for 44 minutes. Obviously, we can do better.


What I'm not sure of is how best to utilize SAS for such a task.  Any ideas would be welcome.

 

* ------------------------ ;
* one to MANY matches ;
* ------------------------ ;

* match to VIN_VEH_OPTNS file to get OPTN_CD ;
proc sql ;
   create table squish_optns_&dbase as
   select *
   from &sfile._&dbase as G
        left join sdCumV.VIN_VEH_OPTNS_&dbase as S
        on G.VIN = S.GMC_VEH_IDENT_NBR ;
quit ;

Reeza
Super User
Instead of a join, how about a filter?

select * from bigTable where ID in (select distinct ID from smallTable);

And consider adding an index to your big table, which will speed up that query a lot.
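
For reference, a minimal sketch of creating such an index, using the table and key names from the original post:

proc datasets library=sdCumV nolist;
   modify VIN_VEH_OPTNS_&dbase;
   index create GMC_VEH_IDENT_NBR;   * simple index on the join key ;
quit;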
ChrisNZ
Tourmaline | Level 20

A dynamic filtering clause does not use indexes.

 

data B(index=(I) compress=no) S(compress=no);
  do I=1 to 2e6;
    do J=1 to 20;
      output B;
    end;
    if I in (333333, 999999) then output S;
  end;
run;

proc sql;     * 0.3 seconds ;
  select B.* from S left join B on B.I=S.I;
quit;

proc sql;     * 8.6 seconds ;
  select B.* from B where I in (select I from S);
quit;

proc sql;     * 0.3 seconds ;
  select distinct I into :values separated by ',' from S;
  select B.* from B where I in (&values);
quit;

 

 

ChrisNZ
Tourmaline | Level 20

I second @Reeza 's recommendation. For such a low percentage of the table being retrieved, an index is the way to go.

AMSAS
SAS Super FREQ

There is an alternative approach, using a custom format and the PUTC function. Here's a simple example:

data large ;
	do obs=1 to 1000 ;
		match=substr("ABC",int(ranuni(0)*3)+1,1) ;
		output ;
	end ;
run ;

data small ;
	fmtname="$myFmt" ;
	do start="A", "B","C" ;
		label=repeat(start,3) ;
		output ;
	end ;
run ;
	
proc format cntlin=small ;
run ;

data join ;
	set large ;
	matchvalue=putc(match,"$myFmt.") ;
run ;
ChrisNZ
Tourmaline | Level 20

Recreating a large table when just a few hundred records are modified is not the most efficient approach.

Updating using an index is probably the best way.
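
A minimal sketch of such an index-driven in-place update; MASTER, TRANS, KEY, PRICE and NEW_PRICE are illustrative names (not from this thread), and MASTER is assumed to be indexed on KEY:

data master;
   set trans;                      * read each transaction record ;
   modify master key=key;          * locate the matching master row via the index ;
   if _iorc_ = 0 then do;
      price = new_price;           * carry the transaction value into the master variable ;
      replace;                     * rewrite the master row in place ;
   end;
   else _error_ = 0;               * clear the error flag for non-matching keys ;
run;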

ScottBass
Rhodochrosite | Level 12

Here's some example code.  See the examples in the doc for the SET statement.

 

data small;
   do x=1 to 1E4;
      key=int(ranuni(0)*1E4);
      output;
   end;
   drop x;
run;

data large (index=(key));
   do x=1 to 1E8;
      key=int(ranuni(0)*1E8);
      * other columns go here ;
      foo=x;
      output;
   end;
   drop x;
run;

* first match only (one-to-one matching);
data want1;
   if 0 then set large;
   call missing(foo);  * implied retain on data set variables ;
   set small;
   set large key=key;
   _error_=0;           * SAS treats non-matches as an error ;
   * if _iorc_=0;       * to keep matches only ;
run;

* multiple matches (one-to-many matching) ;
data want2;
   if 0 then set large;
   call missing(foo);  * implied retain on data set variables ;
   set small;
   do until (_iorc_ ne 0);
      set large key=key;
      if _iorc_=0 then output;
   end;
   _error_=0;           * SAS treats non-matches as an error ;
run;

Please post your question as a self-contained data step in the form of "have" (source) and "want" (desired results).
I won't contribute to your post if I can't cut-and-paste your syntactically correct code into SAS.
ChuckC
Calcite | Level 5

Good morning all,
Wanted to take a quick minute and post a word of thanks for all of the EXCELLENT ideas and insights shared.

 

In the end, what I found was that moving from the index to COMPRESS and adding a SELECT DISTINCT provided, in my case, the best result (see the snip below).

 

This change miraculously reduced the cycle time from 40+ minutes to less than 4 minutes!

 

HOWEVER, when the full production job runs, which applies this code over a series of datasets, it takes much longer: in production it runs 16-20 minutes per cycle. That's still a huge improvement over the original, but it leaves me believing I have a memory issue to address next.

 

Thanks again all. Until next time 🙂

Chuck

 

* ------------------------ ;
* one to MANY matches ;
* ------------------------ ;

/* * ORIGINAL WAY - match to VIN_VEH_OPTNS file to get OPTN_CD ;
proc sql ;
create table TEST_Selectto2017 as
select * from
sdresp.grabvin_o2017 as G left join sdCumV.VIN_VEH_OPTNS_o2017 as S on G.VIN = S.GMC_VEH_IDENT_NBR ;
quit ;
*/
/* * BY WAY OF SELECT DISTINCT ;
proc sql ;
create table sdresp.TEST_Selecto2017 as
select * from sdCumV.VIN_VEH_OPTNS_o2017
where GMC_VEH_IDENT_NBR in (select distinct VIN from sdresp.grabvin_o2017) ;
quit ;
*/

 

LinusH
Tourmaline | Level 20

Set the FULLSTIMER option to monitor resource consumption.

MSGLEVEL=I will give information about sorting and index usage.

PROC SQL _method; will show the SQL planner's evaluation.

 

It could be a good idea to monitor the overall server resource consumption at the same time.

 

Depending on the results from these inputs, adjust the MEMSIZE and SORTSIZE system options.

If the SQL planner triggers a hash join, you might want to increase the BUFFERSIZE (specified as a PROC SQL option).
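
A sketch of that monitoring setup, reusing the table names from this thread (the BUFFERSIZE value is only an example):

options fullstimer msglevel=i;            * resource stats plus index/sort notes in the log ;

proc sql _method buffersize=1048576;      * _method prints the chosen query plan ;
   create table sdresp.TEST_Selecto2017 as
   select * from sdCumV.VIN_VEH_OPTNS_o2017
   where GMC_VEH_IDENT_NBR in (select distinct VIN from sdresp.grabvin_o2017);
quit;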

Data never sleeps
Oligolas
Barite | Level 11

Hi,

 

Fine by me, but do you realize that the results you get from your two SQL snippets are completely different?

 

In the first query you get all records from your table sdresp.grabvin_o2017, with all columns from both tables,

but in the second query you get only the matching records, and only the columns from sdCumV.VIN_VEH_OPTNS_o2017.

 

If this is what you want and performance is what you seek, consider using an INNER JOIN combined with DISTINCT on indexed tables:

PROC SQL;
   CREATE TABLE sdresp.TEST_Selecto2017 AS
      SELECT DISTINCT S.*
      FROM sdCumV.VIN_VEH_OPTNS_o2017 S
      INNER JOIN sdresp.grabvin_o2017 G
      ON S.GMC_VEH_IDENT_NBR EQ G.VIN
   ;
QUIT;

 

 

________________________

- Cheers -

Astounding
PROC Star

Do you have some control over the production process?  

 

You may be able to combine the data sets in your "series of data sets" (possibly adding an identifier to indicate the source of each observation).  Then run this combined data set against your large data set.  That way you only need to hit the large data set once.  You're left with the task of slicing and dicing many small data sets afterwards, but the CPU time it takes to do that ought to be small.
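
A minimal sketch of that combine-then-join idea; SMALL_DS1-SMALL_DS3 and ALL_SMALL are illustrative names, and INDSNAME= captures which data set each observation came from:

data all_small;
   length source $41;
   set small_ds1 small_ds2 small_ds3 indsname=src;   * the "series of data sets" ;
   source = src;                                     * remember which data set each VIN came from ;
run;

proc sql;
   create table all_matches as
   select S.*, G.source
   from sdCumV.VIN_VEH_OPTNS_&dbase as S
        inner join all_small as G
        on S.GMC_VEH_IDENT_NBR = G.VIN;
quit;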

 

.......................................................

 

Another idea along similar lines but with fewer changes to the production process:

 

1. Add a step to find all the required records based on the "series of data sets".

2. Extract those records from the large data set.

3. Continue with the current production process, but using the extract instead of the original large data set.
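
A sketch of those three steps; ALL_VINS, OPTNS_EXTRACT and the small data set names are illustrative:

* 1. collect every VIN needed across the series of small data sets ;
proc sql;
   create table all_vins as
   select distinct VIN from small_ds1
   union
   select distinct VIN from small_ds2;
quit;

* 2. pull the matching rows from the large table once ;
proc sql;
   create table optns_extract as
   select * from sdCumV.VIN_VEH_OPTNS_&dbase
   where GMC_VEH_IDENT_NBR in (select VIN from all_vins);
quit;

* 3. the existing production code then reads OPTNS_EXTRACT instead of the full table ;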

ChrisNZ
Tourmaline | Level 20

If you decide to use indexes, just to let you know that you can load the small indexed table in memory using the SASFILE statement.
Indexes are loaded too and the random reads are much faster.
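
A minimal sketch of the SASFILE approach, assuming the indexed lookup table fits in memory (MYLIB.LOOKUP, SMALL and the VIN index name are illustrative):

sasfile mylib.lookup load;          * pull the table and its indexes into memory ;

data want;
   set small;
   set mylib.lookup key=vin;        * random reads now hit memory rather than disk ;
   if _iorc_ ne 0 then do;
      _error_ = 0;                  * non-matches are not errors here ;
      delete;
   end;
run;

sasfile mylib.lookup close;         * release the memory ;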

ScottBass
Rhodochrosite | Level 12

@ChrisNZ wrote:

If you decide to use indexes, just to let you know that you can load the small indexed table in memory using the SASFILE statement.
Indexes are loaded too and the random reads are much faster.


 

I like to think of indexed datasets as allowing random access based on the index, but with the data remaining on disk.  So each read requires disk I/O.  In that respect, they are analogous to hash objects, but without loading the data into memory.

 

But, I don't know the internals of SAS indexes vs. the hash object "index" (search algorithm).  I don't know if they are similar, or completely different.

 

I find if my source table is small, but lookup table is huge, I can get better elapsed time by using index key lookup (esp. if the index already exists, say via overnight ETL processing).  This approach saves the overhead of loading a hash object for relatively few lookups.

 

However, if my source table is huge, and the lookup table is small (or huge, but still fits in memory), then I can get better elapsed time by "suffering" the overhead of loading the hash object, but then having "blinding speed" for the lookups, where the lookups span the majority of the keys in the lookup table.
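
A minimal sketch of that hash lookup (LOOKUP, BIG_SOURCE and KEY are illustrative names); the small table is loaded once and the huge table is read sequentially:

data want;
   if 0 then set lookup;                 * put the lookup variables in the PDV ;
   if _n_ = 1 then do;
      declare hash h(dataset:'lookup');  * load the small lookup table into memory ;
      h.defineKey('key');
      h.defineData(all:'yes');
      h.defineDone();
   end;
   set big_source;                       * stream through the huge source table ;
   if h.find() = 0;                      * keep only rows whose key exists in the lookup ;
run;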

 

Question:

With the caveats of the above, if the lookup table is small (and fits in memory), would it be better to use a hash object, even if the table is indexed?  Or, would the performance be similar, since as you say, the index is also read into memory, and therefore the lookups are similar to a hash object?

 

(I guess the OP can try both approaches and see which works best...)

 


Please post your question as a self-contained data step in the form of "have" (source) and "want" (desired results).
I won't contribute to your post if I can't cut-and-paste your syntactically correct code into SAS.
ChrisNZ
Tourmaline | Level 20

@ScottBass 

> if the lookup table is small (and fits in memory), would it be better to use a hash object, even if the table is indexed?  Or, would the performance be similar, since as you say, the index is also read into memory

 

You're in luck Scott. Look at pages 162+ of my book, since you had the superior wisdom to procure it. 🙂

 

You'll see this very question benchmarked together with other lookup methods. In broad strokes:

- Index is faster if few rows (say 1%) are retrieved,

- Hash gets better as a larger proportion of the rows is fetched,

- SASFILE+index sits somewhere in the middle and is best if not too many columns are needed and if more than 1% (say) of the rows are needed, due to the initial overhead of loading the data in memory.
