<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Efficient one to many join to a huge dataset in SAS Programming</title>
    <link>https://communities.sas.com/t5/SAS-Programming/Efficient-one-to-many-join-to-a-huge-dataset/m-p/555799#M154728</link>
    <description>&lt;P&gt;Recreating a large table when only a few hundred records are modified is not the most efficient approach.&lt;/P&gt;
&lt;P&gt;Updating via an index is probably the best way.&lt;/P&gt;</description>
    <pubDate>Thu, 02 May 2019 22:46:41 GMT</pubDate>
    <dc:creator>ChrisNZ</dc:creator>
    <dc:date>2019-05-02T22:46:41Z</dc:date>
    <item>
      <title>Efficient one to many join to a huge dataset</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Efficient-one-to-many-join-to-a-huge-dataset/m-p/555444#M154573</link>
      <description>&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;One to many match of a very small input file (&amp;amp;sfile._&amp;amp;dbase) to a very large file (sdCumV.VIN_VEH_OPTNS_&amp;amp;dbase)&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;In one particular example the small input has 2 unique records, and the very large file has 419M records (where each input record will typically find ~200 matches).&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;This particular example ran for 44 minutes. Obviously, we can do better.&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;What I'm not sure of is how best to utilize SAS for such a task.&amp;nbsp; Any ideas would be welcome.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;* ------------------------ ;&lt;BR /&gt;* one to MANY matches ;&lt;BR /&gt;* ------------------------ ;&lt;/P&gt;&lt;P&gt;* match to VIN_VEH_OPTNS file to get OPTN_CD ;&lt;BR /&gt;proc sql ;&lt;BR /&gt;create table squish_optns_&amp;amp;dbase as&lt;BR /&gt;select * from&lt;BR /&gt;&amp;amp;sfile._&amp;amp;dbase as G left join sdCumV.VIN_VEH_OPTNS_&amp;amp;dbase as S on G.VIN = S.GMC_VEH_IDENT_NBR ;&lt;BR /&gt;quit ;&lt;/P&gt;</description>
      <pubDate>Wed, 01 May 2019 18:47:41 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Efficient-one-to-many-join-to-a-huge-dataset/m-p/555444#M154573</guid>
      <dc:creator>ChuckC</dc:creator>
      <dc:date>2019-05-01T18:47:41Z</dc:date>
    </item>
    <item>
      <title>Re: Efficient one to many join to a huge dataset</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Efficient-one-to-many-join-to-a-huge-dataset/m-p/555445#M154574</link>
      <description>Instead of a join, how about a filter?&lt;BR /&gt;&lt;BR /&gt;select * from bigTable where ID in (select distinct ID from smallTable);&lt;BR /&gt;&lt;BR /&gt;And consider adding an index to your big table, which will speed up that query a lot.</description>
      <pubDate>Wed, 01 May 2019 18:52:57 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Efficient-one-to-many-join-to-a-huge-dataset/m-p/555445#M154574</guid>
      <dc:creator>Reeza</dc:creator>
      <dc:date>2019-05-01T18:52:57Z</dc:date>
    </item>
    <item>
      <title>Re: Efficient one to many join to a huge dataset</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Efficient-one-to-many-join-to-a-huge-dataset/m-p/555492#M154587</link>
      <description>&lt;P&gt;I second&amp;nbsp;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/13879"&gt;@Reeza&lt;/a&gt;&amp;nbsp;'s recommendation. For such a low percentage of the table being retrieved, an index is the way to go.&lt;/P&gt;</description>
      <pubDate>Wed, 01 May 2019 21:24:25 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Efficient-one-to-many-join-to-a-huge-dataset/m-p/555492#M154587</guid>
      <dc:creator>ChrisNZ</dc:creator>
      <dc:date>2019-05-01T21:24:25Z</dc:date>
    </item>
    <item>
      <title>Re: Efficient one to many join to a huge dataset</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Efficient-one-to-many-join-to-a-huge-dataset/m-p/555508#M154600</link>
      <description>&lt;P&gt;A dynamic filtering clause does not use indexes.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data B(index=(I) compress=no) S(compress=no);
  do I=1 to 2e6;
    do J=1 to 20;
      output B;
    end;
    if I in (333333,999999) then output S;
  end;
run;

proc sql;     * 0.3 seconds ;
  select B.* from S left join B on B.I=S.I;
quit;

proc sql;     * 8.6 seconds ;
  select B.* from B where I in (select I from S);
quit;

proc sql;     * 0.3 seconds ;
  select distinct I into :values separated by ',' from S;
  select B.* from B where I in (&amp;amp;values);
quit;
&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Wed, 01 May 2019 23:10:03 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Efficient-one-to-many-join-to-a-huge-dataset/m-p/555508#M154600</guid>
      <dc:creator>ChrisNZ</dc:creator>
      <dc:date>2019-05-01T23:10:03Z</dc:date>
    </item>
    <item>
      <title>Re: Efficient one to many join to a huge dataset</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Efficient-one-to-many-join-to-a-huge-dataset/m-p/555646#M154640</link>
      <description>&lt;P&gt;There is an alternative approach, using a &lt;A href="https://go.documentation.sas.com/?cdcId=pgmsascdc&amp;amp;cdcVersion=9.4_3.4&amp;amp;docsetId=proc&amp;amp;docsetTarget=p1xidhqypi0fnwn1if8opjpqpbmn.htm&amp;amp;locale=en" target="_self"&gt;custom format&lt;/A&gt; &amp;amp; then &lt;A href="https://go.documentation.sas.com/?cdcId=pgmsascdc&amp;amp;cdcVersion=9.4_3.4&amp;amp;docsetId=lefunctionsref&amp;amp;docsetTarget=n1en5ed71v1ai3n1pxwx2rskgz7g.htm&amp;amp;locale=en" target="_self"&gt;PUTC function.&lt;/A&gt;&amp;nbsp;Here's a simple example:&lt;BR /&gt;&lt;BR /&gt;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data large ;
	do obs=1 to 1000 ;
		match=substr("ABC",int(ranuni(0)*3)+1,1) ;
		output ;
	end ;
run ;

data small ;
	fmtname="$myFmt" ;
	do start="A", "B","C" ;
		label=repeat(start,3) ;
		output ;
	end ;
run ;
	
proc format cntlin=small ;
run ;

data join ;
	set large ;
	matchvalue=putc(match,"$myFmt.") ;
run ;&lt;/CODE&gt;&lt;/PRE&gt;</description>
      <pubDate>Thu, 02 May 2019 16:02:39 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Efficient-one-to-many-join-to-a-huge-dataset/m-p/555646#M154640</guid>
      <dc:creator>AMSAS</dc:creator>
      <dc:date>2019-05-02T16:02:39Z</dc:date>
    </item>
    <item>
      <title>Re: Efficient one to many join to a huge dataset</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Efficient-one-to-many-join-to-a-huge-dataset/m-p/555799#M154728</link>
      <description>&lt;P&gt;Recreating a large table when only a few hundred records are modified is not the most efficient approach.&lt;/P&gt;
&lt;P&gt;Updating via an index is probably the best way.&lt;/P&gt;</description>
      <pubDate>Thu, 02 May 2019 22:46:41 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Efficient-one-to-many-join-to-a-huge-dataset/m-p/555799#M154728</guid>
      <dc:creator>ChrisNZ</dc:creator>
      <dc:date>2019-05-02T22:46:41Z</dc:date>
    </item>
    <item>
      <title>Re: Efficient one to many join to a huge dataset</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Efficient-one-to-many-join-to-a-huge-dataset/m-p/555837#M154738</link>
      <description>&lt;P&gt;Here's some example code.&amp;nbsp; See the examples in the doc for the SET statement.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data small;
   do x=1 to 1E4;
      key=int(ranuni(0)*1E4);
      output;
   end;
   drop x;
run;

data large (index=(key));
   do x=1 to 1E8;
      key=int(ranuni(0)*1E8);
      * other columns go here ;
      foo=x;
      output;
   end;
   drop x;
run;

* first match only (one-to-one matching);
data want1;
   if 0 then set large;
   call missing(foo);  * implied retain on data set variables ;
   set small;
   set large key=key;
   _error_=0;           * SAS treats non-matches as an error ;
   * if _iorc_=0;       * to keep matches only ;
run;

* multiple matches (one-to-many matching) ;
data want2;
   if 0 then set large;
   call missing(foo);  * implied retain on data set variables ;
   set small;
   do until (_iorc_ ne 0);
      set large key=key;
      if _iorc_=0 then output;
   end;
   _error_=0;           * SAS treats non-matches as an error ;
run;&lt;/CODE&gt;&lt;/PRE&gt;</description>
      <pubDate>Fri, 03 May 2019 04:39:04 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Efficient-one-to-many-join-to-a-huge-dataset/m-p/555837#M154738</guid>
      <dc:creator>ScottBass</dc:creator>
      <dc:date>2019-05-03T04:39:04Z</dc:date>
    </item>
    <item>
      <title>Re: Efficient one to many join to a huge dataset</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Efficient-one-to-many-join-to-a-huge-dataset/m-p/557477#M155413</link>
      <description>&lt;P&gt;Good morning all,&lt;BR /&gt;Wanted to take a quick minute and post a word of thanks for all of the EXCELLENT ideas and insights shared.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;In the end, what I found was that moving from the use of Index to COMPRESS and adding a select Distinct, in my case, provided the best result (see snip below).&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;This change miraculously reduced the cycle time from 40+ minutes to less than 4 minutes!&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;HOWEVER, when the full production job runs, which runs this code over a series of datasets, it takes much longer. In production it runs 16-20 mins per cycle. So still a huge improvement over the original, but it leaves me believing I have a memory issue to address next.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Thanks again all. Until next time &lt;span class="lia-unicode-emoji" title=":slightly_smiling_face:"&gt;🙂&lt;/span&gt;&lt;/P&gt;&lt;P&gt;Chuck&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;* ------------------------ ;&lt;BR /&gt;* one to MANY matches ;&lt;BR /&gt;* ------------------------ ;&lt;/P&gt;&lt;P&gt;/* * ORIGINAL WAY - match to VIN_VEH_OPTNS file to get OPTN_CD ;&lt;BR /&gt;proc sql ;&lt;BR /&gt;create table TEST_Selectto2017 as&lt;BR /&gt;select * from&lt;BR /&gt;sdresp.grabvin_o2017 as G left join sdCumV.VIN_VEH_OPTNS_o2017 as S on G.VIN = S.GMC_VEH_IDENT_NBR ;&lt;BR /&gt;quit ;&lt;BR /&gt;*/&lt;BR /&gt;/* * BY WAY OF SELECT DISTINCT ;&lt;BR /&gt;proc sql ;&lt;BR /&gt;create table sdresp.TEST_Selecto2017 as&lt;BR /&gt;select * from sdCumV.VIN_VEH_OPTNS_o2017&lt;BR /&gt;where GMC_VEH_IDENT_NBR in (select distinct VIN from sdresp.grabvin_o2017) ;&lt;BR /&gt;quit ;&lt;BR /&gt;*/&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Thu, 09 May 2019 15:10:38 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Efficient-one-to-many-join-to-a-huge-dataset/m-p/557477#M155413</guid>
      <dc:creator>ChuckC</dc:creator>
      <dc:date>2019-05-09T15:10:38Z</dc:date>
    </item>
    <item>
      <title>Re: Efficient one to many join to a huge dataset</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Efficient-one-to-many-join-to-a-huge-dataset/m-p/557716#M155526</link>
      <description>&lt;P&gt;Set the FULLSTIMER option to monitor resource consumption.&lt;/P&gt;
&lt;P&gt;MSGLEVEL=I will give information about sorting and index usage.&lt;/P&gt;
&lt;P&gt;PROC SQL _method; will give the SQL planner evaluation.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;It could be a good idea to monitor the overall server resource consumption at the same time.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Depending on the result from these inputs, adjust MEMSIZE and SORTSIZE global options.&lt;/P&gt;
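&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;A minimal sketch of these settings together (table and column names are placeholders, and the memory values are examples to adjust for your site):&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;options fullstimer msglevel=i memsize=4G sortsize=1G;

proc sql _method;   * _method prints the chosen join plan in the log ;
  create table matches as
  select b.* from big b inner join small s on b.key=s.key;
quit;&lt;/CODE&gt;&lt;/PRE&gt;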
&lt;P&gt;Potentially, if the SQL planner triggers a hash join, you might want to increase the BUFFERSIZE (specified as a PROC SQL option).&lt;/P&gt;</description>
      <pubDate>Fri, 10 May 2019 09:28:44 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Efficient-one-to-many-join-to-a-huge-dataset/m-p/557716#M155526</guid>
      <dc:creator>LinusH</dc:creator>
      <dc:date>2019-05-10T09:28:44Z</dc:date>
    </item>
    <item>
      <title>Re: Efficient one to many join to a huge dataset</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Efficient-one-to-many-join-to-a-huge-dataset/m-p/557736#M155531</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Fine by me, but do you realize that the results you get from your two SQL snippets are completely different?&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;In the first snippet you get all records from sdresp.grabvin_o2017, with all columns from both tables,&lt;/P&gt;&lt;P&gt;but in the second you get only the matching records, with the columns from sdCumV.VIN_VEH_OPTNS_o2017 alone.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;If this is what you want and performance is what you seek, consider an INNER JOIN combined with DISTINCT on indexed tables:&lt;/P&gt;&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;PROC SQL;
   CREATE TABLE sdresp.TEST_Selecto2017 AS
      SELECT DISTINCT S.*
      FROM sdCumV.VIN_VEH_OPTNS_o2017 S
      INNER JOIN sdresp.grabvin_o2017 G
      ON S.GMC_VEH_IDENT_NBR EQ G.VIN
   ;
QUIT;&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Fri, 10 May 2019 11:27:10 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Efficient-one-to-many-join-to-a-huge-dataset/m-p/557736#M155531</guid>
      <dc:creator>Oligolas</dc:creator>
      <dc:date>2019-05-10T11:27:10Z</dc:date>
    </item>
    <item>
      <title>Re: Efficient one to many join to a huge dataset</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Efficient-one-to-many-join-to-a-huge-dataset/m-p/557752#M155537</link>
      <description>&lt;P&gt;Do you have some control over the production process?&amp;nbsp;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;You may be able to combine the data sets in your "series of data sets" (possibly adding an identifier to indicate the source of the observation).&amp;nbsp; Then run with this combined data set against your large data set.&amp;nbsp; That way you only need to hit the large data set once.&amp;nbsp; You're left with the task of slicing and dicing many small data sets afterwards, but the CPU time it takes to do that ought to be small.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;.......................................................&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Another idea along similar lines but with fewer changes to the production process:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;1. Add a step to find all the required records based on the "series of data sets".&lt;/P&gt;
&lt;P&gt;2. Extract those records from the large data set.&lt;/P&gt;
&lt;P&gt;3. Continue with the current production process, but using the extract instead of the original large data set.&lt;/P&gt;</description>
      <pubDate>Fri, 10 May 2019 13:20:18 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Efficient-one-to-many-join-to-a-huge-dataset/m-p/557752#M155537</guid>
      <dc:creator>Astounding</dc:creator>
      <dc:date>2019-05-10T13:20:18Z</dc:date>
    </item>
    <item>
      <title>Re: Efficient one to many join to a huge dataset</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Efficient-one-to-many-join-to-a-huge-dataset/m-p/558160#M155745</link>
      <description>&lt;P&gt;If you decide to use indexes, just to let you know that you can load the small indexed table in memory using the SASFILE statement.&lt;BR /&gt;Indexes are loaded too and the random reads are &lt;U&gt;much&lt;/U&gt; faster.&lt;/P&gt;</description>
      <pubDate>Sun, 12 May 2019 21:41:13 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Efficient-one-to-many-join-to-a-huge-dataset/m-p/558160#M155745</guid>
      <dc:creator>ChrisNZ</dc:creator>
      <dc:date>2019-05-12T21:41:13Z</dc:date>
    </item>
    <item>
      <title>Re: Efficient one to many join to a huge dataset</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Efficient-one-to-many-join-to-a-huge-dataset/m-p/558166#M155750</link>
      <description>&lt;BLOCKQUOTE&gt;&lt;HR /&gt;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/16961"&gt;@ChrisNZ&lt;/a&gt;&amp;nbsp;wrote:&lt;BR /&gt;
&lt;P&gt;If you decide to use indexes, just to let you know that you can load the small indexed table in memory using the SASFILE statement.&lt;BR /&gt;Indexes are loaded too and the random reads are &lt;U&gt;much&lt;/U&gt; faster.&lt;/P&gt;
&lt;HR /&gt;&lt;/BLOCKQUOTE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I like to think of indexed datasets as allowing random access based on the index, but with the data remaining on disk.&amp;nbsp; So each read requires disk I/O.&amp;nbsp; In that respect, they are analogous to hash objects, but without loading the data into memory.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;But, I don't know the internals of SAS indexes vs. the hash object "index" (search algorithm).&amp;nbsp; I don't know if they are similar, or completely different.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I find if my source table is small, but lookup table is huge, I can get better elapsed time by using index key lookup (esp. if the index already exists, say via overnight ETL processing).&amp;nbsp; This approach saves the overhead of loading a hash object for relatively few lookups.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;However, if my source table is huge, and the lookup table is small (or huge, but still fits in memory), then I can get better elapsed time by "suffering" the overhead of loading the hash object, but then having "blinding speed" for the lookups, where the lookups span the majority of the keys in the lookup table.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Question:&lt;/P&gt;
&lt;P&gt;With the caveats of the above, if the lookup table is small (and fits in memory), would it be better to use a hash object, even if the table is indexed?&amp;nbsp; Or, would the performance be similar, since as you say, the index is also read into memory, and therefore the lookups are similar to a hash object?&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;(I guess the OP can try both approaches and see which works best...)&lt;/P&gt;
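&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;For reference, a sketch of the hash-object variant of the same lookup (SMALL and LARGE with a KEY column, mirroring the indexed examples above):&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data want3;
  if _n_=1 then do;
    declare hash h(dataset:"small");  * load the small lookup table into memory once ;
    h.defineKey("key");
    h.defineDone();
  end;
  set large;
  if h.check()=0;   * keep rows whose KEY also exists in SMALL ;
run;&lt;/CODE&gt;&lt;/PRE&gt;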
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Mon, 13 May 2019 01:03:03 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Efficient-one-to-many-join-to-a-huge-dataset/m-p/558166#M155750</guid>
      <dc:creator>ScottBass</dc:creator>
      <dc:date>2019-05-13T01:03:03Z</dc:date>
    </item>
    <item>
      <title>Re: Efficient one to many join to a huge dataset</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Efficient-one-to-many-join-to-a-huge-dataset/m-p/558188#M155763</link>
      <description>&lt;P&gt;&lt;EM&gt;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/15043"&gt;@ScottBass&lt;/a&gt;&amp;nbsp;&lt;/EM&gt;&lt;/P&gt;
&lt;P&gt;&lt;EM&gt;&amp;gt;&amp;nbsp;if the lookup table is small (and fits in memory), would it be better to use a hash object, even if the table is indexed?&amp;nbsp; Or, would the performance be similar, since as you say, the index is also read into memory&lt;/EM&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;You're in luck Scott. Look at pages 162+ of my book, since you had the superior wisdom to procure it. &lt;span class="lia-unicode-emoji" title=":slightly_smiling_face:"&gt;🙂&lt;/span&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;You'll see this very question benchmarked against other lookup methods. In broad strokes:&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;- Index is faster if few rows (say 1%) are retrieved, &lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;- Hash gets better as a larger proportion of the rows is fetched, &lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;- SASFILE+index sits somewhere in the middle and is best if not too many columns are needed and if more than 1% (say) of the rows are needed, due to the initial overhead of loading the data in memory.&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 13 May 2019 05:39:34 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Efficient-one-to-many-join-to-a-huge-dataset/m-p/558188#M155763</guid>
      <dc:creator>ChrisNZ</dc:creator>
      <dc:date>2019-05-13T05:39:34Z</dc:date>
    </item>
  </channel>
</rss>