Solved: A HASH alternative for a lengthy Proc Sql Exclusive join?

Zeus_Olympous · Posted 11-30-2017 10:05 AM

Dear all,

I have two tables (please see attached a portion since "sample" contains millions of observations)

data source;
retain string "abcde";
cpevent=2;ptm=22;output;
cpevent=3;ptm=33;output;
cpevent=4;ptm=44;output;
cpevent=0;ptm=11;output;
cpevent=1;ptm=11;output;
cpevent=1;ptm=15;output;
run;

data sample;
cpevent=3;ptm=33;output;
cpevent=1;ptm=11;output;
run;

I want the observations in "source" that are not matched with the observations in the "sample".

I used Proc sql as shown:

proc sql;

create table not_exist_in_sample as
select * from source as a
where not exists (select 1 from sample as b where a.cpevent=b.cpevent and a.ptm=b.ptm);

quit;

but the problem is taking ages to conclude the run.

I understand that HASH object code would do it way faster.

Any HASH hint/tip would be more than welcome.

Thank you.

Astounding · Posted 11-30-2017 11:05 AM

OK, working with data sets sorted BY CPEVENT PTM:

data not_found_in_sample;

set sample (in=in_sample) source;

by cpevent ptm;

if first.ptm then delete_flag = in_sample;

retain delete_flag;

if delete_flag then delete;

drop delete_flag;

run;

It might be helpful for all of us if you could post the results ... how much time did SQL take vs. how much time did the DATA step take.

View solution in original post

RW9 · Posted 11-30-2017 10:15 AM

Should be pretty simple, what have you tried?

Some google responses to get you started:

https://www.sas.com/content/dam/SAS/en_ca/User%20Group%20Presentations/Regina-User-Group/ShaunKaufma...

http://www2.sas.com/proceedings/forum2008/029-2008.pdf

http://www2.sas.com/proceedings/sugi31/244-31.pdf

Astounding · Posted 11-30-2017 10:42 AM

There are much faster methods, but let's explore some possibilities. Are you data sets small enough that you can sort them in a reasonably short amount of time?

Zeus_Olympous · Posted 11-30-2017 10:59 AM

Dear Astounding,

Thank you ..

Yes, I would guess so..

Sorting them would not take that much of a time.

Astounding · Posted 11-30-2017 11:05 AM

OK, working with data sets sorted BY CPEVENT PTM:

data not_found_in_sample;

set sample (in=in_sample) source;

by cpevent ptm;

if first.ptm then delete_flag = in_sample;

retain delete_flag;

if delete_flag then delete;

drop delete_flag;

run;

It might be helpful for all of us if you could post the results ... how much time did SQL take vs. how much time did the DATA step take.

Zeus_Olympous · Posted 11-30-2017 12:35 PM

Dear Astounding,

Thank you very much.

Your code indicates a clean Victory of SAS data step VS. Proc Sql

Results:

Data Step

87 proc sort data=sample;

88 by var1 var2;

89 run;

NOTE: There were 17770 observations read from the data set WORK.sample.

NOTE: The data set WORK.sample has 17770 observations and 87 variables.

NOTE: PROCEDURE SORT used (Total process time):

real time 1.75 seconds

cpu time 1.60 seconds

90

91 proc sort data=source;

92 by var1 var2;

3 The SAS System 18:20 Thursday, November 30, 2017

93 run;

NOTE: There were 2185269 observations read from the data set WORK.source.

NOTE: The data set WORK.source has 2185269 observations and 29 variables.

NOTE: PROCEDURE SORT used (Total process time):

real time 11.77 seconds

cpu time 16.22 seconds

94

95 data not_found_in_sample;

96 set source(in=in_sample) sample;

97 by var1 var2;

98

99 if first.var2_D then

100 delete_flag = in_sample;

101 retain delete_flag;

102

103 if delete_flag then

104 delete;

105 drop delete_flag;

106 run;

WARNING: Multiple lengths were specified for the variable Sales_Channel by input data set(s). This can cause truncation of data.

NOTE: There were 2185269 observations read from the data set WORK.source.

NOTE: There were 17770 observations read from the data set WORK.sample.

NOTE: The data set WORK.NOT_FOUND_IN_SAMPLE has 145 observations and 105 variables.

NOTE: DATA statement used (Total process time):

real time 7.93 seconds

cpu time 7.76 seconds

Proc Sql

Unknown time!!!!!

After one hour I just stopped the PROC SQL processing!!!!!

Astounding · Posted 11-30-2017 01:32 PM

Great. Also note a couple of issues.

It looks like SALES_CHANNEL is in both data sets, but with different lengths. Drop it from the data set that doesn't need it.

More importantly, you switched the order of the data sets on the SET statement. I know you were editing the log that you posted (I can tell by looking at the BY statement vs. IF FIRST.VAR2_D). But if the order was actually switched, that changes the results. Compare your SET statement with the original one that I posted. The SAMPLE data set has to come first, and the IN= option has to be applied to the SAMPLE data set. So you might be OK but it depends on what you actually ran vs. what you posted as the log.

Zeus_Olympous · Posted 11-30-2017 04:08 PM

You are right.

Thank you for the hint.

I tried both ways.

Actually the first one suits my goal.

Thanks again.

Kurt_Bremser · Posted 11-30-2017 01:36 PM

See my Maxim 10. SAS SQL has a tendency to overload the I/O subsystem when doing joins of physically large tables.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

ballardw · Posted 11-30-2017 05:31 PM

You may be interested in a different exclusion approach in SQL.

proc sql;
  create table not_exist_in_sample  as
  select cpevent, ptm from source
  except
  select cpevent, ptm from sample
  ;
quit;

The Except operation requires the same variables on both select statements, so you would need to join the result back to the source data set to bring in any additional variables.

I won't claim this is as fast as the data set but I suspect that the subquery in the exists function is a lot of extra overhead.

Ksharp · Posted 12-01-2017 08:53 AM

It is indeed very convenient for Hash Table.


data source;
  retain string "abcde";
  cpevent=2;ptm=22;output;
  cpevent=3;ptm=33;output;
  cpevent=4;ptm=44;output;
  cpevent=0;ptm=11;output;
  cpevent=1;ptm=11;output;
  cpevent=1;ptm=15;output;
run;
 
data sample;
  cpevent=3;ptm=33;output;
  cpevent=1;ptm=11;output;
run;

data want;
 if _n_=1 then do;
  if 0 then set sample;
  declare hash h(dataset:'sample',hashexp:20);
  h.definekey('cpevent','ptm');
  h.definedone();
 end;
set source;
if h.check() ne 0;
run;

Zeus_Olympous · Posted 12-06-2017 06:22 AM

I really thank you all for your help.

Ksharp's solution with Hash table is very insightful as well.

A HASH alternative for a lengthy Proc Sql Exclusive join?

Re: A HASH alternative for a lengthy Proc Sql Exclusive join?

Re: A HASH alternative for a lengthy Proc Sql Exclusive join?

Re: A HASH alternative for a lengthy Proc Sql Exclusive join?

Re: A HASH alternative for a lengthy Proc Sql Exclusive join?

Re: A HASH alternative for a lengthy Proc Sql Exclusive join?

Re: A HASH alternative for a lengthy Proc Sql Exclusive join?

Re: A HASH alternative for a lengthy Proc Sql Exclusive join?

Re: A HASH alternative for a lengthy Proc Sql Exclusive join?

Re: A HASH alternative for a lengthy Proc Sql Exclusive join?

Re: A HASH alternative for a lengthy Proc Sql Exclusive join?

Re: A HASH alternative for a lengthy Proc Sql Exclusive join?

Re: A HASH alternative for a lengthy Proc Sql Exclusive join?

Registration is open

SAS Training: Just a Click Away