DATA Step, Macro, Functions and more

A HASH alternative for a lengthy Proc Sql Exclusive join?

Accepted Solution Solved
Reply
Contributor
Posts: 53
Accepted Solution

A HASH alternative for a lengthy Proc Sql Exclusive join?

Dear all,

 

I have two tables (please see attached a portion since "sample" contains millions of observations)

 

data source;
  retain string "abcde";
  cpevent=2;ptm=22;output;
  cpevent=3;ptm=33;output;
  cpevent=4;ptm=44;output;
  cpevent=0;ptm=11;output;
  cpevent=1;ptm=11;output;
  cpevent=1;ptm=15;output;
run;

 

data sample;
  cpevent=3;ptm=33;output;
  cpevent=1;ptm=11;output;
run;

 

I want the observations in "source" that are not matched with the observations in the "sample".

 

I used Proc sql as shown:

 

proc sql;

create table not_exist_in_sample  as
  select * from source as a
  where not exists (select 1 from sample as b where a.cpevent=b.cpevent and a.ptm=b.ptm);

quit;

 

but the problem is taking ages to conclude the run.

 

I understand that HASH object code would do it way faster.

 

Any HASH hint/tip would be more than welcome.

 

Thank you.


Accepted Solutions
Solution
‎12-06-2017 06:21 AM
Super User
Posts: 6,935

Re: A HASH alternative for a lengthy Proc Sql Exclusive join?

[ Edited ]
Posted in reply to Zeus_Olympous

OK, working with data sets sorted BY CPEVENT PTM:

 

data not_found_in_sample;

set sample (in=in_sample) source;

by cpevent ptm;

if first.ptm then delete_flag = in_sample;

retain delete_flag;

if delete_flag then delete;

drop delete_flag;

run;

 

It might be helpful for all of us if you could post the results ... how much time did SQL take vs. how much time did the DATA step take.

View solution in original post


All Replies
Super User
Super User
Posts: 9,866

Re: A HASH alternative for a lengthy Proc Sql Exclusive join?

Posted in reply to Zeus_Olympous
Super User
Posts: 6,935

Re: A HASH alternative for a lengthy Proc Sql Exclusive join?

Posted in reply to Zeus_Olympous

There are much faster methods, but let's explore some possibilities.  Are you data sets small enough that you can sort them in a reasonably short amount of time?

Contributor
Posts: 53

Re: A HASH alternative for a lengthy Proc Sql Exclusive join?

Posted in reply to Astounding

Dear Astounding,

 

Thank you ..

 

Yes, I would guess so..

 

Sorting them would not take that much of a time.

Solution
‎12-06-2017 06:21 AM
Super User
Posts: 6,935

Re: A HASH alternative for a lengthy Proc Sql Exclusive join?

[ Edited ]
Posted in reply to Zeus_Olympous

OK, working with data sets sorted BY CPEVENT PTM:

 

data not_found_in_sample;

set sample (in=in_sample) source;

by cpevent ptm;

if first.ptm then delete_flag = in_sample;

retain delete_flag;

if delete_flag then delete;

drop delete_flag;

run;

 

It might be helpful for all of us if you could post the results ... how much time did SQL take vs. how much time did the DATA step take.

Contributor
Posts: 53

Re: A HASH alternative for a lengthy Proc Sql Exclusive join?

Posted in reply to Astounding

Dear Astounding,

 

Thank you very much.

 

Your code indicates a clean Victory of SAS data step VS. Proc Sql

 

Results:

Data Step

 

87 proc sort data=sample;

88 by var1 var2;

89 run;

NOTE: There were 17770 observations read from the data set WORK.sample.

NOTE: The data set WORK.sample  has 17770 observations and 87 variables.

NOTE: PROCEDURE SORT used (Total process time):

real time 1.75 seconds

cpu time 1.60 seconds

 

90

91 proc sort data=source;

92 by  var1 var2;

3 The SAS System 18:20 Thursday, November 30, 2017

93 run;

NOTE: There were 2185269 observations read from the data set WORK.source.

NOTE: The data set WORK.source has 2185269 observations and 29 variables.

NOTE: PROCEDURE SORT used (Total process time):

real time 11.77 seconds

cpu time 16.22 seconds

 

94

95 data not_found_in_sample;

96 set source(in=in_sample) sample;

97 by var1 var2;

98

99 if first.var2_D then

100 delete_flag = in_sample;

101 retain delete_flag;

102

103 if delete_flag then

104 delete;

105 drop delete_flag;

106 run;

WARNING: Multiple lengths were specified for the variable Sales_Channel by input data set(s). This can cause truncation of data.

NOTE: There were 2185269 observations read from the data set WORK.source.

NOTE: There were 17770 observations read from the data set WORK.sample.

NOTE: The data set WORK.NOT_FOUND_IN_SAMPLE has 145 observations and 105 variables.

NOTE: DATA statement used (Total process time):

real time 7.93 seconds

cpu time 7.76 seconds

 

Proc Sql

 

Unknown time!!!!!

 

After one hour I just stopped the PROC SQL processing!!!!!

Super User
Posts: 6,935

Re: A HASH alternative for a lengthy Proc Sql Exclusive join?

Posted in reply to Zeus_Olympous

Great.  Also note a couple of issues.

 

It looks like SALES_CHANNEL is in both data sets, but with different lengths.  Drop it from the data set that doesn't need it.

 

More importantly, you switched the order of the data sets on the SET statement.  I know you were editing the log that you posted (I can tell by looking at the BY statement vs. IF FIRST.VAR2_D).  But if the order was actually switched, that changes the results.  Compare your SET statement with the original one that I posted.  The SAMPLE data set has to come first, and the IN= option has to be applied to the SAMPLE data set.  So you might be OK but it depends on what you actually ran vs. what you posted as the log.

 

 

Contributor
Posts: 53

Re: A HASH alternative for a lengthy Proc Sql Exclusive join?

Posted in reply to Astounding

You are right.

Thank you for the hint.

 

I tried both ways.

 

Actually the first one suits my goal.

 

Thanks again.

 

Super User
Posts: 10,618

Re: A HASH alternative for a lengthy Proc Sql Exclusive join?

Posted in reply to Zeus_Olympous

See my Maxim 10. SAS SQL has a tendency to overload the I/O subsystem when doing joins of physically large tables.

---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
How to post code
Super User
Posts: 13,947

Re: A HASH alternative for a lengthy Proc Sql Exclusive join?

Posted in reply to Zeus_Olympous

You may be interested in a different exclusion approach in SQL.

proc sql;
  create table not_exist_in_sample  as
  select cpevent, ptm from source
  except
  select cpevent, ptm from sample
  ;
quit;

The Except operation requires the same variables on both select statements, so you would need to join the result back to the source data set to bring in any additional variables.

 

I won't claim this is as fast as the data set but I suspect that the subquery in the exists function is a lot of extra overhead.

Super User
Posts: 10,860

Re: A HASH alternative for a lengthy Proc Sql Exclusive join?

Posted in reply to Zeus_Olympous

It is indeed very convenient for Hash Table.

 


data source;
  retain string "abcde";
  cpevent=2;ptm=22;output;
  cpevent=3;ptm=33;output;
  cpevent=4;ptm=44;output;
  cpevent=0;ptm=11;output;
  cpevent=1;ptm=11;output;
  cpevent=1;ptm=15;output;
run;
 
data sample;
  cpevent=3;ptm=33;output;
  cpevent=1;ptm=11;output;
run;

data want;
 if _n_=1 then do;
  if 0 then set sample;
  declare hash h(dataset:'sample',hashexp:20);
  h.definekey('cpevent','ptm');
  h.definedone();
 end;
set source;
if h.check() ne 0;
run;
Contributor
Posts: 53

Re: A HASH alternative for a lengthy Proc Sql Exclusive join?

I really thank you all for your help.

 

Ksharp's solution with Hash table is very insightful as well.

☑ This topic is solved.

Need further help from the community? Please ask a new question.

Discussion stats
  • 11 replies
  • 249 views
  • 0 likes
  • 6 in conversation