Dear all,
I have two tables (please see attached a portion since "sample" contains millions of observations)
data source;
retain string "abcde";
cpevent=2;ptm=22;output;
cpevent=3;ptm=33;output;
cpevent=4;ptm=44;output;
cpevent=0;ptm=11;output;
cpevent=1;ptm=11;output;
cpevent=1;ptm=15;output;
run;
data sample;
cpevent=3;ptm=33;output;
cpevent=1;ptm=11;output;
run;
I want the observations in "source" that are not matched with the observations in the "sample".
I used Proc sql as shown:
proc sql;
create table not_exist_in_sample as
select * from source as a
where not exists (select 1 from sample as b where a.cpevent=b.cpevent and a.ptm=b.ptm);
quit;
but the problem is taking ages to conclude the run.
I understand that HASH object code would do it way faster.
Any HASH hint/tip would be more than welcome.
Thank you.
OK, working with data sets sorted BY CPEVENT PTM:
data not_found_in_sample;
set sample (in=in_sample) source;
by cpevent ptm;
if first.ptm then delete_flag = in_sample;
retain delete_flag;
if delete_flag then delete;
drop delete_flag;
run;
It might be helpful for all of us if you could post the results ... how much time did SQL take vs. how much time did the DATA step take.
Should be pretty simple, what have you tried?
Some google responses to get you started:
http://www2.sas.com/proceedings/forum2008/029-2008.pdf
There are much faster methods, but let's explore some possibilities. Are you data sets small enough that you can sort them in a reasonably short amount of time?
Dear Astounding,
Thank you ..
Yes, I would guess so..
Sorting them would not take that much of a time.
OK, working with data sets sorted BY CPEVENT PTM:
data not_found_in_sample;
set sample (in=in_sample) source;
by cpevent ptm;
if first.ptm then delete_flag = in_sample;
retain delete_flag;
if delete_flag then delete;
drop delete_flag;
run;
It might be helpful for all of us if you could post the results ... how much time did SQL take vs. how much time did the DATA step take.
Dear Astounding,
Thank you very much.
Your code indicates a clean Victory of SAS data step VS. Proc Sql
Results:
Data Step
87 proc sort data=sample;
88 by var1 var2;
89 run;
NOTE: There were 17770 observations read from the data set WORK.sample.
NOTE: The data set WORK.sample has 17770 observations and 87 variables.
NOTE: PROCEDURE SORT used (Total process time):
real time 1.75 seconds
cpu time 1.60 seconds
90
91 proc sort data=source;
92 by var1 var2;
3 The SAS System 18:20 Thursday, November 30, 2017
93 run;
NOTE: There were 2185269 observations read from the data set WORK.source.
NOTE: The data set WORK.source has 2185269 observations and 29 variables.
NOTE: PROCEDURE SORT used (Total process time):
real time 11.77 seconds
cpu time 16.22 seconds
94
95 data not_found_in_sample;
96 set source(in=in_sample) sample;
97 by var1 var2;
98
99 if first.var2_D then
100 delete_flag = in_sample;
101 retain delete_flag;
102
103 if delete_flag then
104 delete;
105 drop delete_flag;
106 run;
WARNING: Multiple lengths were specified for the variable Sales_Channel by input data set(s). This can cause truncation of data.
NOTE: There were 2185269 observations read from the data set WORK.source.
NOTE: There were 17770 observations read from the data set WORK.sample.
NOTE: The data set WORK.NOT_FOUND_IN_SAMPLE has 145 observations and 105 variables.
NOTE: DATA statement used (Total process time):
real time 7.93 seconds
cpu time 7.76 seconds
Proc Sql
Unknown time!!!!!
After one hour I just stopped the PROC SQL processing!!!!!
Great. Also note a couple of issues.
It looks like SALES_CHANNEL is in both data sets, but with different lengths. Drop it from the data set that doesn't need it.
More importantly, you switched the order of the data sets on the SET statement. I know you were editing the log that you posted (I can tell by looking at the BY statement vs. IF FIRST.VAR2_D). But if the order was actually switched, that changes the results. Compare your SET statement with the original one that I posted. The SAMPLE data set has to come first, and the IN= option has to be applied to the SAMPLE data set. So you might be OK but it depends on what you actually ran vs. what you posted as the log.
You are right.
Thank you for the hint.
I tried both ways.
Actually the first one suits my goal.
Thanks again.
See my Maxim 10. SAS SQL has a tendency to overload the I/O subsystem when doing joins of physically large tables.
You may be interested in a different exclusion approach in SQL.
proc sql; create table not_exist_in_sample as select cpevent, ptm from source except select cpevent, ptm from sample ; quit;
The Except operation requires the same variables on both select statements, so you would need to join the result back to the source data set to bring in any additional variables.
I won't claim this is as fast as the data set but I suspect that the subquery in the exists function is a lot of extra overhead.
It is indeed very convenient for Hash Table.
data source;
retain string "abcde";
cpevent=2;ptm=22;output;
cpevent=3;ptm=33;output;
cpevent=4;ptm=44;output;
cpevent=0;ptm=11;output;
cpevent=1;ptm=11;output;
cpevent=1;ptm=15;output;
run;
data sample;
cpevent=3;ptm=33;output;
cpevent=1;ptm=11;output;
run;
data want;
if _n_=1 then do;
if 0 then set sample;
declare hash h(dataset:'sample',hashexp:20);
h.definekey('cpevent','ptm');
h.definedone();
end;
set source;
if h.check() ne 0;
run;
I really thank you all for your help.
Ksharp's solution with Hash table is very insightful as well.
Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9.
Early bird rate extended! Save $200 when you sign up by March 31.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.