BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Zeus_Olympous
Obsidian | Level 7

Dear all,

 

I have two tables (please see attached a portion since "sample" contains millions of observations)

 

data source;
  retain string "abcde";
  cpevent=2;ptm=22;output;
  cpevent=3;ptm=33;output;
  cpevent=4;ptm=44;output;
  cpevent=0;ptm=11;output;
  cpevent=1;ptm=11;output;
  cpevent=1;ptm=15;output;
run;

 

data sample;
  cpevent=3;ptm=33;output;
  cpevent=1;ptm=11;output;
run;

 

I want the observations in "source" that are not matched with the observations in the "sample".

 

I used Proc sql as shown:

 

proc sql;

create table not_exist_in_sample  as
  select * from source as a
  where not exists (select 1 from sample as b where a.cpevent=b.cpevent and a.ptm=b.ptm);

quit;

 

but the problem is taking ages to conclude the run.

 

I understand that HASH object code would do it way faster.

 

Any HASH hint/tip would be more than welcome.

 

Thank you.

1 ACCEPTED SOLUTION

Accepted Solutions
Astounding
PROC Star

OK, working with data sets sorted BY CPEVENT PTM:

 

data not_found_in_sample;

set sample (in=in_sample) source;

by cpevent ptm;

if first.ptm then delete_flag = in_sample;

retain delete_flag;

if delete_flag then delete;

drop delete_flag;

run;

 

It might be helpful for all of us if you could post the results ... how much time did SQL take vs. how much time did the DATA step take.

View solution in original post

11 REPLIES 11
Astounding
PROC Star

There are much faster methods, but let's explore some possibilities.  Are you data sets small enough that you can sort them in a reasonably short amount of time?

Zeus_Olympous
Obsidian | Level 7

Dear Astounding,

 

Thank you ..

 

Yes, I would guess so..

 

Sorting them would not take that much of a time.

Astounding
PROC Star

OK, working with data sets sorted BY CPEVENT PTM:

 

data not_found_in_sample;

set sample (in=in_sample) source;

by cpevent ptm;

if first.ptm then delete_flag = in_sample;

retain delete_flag;

if delete_flag then delete;

drop delete_flag;

run;

 

It might be helpful for all of us if you could post the results ... how much time did SQL take vs. how much time did the DATA step take.

Zeus_Olympous
Obsidian | Level 7

Dear Astounding,

 

Thank you very much.

 

Your code indicates a clean Victory of SAS data step VS. Proc Sql

 

Results:

Data Step

 

87 proc sort data=sample;

88 by var1 var2;

89 run;

NOTE: There were 17770 observations read from the data set WORK.sample.

NOTE: The data set WORK.sample  has 17770 observations and 87 variables.

NOTE: PROCEDURE SORT used (Total process time):

real time 1.75 seconds

cpu time 1.60 seconds

 

90

91 proc sort data=source;

92 by  var1 var2;

3 The SAS System 18:20 Thursday, November 30, 2017

93 run;

NOTE: There were 2185269 observations read from the data set WORK.source.

NOTE: The data set WORK.source has 2185269 observations and 29 variables.

NOTE: PROCEDURE SORT used (Total process time):

real time 11.77 seconds

cpu time 16.22 seconds

 

94

95 data not_found_in_sample;

96 set source(in=in_sample) sample;

97 by var1 var2;

98

99 if first.var2_D then

100 delete_flag = in_sample;

101 retain delete_flag;

102

103 if delete_flag then

104 delete;

105 drop delete_flag;

106 run;

WARNING: Multiple lengths were specified for the variable Sales_Channel by input data set(s). This can cause truncation of data.

NOTE: There were 2185269 observations read from the data set WORK.source.

NOTE: There were 17770 observations read from the data set WORK.sample.

NOTE: The data set WORK.NOT_FOUND_IN_SAMPLE has 145 observations and 105 variables.

NOTE: DATA statement used (Total process time):

real time 7.93 seconds

cpu time 7.76 seconds

 

Proc Sql

 

Unknown time!!!!!

 

After one hour I just stopped the PROC SQL processing!!!!!

Astounding
PROC Star

Great.  Also note a couple of issues.

 

It looks like SALES_CHANNEL is in both data sets, but with different lengths.  Drop it from the data set that doesn't need it.

 

More importantly, you switched the order of the data sets on the SET statement.  I know you were editing the log that you posted (I can tell by looking at the BY statement vs. IF FIRST.VAR2_D).  But if the order was actually switched, that changes the results.  Compare your SET statement with the original one that I posted.  The SAMPLE data set has to come first, and the IN= option has to be applied to the SAMPLE data set.  So you might be OK but it depends on what you actually ran vs. what you posted as the log.

 

 

Zeus_Olympous
Obsidian | Level 7

You are right.

Thank you for the hint.

 

I tried both ways.

 

Actually the first one suits my goal.

 

Thanks again.

 

ballardw
Super User

You may be interested in a different exclusion approach in SQL.

proc sql;
  create table not_exist_in_sample  as
  select cpevent, ptm from source
  except
  select cpevent, ptm from sample
  ;
quit;

The Except operation requires the same variables on both select statements, so you would need to join the result back to the source data set to bring in any additional variables.

 

I won't claim this is as fast as the data set but I suspect that the subquery in the exists function is a lot of extra overhead.

Ksharp
Super User

It is indeed very convenient for Hash Table.

 


data source;
  retain string "abcde";
  cpevent=2;ptm=22;output;
  cpevent=3;ptm=33;output;
  cpevent=4;ptm=44;output;
  cpevent=0;ptm=11;output;
  cpevent=1;ptm=11;output;
  cpevent=1;ptm=15;output;
run;
 
data sample;
  cpevent=3;ptm=33;output;
  cpevent=1;ptm=11;output;
run;

data want;
 if _n_=1 then do;
  if 0 then set sample;
  declare hash h(dataset:'sample',hashexp:20);
  h.definekey('cpevent','ptm');
  h.definedone();
 end;
set source;
if h.check() ne 0;
run;
Zeus_Olympous
Obsidian | Level 7

I really thank you all for your help.

 

Ksharp's solution with Hash table is very insightful as well.

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 11 replies
  • 1363 views
  • 0 likes
  • 6 in conversation