- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hello
What is more efficient way to perform merge?
proc sql or data step? and why?
/***Way1-Proc sql****/
proc sql;
create table want1 as
select a.*,b.Numerator
from t1 as a
left join t2 as b
on a.Customer_id=b.Customer_id and a.branch_ID=b.branch_ID
;
quit;
/***Way2-Data Step****/
proc sort data=t1;
by Customer_id branch_ID;
Run;
proc sort data=t2;
by Customer_id branch_ID;
Run;
Data want;
Retain Numerator;
Merge t1(in=a) t2(in=b);
by Customer_id branch_ID;
if a;
Run;
Accepted Solutions
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Run both steps and look at the log. You may need sufficiently large datasets to see a difference.
If t2 (using only the three variables needed) can fit into memory, a single DATA step with a hash object will be most efficient:
data want;
set t1;
if _n_ = 1
then do;
length numerator /* insert proper attribute here */;
declare hash t2 (dataset:"t2 (keep=customer_id branch_id numerator)");
t2.definekey("customer_id","branch_id");
t2.definedata("numerator");
t2.definedone();
call missing("numerator");
end;
rc = t2.find();
drop rc;
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Run both steps and look at the log. You may need sufficiently large datasets to see a difference.
If t2 (using only the three variables needed) can fit into memory, a single DATA step with a hash object will be most efficient:
data want;
set t1;
if _n_ = 1
then do;
length numerator /* insert proper attribute here */;
declare hash t2 (dataset:"t2 (keep=customer_id branch_id numerator)");
t2.definekey("customer_id","branch_id");
t2.definedata("numerator");
t2.definedone();
call missing("numerator");
end;
rc = t2.find();
drop rc;
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Let's say that t2 data set is located in permanent library r_r.
so the data set name is r_r.t2
does it mean that I need to write r_r.t2 in your code instead of t2?
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
@Ronein wrote:
does it mean that I need to write r_r.t2 in your code instead of t2?
Correct.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
<Pedantic mode: on>
Any discussion of "efficient" should include a definition of what is measured and how it is used to determine efficiency. If you don't provide something to measure how do you know if something is "more efficient".
<Pedantic mode:off>
Programming time
Program maintenance
Single Run time
Data storage space
Network traffic
Just a few different things that might be considered in "efficient".