I am trying to merge two files and don’t want to get rid of the duplicate values in Table_1 and replaced with the values in table_2. What are the codes that will do it for me?
Data Table_1;
input ID Name$ SalesQ Price;
datalines;
101 AMBA 4 15
107 GEORGE 22 15
107 GEORGE 22 15
107 GEORGE 22 15
108 HADEN 9 15
110 JACKY 8 15
109 IDDIE 19 15
104 DAME 6 15
111 KAN 211 15
112 LAN 83 09 15
106 FOYE 26 15
107 GEORGE 22 15
107 GEORGE 22 15
;
data Table_2;
input ID Name $ SalesQ Price color$;
datalines;
101 AMBA 6 24 blue
102 BAMBIE 5 32 red
103 CARRIE 1 17 yellow
104 DAME 1 24 green
105 EDDIE 7 2 tan
106 FOYE 3 6 bluegreen
107 GEORGE 100 11 pink
;
Proc sort Data = Table_1;
By ID;
Run;
Proc sort Data = Table_2;
By ID;
Run;
Data Table_3;
Merge Table_1 Table_2;
By ID;
Run;
Current Output | ||||
ID | Name | SalesQ | Price | color |
101 | AMBA | 6 | 24 | blue |
102 | BAMBIE | 5 | 32 | red |
103 | CARRIE | 1 | 17 | yellow |
104 | DAME | 1 | 24 | green |
105 | EDDIE | 7 | 2 | tan |
106 | FOYE | 3 | 6 | bluegree |
107 | GEORGE | 100 | 11 | pink |
107 | GEORGE | 22 | 15 | pink |
107 | GEORGE | 22 | 15 | pink |
107 | GEORGE | 22 | 15 | pink |
107 | GEORGE | 22 | 15 | pink |
108 | HADEN | 9 | 15 | |
109 | IDDIE | 19 | 15 | |
110 | JACKY | 8 | 15 | |
111 | KAN | 211 | 15 | |
112 | LAN | 83 | 9 | |
Desired Output | ||||
ID | Name | SalesQ | Price | color |
101 | AMBA | 6 | 24 | blue |
102 | BAMBIE | 5 | 32 | red |
103 | CARRIE | 1 | 17 | yellow |
104 | DAME | 1 | 24 | green |
105 | EDDIE | 7 | 2 | tan |
106 | FOYE | 3 | 6 | bluegree |
107 | GEORGE | 100 | 11 | pink |
107 | GEORGE | 100 | 11 | pink |
107 | GEORGE | 100 | 11 | pink |
107 | GEORGE | 100 | 11 | pink |
107 | GEORGE | 100 | 11 | pink |
108 | HADEN | 9 | 15 | |
109 | IDDIE | 19 | 15 | |
110 | JACKY | 8 | 15 | |
111 | KAN | 211 | 15 | |
112 | LAN | 83 | 9 |
Another way to look at the problem is to control the output based on whether data exists in both files. e.g.:
Data Table_3 (drop=in:);
merge Table_2 (in=in2)
Table_1 (in=in1 rename=(SalesQ=in_S Price=in_P));
By ID;
if in1 and not(in2) then do;
SalesQ=in_S;
Price=in_P;
end;
Run;
HTH,
Art
try:
Data Table_3;
Merge Table_1 Table_2(keep=id color);
By ID;
Run;
I tried both suggestions but they did not work. Please give it another short
What happens if you reverse your tables in the merge statement like the following?
Data Table_3;
Merge Table_2 Table_1;
By ID;
Run;
proc sql;
create table output as
select
coalesce(a.ID, b.ID) as ID,
coalescec(a.name, b.name) as name,
case
when missing(b.salesq) then a.salesq
else b.salesq
end as salesq,
case
when missing(b.price) then a.price
else b.price
end as price,
b.color
from
table_1 as a
full join
table_2 as b
on
a.ID = b.ID
order by
ID;
quit;
It seems to me to be hash() problem, no need to presort:
data want;
if _n_=1 then do;
if 0 then set table_2;
dcl hash h(dataset:"table_2");
h.definekey("id");
h.definedata(all:"y");
h.definedone();
end;
set table_1;
_n_=h.find();
output;
call missing(of _all_);
run;
Haikuo
Colleagues, I have invested some thoughts into problem after I realized that it is more than it looks like. This problem can be used as the touchstone to show what classic data step Can do or Can Not do. The signature feature of data step is sequential processing, it is efficient and relying less on the capacity of memory. However, one thing difficult or awkward for data step to do is 'roll back', data step implicit loop or even data step index will only go One-way. This is why although we have data step index, it can only be used on one-to-one merge. The advent of hash() changes everything, with hash(), data step can do almost everything that Proc SQL is capable of, and most of the time with an advantageous margin in performance. So you see, if using data step to do a merge requiring 'roll back', then it has to be 'random access' kind of technique, in fact, we can actually mimic hash() behavior using array() + point= under classic data step setting, of course it is not as slick and robust as hash():
data want;
array ind(0:9999) _temporary_;
do _n_=1 by 1 until (l1);
set table_2 end=l1;
ind(id)=_n_;
end;
do until (l2);
set table_1 end=l2;
_n_=ind(id);
set table_2 point=_n_;
output;
call missing(color);
end;
run;
Any comments will be highly appreciated. And thank OP for raising this interesting question.
Haikuo
Another way to look at the problem is to control the output based on whether data exists in both files. e.g.:
Data Table_3 (drop=in:);
merge Table_2 (in=in2)
Table_1 (in=in1 rename=(SalesQ=in_S Price=in_P));
By ID;
if in1 and not(in2) then do;
SalesQ=in_S;
Price=in_P;
end;
Run;
HTH,
Art
Borrowed Art's idea:
Data Table_3b (drop=in:);
merge Table_1 Table_2 (in=in2 rename=(SalesQ=in_S Price=in_P));
By ID;
if in2 then do;
SalesQ=in_S;
Price=in_P;
end;
Run;
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.