Hi,
I have two data sets A and B with some common variables and case IDs, and I want to merge them. Most of all, for cases with the same variable names, I'd like to take values from first A; if missing then take from B. I know I can rename variables in B before merging and then use DO IF to get the job done. But I'm wondering if there is quick way without going through the hassle of renaming and DO IF (as I have need to deal with more than 15 variables on this.
In SQL I could use coalesce, such as proc sql; select coalesce(a.var1, b.var1) as var1 from A as a full join B as b on a.ID=b.ID ...
But in Data sets, is there a trick to do this in a couple of simple lines?
data a; input id 1 var1 3 var2 5 var3 7;
datalines;
1 1 2
2 2 3 4
3 3 4 5
;
proc print; run;
data b; input id var1 var3 var4;
datalines;
1 11 12 7
2 12 13 8
4 13 14 9
;
proc print; run;
data comb; format id var1 var2 var3 var4; merge b a; by id;
proc print; run;
HAVE | WANT | ||||||||
id | var1 | var2 | var3 | var4 | id | var1 | var2 | var3 | var4 |
1 | 1 | 2 | . | 7 | 1 | 1 | 2 | 12 | 7 |
2 | . | 3 | 4 | 8 | 2 | 40 | 3 | 4 | 8 |
3 | 3 | . | 5 | . | 3 | 3 | . | 5 | . |
4 | 13 | . | 14 | 9 | 4 | 13 | . | 14 | 9 |
Why
If my suggested values are actuallly correct, then:
data want;
update b a;
by id;
run;
The UPDATE statement is a lot like merge. When the corresponding value in the second dataset (A) is not missing, it will replace the value in B. But if the value in A is missing then the original value in B is kept.
Another way to do this is:
data want;
set a b;
by id;
array vars var1-var4;
do over vars;
vars=ifn(first.id=0,coalesce(lag(vars),vars),vars);
end;
if last.id;
run;
But the UPDATE statement, can't be extended to more-than-two datasets (say you have datasets A, B, and C, and you want the equivalent of coalesce(a.var,b.var,c.var)). That could be done via:
data want;
set a b c;
by id;
array vars {4} var1-var4;
array tmp {4} _temporary_ ;
if first.id then call missing(of tmp{*});
do i=1 to 4;
tmp{i}=coalesce(tmp{i},vars{i});
end;
if last.id;
do i=1 to 4;
vars{i}=tmp{i};
end;
run;
But actually the update statement can be used with more-than-two datasets, per @Tom 's reply to my post in this thread. True, it's an extra data step, but because the extra step is a data set VIEW (not a data set FILE), it does not require extra disk usage. Very nice.
Why
If my suggested values are actuallly correct, then:
data want;
update b a;
by id;
run;
The UPDATE statement is a lot like merge. When the corresponding value in the second dataset (A) is not missing, it will replace the value in B. But if the value in A is missing then the original value in B is kept.
Another way to do this is:
data want;
set a b;
by id;
array vars var1-var4;
do over vars;
vars=ifn(first.id=0,coalesce(lag(vars),vars),vars);
end;
if last.id;
run;
But the UPDATE statement, can't be extended to more-than-two datasets (say you have datasets A, B, and C, and you want the equivalent of coalesce(a.var,b.var,c.var)). That could be done via:
data want;
set a b c;
by id;
array vars {4} var1-var4;
array tmp {4} _temporary_ ;
if first.id then call missing(of tmp{*});
do i=1 to 4;
tmp{i}=coalesce(tmp{i},vars{i});
end;
if last.id;
do i=1 to 4;
vars{i}=tmp{i};
end;
run;
But actually the update statement can be used with more-than-two datasets, per @Tom 's reply to my post in this thread. True, it's an extra data step, but because the extra step is a data set VIEW (not a data set FILE), it does not require extra disk usage. Very nice.
Update seems like what they want.
To get around the limit of only two datasets (original and transactions) you could make a dataset view.
Plus the original dataset needs to have unique observations per BY group. Perhaps your original dataset has multiple observations per id, perhaps because of repeated observation over rime.
You could get around both limitations by using a data step view and OBS=0 dataset option.
So make a view that interleaves the datasets by the id variables. Set the datasets in reverse order of precedence. Use this as the transactions dataset. Then use OBS=0 option on any of the dataset as the original dataset.
data all / view=all;
set c b a ;
by id;
run;
data want;
update a(obs=0) all;
by id;
run;
Thanks mkeintz for the codes. All of them work beautifully. And Tom's way to get around for 3 datasets is brilliant and very concise. Only wish I could mark both your replies as solutions.
You are misusing "concatenate". That means to "add on to" or such. If I have a value of 'a' and concatenate with a value of 'b' the result is 'ab' (or if done in the other order) 'ba'. It does not replace values.
COALESCE works in a data step but you would need to rename at least one of the variables.
Your "example" have does not indicate 1) which data set is "have" (does not match either A or B), 2) and should show the two "have" data sets each clearly and then the "want".
It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.