BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Solph
Pyrite | Level 9

Hi,

 

I have two data sets A and B with some common variables and case IDs, and I want to merge them. Most of all, for cases with the same variable names, I'd like to take values from first A; if missing then take from B. I know I can rename variables in B before merging and then use DO IF to get the job done. But I'm wondering if there is quick way without going through the hassle of renaming and DO IF (as I have need to deal with more than 15 variables on this.

In SQL I could use coalesce, such as proc sql; select coalesce(a.var1, b.var1) as var1 from A as a full join B as b on a.ID=b.ID ...

But in Data sets, is there a trick to do this in a couple of simple lines? 

data a; input id 1 var1 3 var2 5 var3 7;
datalines;
1 1 2 
2 2 3 4
3 3 4 5
;
proc print; run;
data b; input id var1 var3 var4;
datalines;
1 11 12 7
2 12 13 8
4 13 14 9
;
proc print; run;

data comb; format id var1 var2 var3 var4; merge b a; by id; 
proc print; run;
HAVE         WANT        
id var1 var2 var3 var4 id var1 var2 var3 var4
1 1 2 . 7 1 1 2 12 7
2 . 3 4 8 2 40 3 4 8
3 3 . 5 . 3 3 . 5 .
4 13 . 14 9 4 13 . 14 9

 

1 ACCEPTED SOLUTION

Accepted Solutions
mkeintz
PROC Star

Why

  1. ,,, does the result for ID 2, Var1 have a 40?  Shouldn't is be a 2?
  2. ... does the result for ID 3, Var2 have a missing value.  Shouldn't it be a 4?

 

If my suggested values are actuallly correct, then:

 

data want;
  update b a;
  by id;
run;

The UPDATE statement is a lot like merge.  When the corresponding value in the second dataset (A) is not missing, it will replace the value in B.  But if the value in A is missing then the original value in B is kept.

 

Another way to do this is:

data want;
  set a b;
  by id;
  array vars var1-var4;
  do over vars;
    vars=ifn(first.id=0,coalesce(lag(vars),vars),vars);
  end;
  if last.id;
run;

But the UPDATE statement, can't be extended to more-than-two datasets (say you have datasets A, B, and C, and you want the equivalent of coalesce(a.var,b.var,c.var)).  That could be done via:

 

data want;
  set  a b c;
  by id;
  array vars {4} var1-var4;
  array tmp {4} _temporary_ ;
  if first.id then call missing(of tmp{*});
  do i=1 to 4;
    tmp{i}=coalesce(tmp{i},vars{i});
  end;
  if last.id;
  do i=1 to 4;
    vars{i}=tmp{i};
  end;
run;

 

 But actually the update statement can be used with more-than-two datasets, per @Tom 's reply to my post in this thread.  True, it's an extra data step, but because the extra step is a data set VIEW (not a data set FILE), it does not require extra disk usage.  Very nice.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

4 REPLIES 4
mkeintz
PROC Star

Why

  1. ,,, does the result for ID 2, Var1 have a 40?  Shouldn't is be a 2?
  2. ... does the result for ID 3, Var2 have a missing value.  Shouldn't it be a 4?

 

If my suggested values are actuallly correct, then:

 

data want;
  update b a;
  by id;
run;

The UPDATE statement is a lot like merge.  When the corresponding value in the second dataset (A) is not missing, it will replace the value in B.  But if the value in A is missing then the original value in B is kept.

 

Another way to do this is:

data want;
  set a b;
  by id;
  array vars var1-var4;
  do over vars;
    vars=ifn(first.id=0,coalesce(lag(vars),vars),vars);
  end;
  if last.id;
run;

But the UPDATE statement, can't be extended to more-than-two datasets (say you have datasets A, B, and C, and you want the equivalent of coalesce(a.var,b.var,c.var)).  That could be done via:

 

data want;
  set  a b c;
  by id;
  array vars {4} var1-var4;
  array tmp {4} _temporary_ ;
  if first.id then call missing(of tmp{*});
  do i=1 to 4;
    tmp{i}=coalesce(tmp{i},vars{i});
  end;
  if last.id;
  do i=1 to 4;
    vars{i}=tmp{i};
  end;
run;

 

 But actually the update statement can be used with more-than-two datasets, per @Tom 's reply to my post in this thread.  True, it's an extra data step, but because the extra step is a data set VIEW (not a data set FILE), it does not require extra disk usage.  Very nice.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
Tom
Super User Tom
Super User

Update seems like what they want.

 

To get around the limit of only two datasets (original and transactions) you could make a dataset view. 

 

Plus the original dataset needs to have unique observations per BY group.  Perhaps your original dataset has multiple observations per id, perhaps because of repeated observation over rime.

 

You could get around both limitations by using a data step view and OBS=0 dataset option.

 

So make a view that interleaves the datasets by the id variables.  Set the datasets in reverse order of precedence.  Use this as the transactions dataset.  Then use OBS=0 option on any of the dataset as the original dataset.

data all / view=all;
   set c b a ;
   by id;
run;
data want;
  update a(obs=0) all;
  by id;
run;

 

 

Solph
Pyrite | Level 9

Thanks mkeintz for the codes. All of them work beautifully. And Tom's way to get around for 3 datasets is brilliant and very concise. Only wish I could mark both your replies as solutions. 

ballardw
Super User

You are misusing "concatenate". That means to "add on to" or such. If I have a value of 'a' and concatenate with a value of 'b' the result is 'ab' (or if done in the other order) 'ba'. It does not replace values.

 

COALESCE works in a data step but you would need to rename at least one of the variables.

Your "example" have does not indicate 1) which data set is "have" (does not match either A or B), 2) and should show the two "have" data sets each clearly and then the "want".

 

 

 

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 4 replies
  • 2822 views
  • 3 likes
  • 4 in conversation