Solved: Re: Merge: how to concatenate values of the same variables from both d...

Solph · Posted 02-19-2022 01:10 AM

Hi,

I have two data sets A and B with some common variables and case IDs, and I want to merge them. Most of all, for cases with the same variable names, I'd like to take values from first A; if missing then take from B. I know I can rename variables in B before merging and then use DO IF to get the job done. But I'm wondering if there is quick way without going through the hassle of renaming and DO IF (as I have need to deal with more than 15 variables on this.

In SQL I could use coalesce, such as proc sql; select coalesce(a.var1, b.var1) as var1 from A as a full join B as b on a.ID=b.ID ...

But in Data sets, is there a trick to do this in a couple of simple lines?

data a; input id 1 var1 3 var2 5 var3 7;
datalines;
1 1 2 
2 2 3 4
3 3 4 5
;
proc print; run;
data b; input id var1 var3 var4;
datalines;
1 11 12 7
2 12 13 8
4 13 14 9
;
proc print; run;

data comb; format id var1 var2 var3 var4; merge b a; by id; 
proc print; run;

HAVE					WANT
id	var1	var2	var3	var4	id	var1	var2	var3	var4
1	1	2	.	7	1	1	2	12	7
2	.	3	4	8	2	40	3	4	8
3	3	.	5	.	3	3	.	5	.
4	13	.	14	9	4	13	.	14	9

mkeintz · Posted 02-19-2022 02:09 AM

Why

,,, does the result for ID 2, Var1 have a 40? Shouldn't is be a 2?
... does the result for ID 3, Var2 have a missing value. Shouldn't it be a 4?

If my suggested values are actuallly correct, then:

data want;
  update b a;
  by id;
run;

The UPDATE statement is a lot like merge. When the corresponding value in the second dataset (A) is not missing, it will replace the value in B. But if the value in A is missing then the original value in B is kept.

Another way to do this is:

data want;
  set a b;
  by id;
  array vars var1-var4;
  do over vars;
    vars=ifn(first.id=0,coalesce(lag(vars),vars),vars);
  end;
  if last.id;
run;

But the UPDATE statement, can't be extended to more-than-two datasets (say you have datasets A, B, and C, and you want the equivalent of coalesce(a.var,b.var,c.var)). That could be done via:

data want;
  set  a b c;
  by id;
  array vars {4} var1-var4;
  array tmp {4} _temporary_ ;
  if first.id then call missing(of tmp{*});
  do i=1 to 4;
    tmp{i}=coalesce(tmp{i},vars{i});
  end;
  if last.id;
  do i=1 to 4;
    vars{i}=tmp{i};
  end;
run;

But actually the update statement can be used with more-than-two datasets, per @Tom 's reply to my post in this thread. True, it's an extra data step, but because the extra step is a data set VIEW (not a data set FILE), it does not require extra disk usage. Very nice.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

mkeintz · Posted 02-19-2022 02:09 AM

Why

,,, does the result for ID 2, Var1 have a 40? Shouldn't is be a 2?
... does the result for ID 3, Var2 have a missing value. Shouldn't it be a 4?

If my suggested values are actuallly correct, then:

data want;
  update b a;
  by id;
run;

The UPDATE statement is a lot like merge. When the corresponding value in the second dataset (A) is not missing, it will replace the value in B. But if the value in A is missing then the original value in B is kept.

Another way to do this is:

data want;
  set a b;
  by id;
  array vars var1-var4;
  do over vars;
    vars=ifn(first.id=0,coalesce(lag(vars),vars),vars);
  end;
  if last.id;
run;

But the UPDATE statement, can't be extended to more-than-two datasets (say you have datasets A, B, and C, and you want the equivalent of coalesce(a.var,b.var,c.var)). That could be done via:

data want;
  set  a b c;
  by id;
  array vars {4} var1-var4;
  array tmp {4} _temporary_ ;
  if first.id then call missing(of tmp{*});
  do i=1 to 4;
    tmp{i}=coalesce(tmp{i},vars{i});
  end;
  if last.id;
  do i=1 to 4;
    vars{i}=tmp{i};
  end;
run;

But actually the update statement can be used with more-than-two datasets, per @Tom 's reply to my post in this thread. True, it's an extra data step, but because the extra step is a data set VIEW (not a data set FILE), it does not require extra disk usage. Very nice.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Tom · Posted 02-19-2022 01:38 PM

Update seems like what they want.

To get around the limit of only two datasets (original and transactions) you could make a dataset view.

Plus the original dataset needs to have unique observations per BY group. Perhaps your original dataset has multiple observations per id, perhaps because of repeated observation over rime.

You could get around both limitations by using a data step view and OBS=0 dataset option.

So make a view that interleaves the datasets by the id variables. Set the datasets in reverse order of precedence. Use this as the transactions dataset. Then use OBS=0 option on any of the dataset as the original dataset.

data all / view=all;
   set c b a ;
   by id;
run;
data want;
  update a(obs=0) all;
  by id;
run;

Solph · Posted 02-19-2022 04:27 PM

Thanks mkeintz for the codes. All of them work beautifully. And Tom's way to get around for 3 datasets is brilliant and very concise. Only wish I could mark both your replies as solutions.

ballardw · Posted 02-19-2022 01:14 PM

You are misusing "concatenate". That means to "add on to" or such. If I have a value of 'a' and concatenate with a value of 'b' the result is 'ab' (or if done in the other order) 'ba'. It does not replace values.

COALESCE works in a data step but you would need to rename at least one of the variables.

Your "example" have does not indicate 1) which data set is "have" (does not match either A or B), 2) and should show the two "have" data sets each clearly and then the "want".

Merge: how to concatenate values of the same variables from both datasets

Re: Merge: how to concatenate values of the same variables from both datasets

Re: Merge: how to concatenate values of the same variables from both datasets

Re: Merge: how to concatenate values of the same variables from both datasets

Re: Merge: how to concatenate values of the same variables from both datasets

Re: Merge: how to concatenate values of the same variables from both datasets

Merge: how to concatenate values of the same variables from both datasets

Re: Merge: how to concatenate values of the same variables from both datasets

Re: Merge: how to concatenate values of the same variables from both datasets

Re: Merge: how to concatenate values of the same variables from both datasets

Re: Merge: how to concatenate values of the same variables from both datasets

Re: Merge: how to concatenate values of the same variables from both datasets

Registration is open

SAS Training: Just a Click Away