Solved: Re: data step equivalent of proc sql coalese

Julie99999 · Posted 07-27-2022 05:38 PM

Hi,

I have two dataset A and B, A and B have same column names but different data.

I use the following sql code to 'coalesce" data from A and B:

proc sql;

create table c as

select coalesce(a.column1, b.column1) as column 1,

...

coalesce(a.column100, b.column100) as column 100

from A as a full join B as b

on a.key = b.key and a.period = b.period;

quit;

My question is: is there any data step equivalent to the proc sql procedure listed above?

I tried

data c;

update a b;

by key, period;

run;

But the problem is that, whenever both table a and table b contains values for say, column1, then the column1 value from the transactional dataset b will be used. My intention is to keep value from the master table a whenever it's not missing and only use value from table b if such value is missing in a(like the way coalesce)

Thanks in advance!

PaigeMiller · Posted 07-27-2022 07:19 PM

In a data step, you can use the COALESCE function on numeric variables, or the COALESCEC function on character variables. You just have to merge the tables in a DATA step, and rename the variables in one of the tables.

In addition you can use an ARRAY to prevent the need from writing COALESCE 100 times, you code it once and loop through it 100 times.

--
Paige Miller

View solution in original post

PaigeMiller · Posted 07-27-2022 07:19 PM

In a data step, you can use the COALESCE function on numeric variables, or the COALESCEC function on character variables. You just have to merge the tables in a DATA step, and rename the variables in one of the tables.

In addition you can use an ARRAY to prevent the need from writing COALESCE 100 times, you code it once and loop through it 100 times.

--
Paige Miller

Julie99999 · Posted 07-27-2022 10:38 PM

thank you so much my friend. I was thinking the same thing and your comment makes me more confident about the solution I was thinking about. By the way, I recognize your ID, you previously helped me (with my another ID). Thank you so much for continuing contributing to the community!

Astounding · Posted 07-27-2022 07:21 PM

The UPDATE approach will work, so long as the data sets are limited to one observation per PERIOD KEY combination. Just change the order in the UPDATE statement:

update b a;

Julie99999 · Posted 07-27-2022 10:41 PM

thank you for your reply. THis is my quesiton actually. Because for UPDATE, if we do update b a; then whenever there is value in a, the previous value in b will be overrided. that's not intended. What if I want to always keep values from b, but only get values from a if such values in b is missing? It seems for "update b a", the result is slightly different.

SASKiwi · Posted 07-27-2022 11:15 PM

@Julie99999 :

"update b a" works if you want to keep A if populated and only take B if A is not populated.

"update a b" works if you want to keep B if populated and only take A if B is not populated.

Julie99999 · Posted 07-28-2022 11:45 AM

thanks for the succinct take-away!

Astounding · Posted 07-28-2022 08:15 AM

Julie,

Actually, that's different than what you posted originally. The originally post looks for the value from A, but if that is missing use the value from B. You would get this (but by a slightly different route) using:

update b a;

You you start with the value from B. But if the value from A exists, use that to replace the value from B. So you would get the value from A (when it exists), but the value from B (when A has just a missing value).

Julie99999 · Posted 07-28-2022 11:45 AM

Astounding, I see your point. That's smart! That would work and solves exactly my question, how come I didn't realize by simply switching to "b a" would solve the problem.
update b a;
thanks for the "A ha" moment you've brought to me. 🙂

ballardw · Posted 07-27-2022 08:29 PM

My first suggestion would be to try this assuming the two data sets are sorted correctly:

data c;
   merge b a ;
   by key period;
run;

The Merge, assuming only ONE of the data sets has multiples of the By variables, will replace variables left to right on a appearance on the Merge statement. So like named variables in A will replace those of B when the By variables match. This will include missing values in A replacing values in B if such occur. If you do not want the missing values to replace them and as @Astounding said neither has duplicates of the By variables you can prevent update of missing values from A by using UPDATE.

SAS Innovate 2025: Save the Date

SAS Training: Just a Click Away