Re: PROC SQL nested joins and Coalesce

BobHope · Posted 06-04-2018 08:26 AM

I encountered a pretty strange problem while reviewing some code programmed by another programmer. First of all he was using nested joins in PROC SQL. By nested I mean that there is joins appended like the following:


proc sql;
create table test as
select a.*, coalesce(b.var,c.var,d.var,e.var) as newvar
from data1 as a 
left join data2(where=(some filter)) as b on a.key=b.key
where some conditions for a and/or b
left join data3(where=(some filter)) as c on a.key=c.key
where some conditions for a and/or c
left join data4(where=(some filter)) as d on a.key=d.key;
where some conditions for a and/or d
quit;

And the joins SEEM to go as intended. I was not aware that this kind of nesting is even possible so does SAS process this sequentially so that it first performs the join with a and b and then the result table is joined with c etc.? For example I don't quite understand how SAS handles the situations where there are common columns in b,c and d and they are already contained in the result table that the new condition is then left joined...

And then we come to the second question which is probably somehow related to the above mentioned processing. The reason I previously wrote SEEM is because the coalesce does not output correct values. For example if we have a situation that the current iterations state is that b.var="" c.var="" and d.var="value" then the coalesce assigns newvar="" BUT if the order of the arguments are changed (but the state is still the same) for example to coalesce(b.var,d.var,c.var) then the newvar gets the value of "value"...?!?

So my initial reasoning was that the aliases a,b,c and d refers to the filtered data (for example data2(where=(some filter)) for b) and not to the joined data when using the coalesce (I am not quite sure if coalesce is transforming into a SAS function instead of SQL function because of several arguments, since the coalesce is supported by SQL if I am not wrong). But this reasoning was revoked because when I tried this manually by filtering all the datasets with the where filter and the outcome was just as desired but the coalesce output was still wrong).

Hopefully someone can get the hang of this issue even there are no specific codes. Thanks!

RW9 · Posted 06-04-2018 08:47 AM

To be honest, I would just scrap that code (its using SAS bits, and SQL bits for instance) and simply:

data test;
  merge a (where=(some filter) in=a)
        b (where=(some filter))
        ...;
  by var;
  if a;
run;

Easier to read?

As for why its doing that, well depends on where the data overlaps, what the data is (could it be a non missing space in one of those, that would then treat it as the first non-missing, even though its just a space). Its best to break i down, remove most of the left joins and add them comparing each run to see what is added. Once you build it up that way you should be able to trace what is happening.

BobHope · Posted 06-04-2018 08:59 AM

Yes I agree that it would be a lot wiser to just do it in parts. But the reason why I would like to know what is happening in this particular case is that since this is clearly used by some programmers it would be good to understand the clogs of this machine. And to point out I have done this in bits and it works when run step by step, it just fails to perform when using this nested approach.

RW9 · Posted 06-04-2018 10:38 AM

Sorry, you mis interpreted. You can block out code with /* */ and see what each bit is doing. Its the debug process. Can't say anything without the data as that drives half of it.

gamotte · Posted 06-04-2018 09:59 AM

Hello,

without example data that replicates your issue, it is difficult to answer. Can you identify the variables and observations where your coalesce problem occur and create a minimal example in the form of a datastep ?

I wouldn't call what is done "nested joins" but consecutive joins meaning you sequentially append new columns at the right of your dataset . Those are left joins meaning that the contents of data1 define which observations will be in the output dataset.

The query creates in memory a "big" dataset

[D1 | matching obs of D2 | matching Obs of D3 | matching Obs of D4 ]

which is then filtered by the where clauses.

"dataset foo as a" defines an alias "a" for the foo dataset. When there exist some columns with common names in several datasets

the ambiguity is removed by prefixing the variable name by the source dataset alias :a.my_var, b.my_var, ...

If you want a.my_var and b.my_var to be in the output dataset, one of them has to be renamed :

SELECT a.my_var, b.my_var AS my_var_of_b, ...

data3(where=(some filter))

Here the filter is applied before joining the resulting dataset

where some conditions for a and/or b

This filter is applied after the joins. I don't think there is a need to have several such statements. A unique where statement at the end of the query with all conditions should be equivalent.

coalesce(b.var,c.var,d.var,e.var) as newvar

creates a new variable newvar as the first non-missing value of variable var in the datasets data2, data3, data4.

Note : there is no dataset alias e in your query.

PROC SQL nested joins and Coalesce