2 weeks ago
I encountered a pretty strange problem while reviewing some code programmed by another programmer. First of all he was using nested joins in PROC SQL. By nested I mean that there is joins appended like the following:
proc sql; create table test as select a.*, coalesce(b.var,c.var,d.var,e.var) as newvar from data1 as a
left join data2(where=(some filter)) as b on a.key=b.key where some conditions for a and/or b left join data3(where=(some filter)) as c on a.key=c.key where some conditions for a and/or c left join data4(where=(some filter)) as d on a.key=d.key; where some conditions for a and/or d quit;
And the joins SEEM to go as intended. I was not aware that this kind of nesting is even possible so does SAS process this sequentially so that it first performs the join with a and b and then the result table is joined with c etc.? For example I don't quite understand how SAS handles the situations where there are common columns in b,c and d and they are already contained in the result table that the new condition is then left joined...
And then we come to the second question which is probably somehow related to the above mentioned processing. The reason I previously wrote SEEM is because the coalesce does not output correct values. For example if we have a situation that the current iterations state is that b.var="" c.var="" and d.var="value" then the coalesce assigns newvar="" BUT if the order of the arguments are changed (but the state is still the same) for example to coalesce(b.var,d.var,c.var) then the newvar gets the value of "value"...?!?
So my initial reasoning was that the aliases a,b,c and d refers to the filtered data (for example data2(where=(some filter)) for b) and not to the joined data when using the coalesce (I am not quite sure if coalesce is transforming into a SAS function instead of SQL function because of several arguments, since the coalesce is supported by SQL if I am not wrong). But this reasoning was revoked because when I tried this manually by filtering all the datasets with the where filter and the outcome was just as desired but the coalesce output was still wrong).
Hopefully someone can get the hang of this issue even there are no specific codes. Thanks!
2 weeks ago
To be honest, I would just scrap that code (its using SAS bits, and SQL bits for instance) and simply:
data test; merge a (where=(some filter) in=a) b (where=(some filter)) ...; by var;
if a; run;
Easier to read?
As for why its doing that, well depends on where the data overlaps, what the data is (could it be a non missing space in one of those, that would then treat it as the first non-missing, even though its just a space). Its best to break i down, remove most of the left joins and add them comparing each run to see what is added. Once you build it up that way you should be able to trace what is happening.
2 weeks ago
2 weeks ago
Sorry, you mis interpreted. You can block out code with /* */ and see what each bit is doing. Its the debug process. Can't say anything without the data as that drives half of it.
2 weeks ago
without example data that replicates your issue, it is difficult to answer. Can you identify the variables and observations where your coalesce problem occur and create a minimal example in the form of a datastep ?
I wouldn't call what is done "nested joins" but consecutive joins meaning you sequentially append new columns at the right of your dataset . Those are left joins meaning that the contents of data1 define which observations will be in the output dataset.
The query creates in memory a "big" dataset
[D1 | matching obs of D2 | matching Obs of D3 | matching Obs of D4 ]
which is then filtered by the where clauses.
"dataset foo as a" defines an alias "a" for the foo dataset. When there exist some columns with common names in several datasets
the ambiguity is removed by prefixing the variable name by the source dataset alias :a.my_var, b.my_var, ...
If you want a.my_var and b.my_var to be in the output dataset, one of them has to be renamed :
SELECT a.my_var, b.my_var AS my_var_of_b, ...
Here the filter is applied before joining the resulting dataset
where some conditions for a and/or b
This filter is applied after the joins. I don't think there is a need to have several such statements. A unique where statement at the end of the query with all conditions should be equivalent.
coalesce(b.var,c.var,d.var,e.var) as newvar
creates a new variable newvar as the first non-missing value of variable var in the datasets data2, data3, data4.
Note : there is no dataset alias e in your query.