- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hello Team,
I wondered if users had any experience with "multiple conditional" LEFT JOINs in a PROC SQL?
What I currently have is a PROC SQL with a LEFT JOIN operation that retrieves the variable reference based on id, date, and loc1, which you can see in the code below (I am using SAS 9.4.):
proc sql;
create table table1 as
select a.*,b.reference
from event1 as a
left join master1 as b on a.id = b.id and a.date between b.date1 and b.date2 and a.loc1=b.loc1 and a.loc1^=.;
/* left join master1 as b on a.id = b.id and a.date between b.date1 and b.date2 and a.loc1=b.loc2 and a.loc1^=.;*/
/* left join master1 as b on a.id = b.id and a.date between b.date1 and b.date2 and a.loc3=b.loc3 and a.loc3^=.;*/
quit;
What I would like to do (in the same PROC SQL statement) is to have options for alternative locations (as id and date cannot change in my context) when there are no matching records in the right table (i.e., master1). For example, if there are no records for some rows, I would like to retrieve matches on a.loc1 = b.loc2, and if there is still not a match, then to retrieve matches based on a.loc3 = b.loc3 (see the comments in the SAS code above for the two additional LEFT JOIN operations that I would like to incorporate - in same way - in the PROC SQL). However, I would need to instruct SAS to use the second (or third) LEFT JOIN only for the rows where the current LEFT JOIN operation found no matching records in the right table. So when the first LEFT JOIN finds no matching records for some (or all) rows, move on to the second one, and finally to the third one, assuming the second one has some no matching records either.
Additionally, it would be great if the SAS code can create three additional variables (i.e., additional to the variable reference) where I can tell which LEFT JOIN operation the SAS code used for the matching record, say a dummy variable equal to 1 for var1 (when loc1 was used, zero otherwise), equal to 1 for var2 (when loc2 was used, zero otherwise), and equal to 1 for var3 (when loc3 was used, zero otherwise).
I would appreciate your help on this one and any advice (or example) as to how I can adjust the code in the above example, please.
Thank you
Accepted Solutions
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hello @chrmav,
I think you can combine LinusH's and PaigeMiller's ideas: Use an OR condition in the ON clause and a corresponding CASE expression in the SELECT statement to identify the matching criterion (e.g., in a variable CRIT with values 1, 2, 3 [for the first, second and third criterion, resp.] and missing [if none of the three criteria was satisfied]). Then you can use that variable CRIT in a HAVING clause to eliminate the duplicate matches mentioned by LinusH, prioritizing CRIT=1 over CRIT=2 and both over CRIT3. The pertinent GROUP BY clause would use a combination of variables (or a single variable) identifying the observations of dataset EVENT1 uniquely. In the code example below I use the combination of ID, DATE, LOC1 and LOC3, but you may have better ideas, knowing the data. If there is no unique key, you should create one (e.g., the observation number) in a preliminary DATA step (view). Obviously, variable CRIT combines the information of the three indicator variables you have described.
proc sql;
create table want as
select a.*, b.reference, case when .^=a.loc1=b.loc1 then 1
when .^=a.loc1=b.loc2 then 2
when .^=a.loc3=b.loc3 then 3
else .
end as crit
from event1 as a
left join master1 as b on a.id = b.id and a.date between b.date1 and b.date2
and (.^=a.loc1=b.loc1 or .^=a.loc1=b.loc2 or .^=a.loc3=b.loc3)
group by a.id, a.date, a.loc1, a.loc3 /* or rather: group by a.<unique key variable(s) in EVENT1> */
having crit=min(crit);
quit;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
CASE WHEN works in the ON clause
Paige Miller
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Thanks for the reply @PaigeMiller . I had a look online about CASE WHEN in the ON clause but I could not find anything relevant with the LEFT JOIN, so I am not sure how this may work. Any ideas?
Another solution that I thought of was to do three LEFT JOINs in the same PROC SQL and then do a DATA step with COALESCE to get what I want (btw loc variables are numeric). But it would be great to know if PROC SQL can perform this without the DATA step. Thanks
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Is it possible to post test data as datalines in a data step?
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hello @chrmav,
I think you can combine LinusH's and PaigeMiller's ideas: Use an OR condition in the ON clause and a corresponding CASE expression in the SELECT statement to identify the matching criterion (e.g., in a variable CRIT with values 1, 2, 3 [for the first, second and third criterion, resp.] and missing [if none of the three criteria was satisfied]). Then you can use that variable CRIT in a HAVING clause to eliminate the duplicate matches mentioned by LinusH, prioritizing CRIT=1 over CRIT=2 and both over CRIT3. The pertinent GROUP BY clause would use a combination of variables (or a single variable) identifying the observations of dataset EVENT1 uniquely. In the code example below I use the combination of ID, DATE, LOC1 and LOC3, but you may have better ideas, knowing the data. If there is no unique key, you should create one (e.g., the observation number) in a preliminary DATA step (view). Obviously, variable CRIT combines the information of the three indicator variables you have described.
proc sql;
create table want as
select a.*, b.reference, case when .^=a.loc1=b.loc1 then 1
when .^=a.loc1=b.loc2 then 2
when .^=a.loc3=b.loc3 then 3
else .
end as crit
from event1 as a
left join master1 as b on a.id = b.id and a.date between b.date1 and b.date2
and (.^=a.loc1=b.loc1 or .^=a.loc1=b.loc2 or .^=a.loc3=b.loc3)
group by a.id, a.date, a.loc1, a.loc3 /* or rather: group by a.<unique key variable(s) in EVENT1> */
having crit=min(crit);
quit;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I ran the code you suggested and it appears to accomplish exactly what I need. Additionally, it is quite versatile, as I can easily adjust the matching order by specifying the priority in the CASE expression - indicating which option should be first (1), second (2), or third (3).
Thank you so much for you time and effort on this.