BookmarkSubscribeRSS Feed
0 Likes

Left joins can increase the number of rows in the left table if there are multiple matches in the right table.

 

I know we can use the data step to select only the first or last match a join finds, but this parameter should be something the user can select inside of EG's Query Builder/Join Tables function, instead of having to dip down into code. 

 

I can't seem to find a way to replicate this in the drag and drop of EG.  Have to use code directly.  If I use EG Query Builder and a left join on NameID, Bob shows up twice in the Final table as his NameID matches two orders.  The code below only selects the first order found because of the first.nameid statement; strike it and Bob shows up twice.   Ideally, you'd be able to handle multiple matches on the join inside of the EG Join Tables layout directly.

  

 

8 Comments
Reeza
Super User

This would make the SAS left join different than other SQL implementations. IMO this is a user responsibility, they need to create the correct filter to make the join specific to what they want. Either add another criteria to the join or a WHERE clause for the filter.  So you can 'limit' the results in the query builder if you have clear rules. 

 

Otherwise you add another step to remove the duplicates first, ie using SORT task.

ballardw
Super User

If you do not want all matches from the second set in the result then you should not use Left or Right Join on those sets. That is designed function of Left/Right Join.

 

It is up to you to select the records you want based on your criteria. Reduce either one or both sets to the records you want before the join or possibly in a subquery as part of the join query.

 

 

fifthand57th
SAS Employee
Status changed to: Not Planned
 
graal
Calcite | Level 5

Part of the reason for the suggestion is that users that migrate from Excel's VLOOKUP are used to obtaining a single match result, instead of ballooning the number of rows in the left table.  Yes, the suggestions above are certainly valid, but I thought adding an output limiter parameter at the Join Tables screen would be useful.

Reeza
Super User

@graal You can replicate VLOOKUP functionality using a user defined format. Also, if the table is designed as a VLOOKUP it would require unique entries which wouldn't result in duplicates. If the lookup table has duplicates this is when you run into the 'extra row' issue. Excel would have issues in this scenario as well. 

 

Left joins  are broader than a VLOOKUP and maintaining the consistency with other SQL implementations is more important IMO. 

 

Note that you couldn't limit it anyways with a click, you would have to define some variable to restrict it on and that's already possible using a where or filter. I'll post a screen shot of how to accomplish your task with just the GUI when I'm back in the office Monday and have access to EG. 

Quentin
Super User

Setting aside my bias against Excel Smiley Happy, for such a feature if you wanted to keep only one record from the right table, you would need have a number of options to decide which records to keep (first? last? some other criteria?)  And remember that "first" and "last" don't really have much meaning in SQL, which is a SET language where record order doesn't matter. It looks like VLOOKUP keeps only the first value found, "first" has a meaning in Excel. I think allowing the programmer to write their own subquery or pre process or post process the data gives them apprpriate control.

 

Interestingly, when I google "VLOOKUP Multiple Records" I find a lot of hits where people are teaching Excel workarounds for what to do when you want multiple records to be returned, but VLOOKUP won't do it for you.

graal
Calcite | Level 5

Bias against Excel?  It is a perfect program and does everything, including make omlettes.

 

Absolutely VLOOKUP isnt perfect and has limitations and is a halfway house version of a left join.  But it's commonly used by the casual Excel user, who now has to use SAS EG to tap bigger datasets but doesnt want to dive into code.  They're used to VLOOKUP's limitations and output, and may not realize the rows balloon in the left table on multiple matches.  It's not intuitive to someone who is only used to VLOOKUP. 

 

The majority of the folks on this board are completely comfortable with code, which is great, but not everyone is, and that's why EG's drag and drop is so useful.  Hence my suggestion to add a limiter feature in the drag and drop interface.

 

Yes you'd need a first or last or something else parameter in the Join Tables page.  And maybe EG wouldnt use SQL to execute it....might use a DATA step to get the results.  EG already generates code for you, this would just be a flavor.

 

Maybe an easier, lower impact alternative would be to add a warning to the Log output.  If a left join generated output that duplicated any row from the left table, a warning could be displayed that would alert the user.  Wouldnt prevent the ballooning of records but it would at least remind the user it happened, and point them towards a solution (de-dupe the right table so there are unique lookups).

Reeza
Super User

All programmers end up with a bias against Excel. There isn't a single conversion of code that I've done from Excel to SAS/R/Python/Fortran/C/Java that's not had a mistake it in the Excel file. 

Excel has its uses and is generally universal but it also causes a lot of issues.