Hi, I'm new to SAS, just completed Programming 1 and VA Fast Track and gained access to EG. While I wait for The Little SAS Book for Enterprise to arrive in the mail, I have a few questions about joining tables in EG.
Please correct my nomenclature or terminology when possible, so I can improve my knowledge of this platform. If anything is unclear, let me know and I will do my best.
My usual MO is to use Excel to join and lookup values to build tables from several .csv files. I have very adept at performing these actions, and while sometimes slow, it does the job. I have had it referred to as the tradesman way of doing things, and haven't got a great deal of experience with Access, SQL or relational databases, although I can waddle my way through.
I appreciate the power and might of SAS, and would like to start performing these actions in EG for myself, with a view that other non SAS users can run these as a stored process or project as backup in a pinch.
Ok, here is where I am at.
Importing Data - all good
Formats, Informats and Labels - all good
Converting to SAS datasets - all good, I have converted my lookup files into SAS sets, as well as my source files.
Calculated fields - ok, still working on it (missing values had me for a sec)
Joining Tables - hmmmm not working the same way as I thought it would
Example – Old Method join
I concatenate three columns into one value in all tables, to get one index value or primary key. Starting the query the tbl_Source1, I use this value to vlook other tables to return values to separate columns. Let's call it Key (var1,var2,var3).I have simplified it down to two returned values and a few other calc fields and a value lookup.
tbl_Source1
Key (var1,var2,var3) | var1 | var2 | var3 | var3_literal | X | Y | Calc_Field | Calc_Field2 |
redwhiteblue | Red | White | Blue | Turquoise | 1 | 2 | =(X+Y) | =(X*20) |
tbl_Source2
key | X |
redwhiteblue | 1 |
tbl_Source3
key | Y |
redwhiteblue | 2 |
tbl_Lookup1
var3 | var3_literal |
Blue | Turquoise |
New Method (the way I see it)
Using the query builder, I have tried to perform this however this seems to drop columns, and I can't see with clear visual inference when a value does not match to the other sets.
Q2. What type of join would I require to perform this?
I have tried several methods, and I don’t seem to be hitting the mark. I am performing all joins in the one query based on tbl_source1. I have had success with the calc fields using advanced expression.
tbl_source1 (post query)
var1 | var2 | var3 | var3_literal | X | Y | Calc_Field | Calc_Field2 |
Red | White | Blue | Turquoise | 1 | 2 | sum(X,Y) | (X*20) |
tbl_Source2
var1 | var2 | var3 | x |
Red | White | Blue | 1 |
tbl_source3
var1 | var2 | var3 | y |
Red | White | Blue | 2 |
tbl_Lookup1
var3 | var3_literal |
Blue | Turquoise |
Apart from the join, I would like to see other results as well.
If the three values (var1,var2,var3) is in tbl_source1, are not in tbl_source2 or tbl_source3, output a subset with these observations (non-matched records).
If the three values (var1,var2,var3) is in tbl_source2 or tbl_source3 but not in tbl_source1, output two subsets with these observations.
If var1, var2, var3 as separate and individual values appear don’t appear in all tables, create a method to identify them.
And for the lookup join (similar to last)
If var3 presents in any source table, but does not appear in tbl_Lookup1, create a subset of records, or identify values that are in the source files but does not appear in the lookup. A validation that the lookup contains all the values is required.
Some of this was covered in Prog 1, however not in this detail, and not in EG. Any guidance would be appreciated,
Michael
Hi Michael,
Great to see you online 🙂
The default join type in Enterprise Guide is the inner join, which means the rows that are returned are where there are matches from both tables. This obviously isn't what you are after so you need to change the join type. In the Query Builder you go into the join tables window and you can double click on the join circle/Venn diagram to change this or right mouse click and select the option. I think what you are after is a full join on the tables and then you want to select the rows of interest using the filter tab. If you are joining your 3 tables at the same time, and do a full join on each of the joins then you will be getting rows that match and don't match from ALL tables. You can then use the filter tab to subset the rows based on the criteria you want.
E.g. If the three values (var1,var2,var3) is in tbl_source1, are not in tbl_source2 or tbl_source3, output a subset with these observations (non-matched records). - To do this then you would need to specify in the filter tab, (tbl_Source2.var1 is missing and tbl_Source2.var2 is missing and tbl_Source2.var3 is missing) or (tbl_Source3.var1 is missing and tbl_Source3.var2 is missing and tbl_Source3.var3 is missing)
or you can do the subsetting on the values of the columns from tbl_Source2 and tbl_Source3 that are missing (X and Y), if it is those values that determine non-matches.
Similarly for your other queries, look at using an outer join type with filtering. Here's an EG 4.1 paper that you might find useful as a reference: http://www.scsug.org/SCSUGProceedings/2010/Schlechte/Using_Enterprise_Guide_for_Table_Joins_and_Sets...
As you mentioned you have done Prog 1, then I suggest to look at page 10-85 from the Prog 1 course notes where they have an example of creating multiple data sets when merging multiple tables using the in= data set options and subsetting if statement. If your data has many-to-many observations though, its probably best to use the EG query builder or write PROC SQL as the way the data step processes observation at a time will probably not produce what you want for a many-to-many merge (see pages 10-86 & 10-87 where data step is compared to sql).
Hope this helps.
Cheers,
Michelle
Hi Michael,
Great to see you online 🙂
The default join type in Enterprise Guide is the inner join, which means the rows that are returned are where there are matches from both tables. This obviously isn't what you are after so you need to change the join type. In the Query Builder you go into the join tables window and you can double click on the join circle/Venn diagram to change this or right mouse click and select the option. I think what you are after is a full join on the tables and then you want to select the rows of interest using the filter tab. If you are joining your 3 tables at the same time, and do a full join on each of the joins then you will be getting rows that match and don't match from ALL tables. You can then use the filter tab to subset the rows based on the criteria you want.
E.g. If the three values (var1,var2,var3) is in tbl_source1, are not in tbl_source2 or tbl_source3, output a subset with these observations (non-matched records). - To do this then you would need to specify in the filter tab, (tbl_Source2.var1 is missing and tbl_Source2.var2 is missing and tbl_Source2.var3 is missing) or (tbl_Source3.var1 is missing and tbl_Source3.var2 is missing and tbl_Source3.var3 is missing)
or you can do the subsetting on the values of the columns from tbl_Source2 and tbl_Source3 that are missing (X and Y), if it is those values that determine non-matches.
Similarly for your other queries, look at using an outer join type with filtering. Here's an EG 4.1 paper that you might find useful as a reference: http://www.scsug.org/SCSUGProceedings/2010/Schlechte/Using_Enterprise_Guide_for_Table_Joins_and_Sets...
As you mentioned you have done Prog 1, then I suggest to look at page 10-85 from the Prog 1 course notes where they have an example of creating multiple data sets when merging multiple tables using the in= data set options and subsetting if statement. If your data has many-to-many observations though, its probably best to use the EG query builder or write PROC SQL as the way the data step processes observation at a time will probably not produce what you want for a many-to-many merge (see pages 10-86 & 10-87 where data step is compared to sql).
Hope this helps.
Cheers,
Michelle
Cheers Michelle, I took the easy road with Tom's instructions, but I printed the SUGI, and it has convinced me to learn more about PROC SQL, I think that's one of the heavy lifters in what i want to achieve.
No problem Michael. Glad to hear that you were able to get what you wanted and were guided by the assistance of the SAS community. 🙂
Hi, Tucky
Welcome to the exciting (but sometimes overwhelming) world of SAS!
Everything Michelle says is spot-on. I'll add a few comments.
You're climbing a high wall, in that EG uses SQL under the query builder covers, and SQL is an enormously complex subject on its own. The good news, is you can start small and use pieces of it for what you need without knowing the whole thing.
When you talk about Excel, I assume you use the VLOOKUP function, which I use a lot as well. Here's what I would consider the equivalent in EG:
1. Start with your "master" table, in your example I think it's tbl_Source1.
2. Select Join Tables, Add Tables, and select your lookup table (tbl_Source2?)
3. EG may "guess" at a join, or not. Either way, you need to make sure that the corresponding key variables are linked, and that they are all the second option (left join). This will keep ALL of the records in tbl_Source1, and ONLY include data from tbl_Source2 when the keys match tbl_Source1.
Give this a try, and see if it works the way you expect. In the resulting table, you mention that you'd like to know if the records matched or not.
One way I do this is to include var1 from tbl_Source2 in the results (with a different name), and check its value. Since it's used to join to tbl_Source1, if a record was found in tbl_Source2, it won't be missing, and if there was no match, it will be missing.
Here's the SQL that EG generated for me...you can see 'include_flag', which is the renamed var1 from the lookup table (they'll all either be there or not be there). So this is how you can fulfill your requirement to find the records in tbl_Source1 that didn't match; just do a selection with a filter of 'include_flag IS MISSING'.
PROC SQL;
CREATE TABLE WORK.QUERY_FOR_TBL_SOURCE1 AS
SELECT t1.var1,
t1.var2,
t1.var3,
t2.x,
t2.var1 LABEL="include_flag" AS include_flag
FROM WORK.TBL_SOURCE1 t1
LEFT JOIN WORK.TBL_SOURCE2 t2 ON (t1.var1 = t2.var1) AND (t1.var2 = t2.var2) AND (t1.var3 = t2.var3);
QUIT;
A quick note about "missing" values. SAS started with this terminology in the late '70s. When SQL came along, they introduced the concept of NULL, which is very similar but not quite identical. For your purposes, you can consider them the same thing, with two different names.
For now, just join two tables at a time, and then do another query to add in your next table, until you get familiar with the whole shmozzle.
You're making a really good start at this. Keep it up, and ask lots of questions!
Tom
Thanks Tom, very helpful. I managed to find time to work on this during the week. Using your advice, managed the joins successfully. I used the query feature to create subsets with missing values on the finished sheets, then I could review the results and make changes to source files, then rerun the project. Stepping the project out two tables at a time helped, I think I was trying to create an "all in one", biting off more than I could chew.
Outstanding! At the end of the day, it's getting results that counts.
When getting used to SQL, the "try a small example, see what happens, and work from there", which you used, is essential. As time goes by, you'll be doing more and more in one step.
Michelle's advice is spot on. SQL is a very powerful tool in SAS, and well worth learning.
Good luck!
Tom
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
What’s the difference between SAS Enterprise Guide and SAS Studio? How are they similar? Just ask SAS’ Danny Modlin.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.