Hi,
Can I have inner join and left join in one step in PROC SQL, or I've to do in 2 steps or do with subquery?
Here is an example.
1. I want ID in both tables A and B (so just ID=1 and 2), so it's inner join (in data step it's in A and B)
2. I then want to link the cases to table C to get their breakfast category. Since it's likely the category in C is not complete, I'd need to use left join to keep records (i.e. mushrrom) from step 1 (in data step it's in A).
data a; input id x1;
datalines;
1 11
2 22
;
data b; input id x2 breakfast $;
datalines;
1 1 apple
2 1 carrot
2 1 mushroom
3 1 pepper
;
data c; input breakfast $ category $;
datalines;
apple fruit
orange fruit
carrot vegetable
pepper vegetable
;
Data wanted:
ID breakfast category
1 apple fruit
2 carrot vegetable
2 mushroom .
SAS code in 2 steps:
proc sql; create table d as
select a.id, b.breakfast
from a as a, b as b
where a.id=b.id;
select d.id, d.breakfast, c.category
from d as d left join c as c
on d.breakfast=c.breakfast;
quit;
proc print data=want noobs; run;
A related question is can I do inner join (in A and B) and in A but not C in one step? I read that EXCEPT can do the job, but when I copied the example code, it didn't seem right. For the fruit example above, the data set should look like
ID breakfast category
2 mushroom .
Many thanks in advance.
It seems that you confuse left join with except logic, which do you want?
And yes, you can nestle joins in the same SQL step, but I recommend that you you explicit join syntax (i.e. a inner join b on...) to avoid unexpected results.
Hi,
Would the following piece of code do what you are expecting ?
proc sql;
create table d as
select a.id
, b.breakfast
, c.category
from a
inner join b
on a.id = b.id
left join c
on b.breakfast = c.breakfast;
quit;
data e;
set d (where=(missing(category)));
run;
Kind regards,
Florent
proc sql; create table res as ( select x.id,x.breakfast,c.category from (select a.id, b.breakfast from a join b on a.id=b.id) x left join c on x.breakfast=c.breakfast ); run;
Thanks NagendraKumarK for the subquery code. And thanks especially Florent for the code, which is exactly what I want, for my first question.
Just a question though, if it possible to do If A and B on ID, and If A but not C on breakfast in one code? I see Florent's code in two steps. Possibly in one step.
LinusH, sorry my questions weren't phrased clearly. It's two part questions. First how do I do inner join and left join in one code (I've never seen sample code as Florent's, so it really helped). Second, how do I do inner join (If A and B) and If A but not C in one code. Florent's two-step code is great. But I'm wondering it it can be done in one step.
The first step (table d) contains everything which is in A and B, and potentially in C. When there is no match with the table C then the content of the category field will be empty.
If you want to have table e created at the same time as table d (so in one step), then you should use make a dastep using the merge statement and play with the inA, inB, inC variable you can create at that time. The disadvantage of making a datastep instead of a proc sql is that it requires you firstly sort tables a, b and c in the same order... so in the end you would have 3 sort statements + 1 datastep. There is no way to create two datasets at a time with a "create table" statement of the proc sql.
I hope it answers to your questions.
Second question, just combine an inner join with a sub-query (not in() ).
Thanks for the replies. I've a related question still about doing various merges, hoping to find NO NEED to do sub-queries. The example above is a simpler one as it is always using A to inner join with B, and to left join with C. But the example below doesn't always start with A. I get the feeling that sub-queres are the only way to go in SQL, or use data step. Just want to confirm.
data a; input patient_id hospital_id los;
datalines;
1 100 1
1 102 2
2 100 4
3 101 5
3 103 10
;
data b; input patient_id hospital_id wtcases;
datalines;
1 100 1.5
1 102 2.2
2 100 1.1
3 103 4.1
;
data c; input hospital_id postcode $;
datalines;
100 58823
101 45000
102 45001
;
data d; input postcode $ area;
datalines;
58823 51
45000 52
99999 99
;
*The code below actualy wouldn't run uless I take out what's after SELECT, and use just SELECT *. Then I see it's outputting all combinations and realized the FROM statement isn't doing what I thought SQL could do.
Proc sql; create table want as
select a.id, a.los, b.hospital_id, b.wtcases, c.postcode, d.area
from a inner join b on a.patient_id=b.patient_id and a.hospital_id=b.hospital_id,
b left join c on b.hospital_id=c.hospital_id,
c left join d on c.postcode=d.postcode;
quit;
proc print data=a noobs; run;
proc print data=b noobs; run;
proc print data=c noobs; run;
proc print data=d noobs; run;
proc print data=want; run;.
When using the "inner join"/"left join"/"right join" syntax, you cannot put a comma after the join condition(s). Furthermore, you don't have to repeat the table names which have already been used in a join. With your code, it gives:
Proc sql;
create table want as
select a.id, a.los, b.hospital_id, b.wtcases, c.postcode, d.area
from a
inner join b
on a.patient_id = b.patient_id
and a.hospital_id = b.hospital_id
left join c
on b.hospital_id = c.hospital_id
left join d
on c.postcode=d.postcode;
quit;
proc print data=a noobs; run;
proc print data=b noobs; run;
proc print data=c noobs; run;
proc print data=d noobs; run;
proc print data=want; run;
Thanks a million. SQL is indeed very powerful. The code worked for a large part. What it didn't work is the following. If you have answers for it, let me know. Otherwise I'll change to answered.
1. I need to change from
SELECT a.id, a.los, b.hospital_id, b.wtcases, c.postcode, d.area
to
SELECT *
otherwise it's giving an error message "ERROR: Column id could not be found in the table/view identified with the correlation name A" even though there is nothing wrong.
2. It's outputting the following cases, I thought inner join A and B by patient ID and Hospital ID would get rid of patient ID=3 all together. But it got rid of the one from hospital 101 and kept the one from hospital 103.
patient_id | hospital_id | los | wtcases | postcode | area |
3 | 103 | 10 | 4.1 | . | |
1 | 102 | 2 | 2.2 | 45001 | . |
2 | 100 | 4 | 1.1 | 58823 | 51 |
1 | 100 | 1 | 1.5 | 58823 | 51 |
Anyway thanks very much for the reply. I didn't know I could use SQL this way. Let me know if there are answers to the above questions. Otherwise I'll close this discussion.
Hi,
Concerning the first point, it is clear that the "ID" is not available in table A. It should be replaced by "patient_id".
For the second bullet point, I do not agree with the fact that the line concerning hospital 103 and patient 3 should be excluded from the table. If you look at tables A and B, you will find this patient-hospital combination in both of them. Thus this is logical to find it back in your final table when using inner joins.
I hope it helps.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.