Hello
I need some help writing a SQL query which will group by a variable while taking the max of something where something is true. Sounds much simpler than it really is, so I'll jump right to my query...
proc sql;
create table INSPN_FY as
select distinct
a.HUB_Name,
b.PROP_ID,
b.PROP_NAME,
c.INSPN_FISC_YR,
e.FISC_YR as ASMT_FISC_YR,
c.INSPECTION_ID,
c.INSPECTION_CD,
c.INSPECTION_SCORE
from HUB_TBL a INNER JOIN PROPERTY_TBL b ON a.HUB_ID = b.HUB_ID
INNER JOIN
INSPECTION_TBL c ON c.PROP_ID = b.PROP_ID
LEFT JOIN INSPECTION_ASSESSMENT d ON c.INSPECTION_ID = d.INSPECTION_ID
LEFT JOIN ASSESSMENT_TBL e on d.ASMT_ID = e.ASMT_ID and d.GRP_ID = e.GRP_ID and d.VER_ID = e.VER_ID
where e.FISC_YR <> . and
c.INSPECTION_ID >=500000 and c.INSPECTION_ID <1000000 and
c.INSPECTION_CD in('RTN','IRA', 'VUR', 'AWP', 'VUI')
and c.PROP_ID is not null and
(c.INSPECTION_SCORE is not null OR c.INSPECTION_CD in( 'VUI'))
and b.PROPERTY_UNIT_CNT > 0
ORDER BY a.HUB_Name,
b.PROP_ID,
b.PROP_NAME,
c.INSPN_FISC_YR,
e.FISC_YR ;
quit;
I want to select inspections for properties by assessment FY, where the conditions in the WHERE clause are met. However, some cases have 2 inspections in the same assessment year, one with a VUI status and one released, such as:
PROPERTY_ID | PROPERTY_UNIT_CNT | INSPN_FISC_YR | ASMT_FISC_YR | INSPECTION_ID | INSPECTION_CODE |
XX999999 | 28 | 2016 | 2017 | 550505 | RTN |
XX999999 | 28 | 2017 | 2017 | 690908 | VUI |
And some have two released, such as:
PROPERTY_ID | PROPERTY_UNIT_CNT | INSPN_FISC_YR | ASMT_FISC_YR | INSPECTION_ID |
|
XX999999 | 74 | 2016 | 2017 | 570707 | RTN |
XX999999 | 74 | 2017 | 2017 | 680808 | RTN |
In the first case, I want the RTN, while in the second case I want the MAX inspection ID. However, if the only inspection for a FY is VUI, I still want it (only omit them if there is a duplicative released).
How can I do this in 1 SQL step? I need it in 1 step because I plan to have this table automated in an ETL. SQL only please.
The first thing I do when I see code like yours is to make it readable by
After that, your code looks like this:
proc sql;
create table INSPN_FY as select distinct
HUB_TBL.HUB_Name,
PROPERTY_TBL.PROP_ID,
PROPERTY_TBL.PROP_NAME,
INSPECTION_TBL.INSPN_FISC_YR,
ASSESSMENT_TBL.FISC_YR as ASMT_FISC_YR,
INSPECTION_TBL.INSPECTION_ID,
INSPECTION_TBL.INSPECTION_CD,
INSPECTION_TBL.INSPECTION_SCORE
from HUB_TBL INNER JOIN
PROPERTY_TBL ON HUB_TBL.HUB_ID = PROPERTY_TBL.HUB_ID INNER JOIN
INSPECTION_TBL ON INSPECTION_TBL.PROP_ID = PROPERTY_TBL.PROP_ID LEFT JOIN
INSPECTION_ASSESSMENT ON INSPECTION_TBL.INSPECTION_ID = INSPECTION_ASSESSMENT.INSPECTION_ID LEFT JOIN
ASSESSMENT_TBL on INSPECTION_ASSESSMENT.ASMT_ID = ASSESSMENT_TBL.ASMT_ID
and INSPECTION_ASSESSMENT.GRP_ID = ASSESSMENT_TBL.GRP_ID
and INSPECTION_ASSESSMENT.VER_ID = ASSESSMENT_TBL.VER_ID
where ASSESSMENT_TBL.FISC_YR <> .
and INSPECTION_TBL.INSPECTION_ID >=500000 and INSPECTION_TBL.INSPECTION_ID <1000000
and INSPECTION_TBL.INSPECTION_CD in('RTN','IRA', 'VUR', 'AWP', 'VUI')
and INSPECTION_TBL.PROP_ID is not null
and (INSPECTION_TBL.INSPECTION_SCORE is not null OR INSPECTION_TBL.INSPECTION_CD in( 'VUI'))
and PROPERTY_TBL.PROPERTY_UNIT_CNT > 0
ORDER BY
HUB_TBL.HUB_Name,
PROPERTY_TBL.PROP_ID,
PROPERTY_TBL.PROP_NAME,
INSPECTION_TBL.INSPN_FISC_YR,
ASSESSMENT_TBL.FISC_YR ;
quit;
The first thing that becomes obvious is that the LEFT joins are silly, they suggest that you may want to have stuff that is not in the ASSESMENT_TBL, but you explicitly make a WHERE clause specifying that assessment fiscal year cannot be missing. So really, it is all inner join.
As you are not using any data from the link table, my next suggestion would be to put the assessment stuff into a subquery:
proc sql;
create table INSPN_FY as select distinct
HUB_TBL.HUB_Name,
PROPERTY_TBL.PROP_ID,
PROPERTY_TBL.PROP_NAME,
INSPECTION_TBL.INSPN_FISC_YR,
ASSESSMENTS.FISC_YR as ASMT_FISC_YR,
INSPECTION_TBL.INSPECTION_ID,
INSPECTION_TBL.INSPECTION_CD,
INSPECTION_TBL.INSPECTION_SCORE
from HUB_TBL INNER JOIN
PROPERTY_TBL ON HUB_TBL.HUB_ID = PROPERTY_TBL.HUB_ID INNER JOIN
INSPECTION_TBL ON INSPECTION_TBL.PROP_ID = PROPERTY_TBL.PROP_ID JOIN
(select
INSPECTION_ASSESSMENT.INSPECTION_ID,
ASSESSMENT_TBL.FISC_YR
from INSPECTION_ASSESSMENT JOIN
ASSESSMENT_TBL on INSPECTION_ASSESSMENT.ASMT_ID = ASSESSMENT_TBL.ASMT_ID
and INSPECTION_ASSESSMENT.GRP_ID = ASSESSMENT_TBL.GRP_ID
and INSPECTION_ASSESSMENT.VER_ID = ASSESSMENT_TBL.VER_ID
where ASSESSMENT_TBL.FISC_YR <> . ) ASSESSMENTS on ASSESSMENTS.INSPECTION_ID=INSPECTION_TBL.INSPECTION_ID
where INSPECTION_TBL.INSPECTION_ID >=500000 and INSPECTION_TBL.INSPECTION_ID <1000000
and INSPECTION_TBL.INSPECTION_CD in('RTN','IRA', 'VUR', 'AWP', 'VUI')
and INSPECTION_TBL.PROP_ID is not null
and (INSPECTION_TBL.INSPECTION_SCORE is not null OR INSPECTION_TBL.INSPECTION_CD in( 'VUI'))
and PROPERTY_TBL.PROPERTY_UNIT_CNT > 0
ORDER BY
HUB_TBL.HUB_Name,
PROPERTY_TBL.PROP_ID,
PROPERTY_TBL.PROP_NAME,
INSPECTION_TBL.INSPN_FISC_YR,
ASSESSMENT_TBL.FISC_YR ;
quit;
Now, what you want is the maximum PROPERTY_ID for each ASMT_FISC_YR, except that INSPECTION_CODE='RTN' takes precedence. If you want to calculate that in a single query, the easiest way is probably to create a dummy variable, which is the property id plus a large number for 'RTN', and just the id for the others:
case INSPECTION_TBL.INSPECTION_CD
when 'RTN' then INSPECTION_TBL.INSPECTION_ID+1E8
else INSPECTION_TBL.INSPECTION_ID
end
You can then use that as your criterion for selecting, e.g.:
create table INSPN_FY as select distinct
HUB_TBL.HUB_Name,
PROPERTY_TBL.PROP_ID,
PROPERTY_TBL.PROP_NAME,
INSPECTION_TBL.INSPN_FISC_YR,
ASSESSMENTS.FISC_YR as ASMT_FISC_YR,
INSPECTION_TBL.INSPECTION_ID,
INSPECTION_TBL.INSPECTION_CD,
INSPECTION_TBL.INSPECTION_SCORE
from HUB_TBL INNER JOIN
PROPERTY_TBL ON HUB_TBL.HUB_ID = PROPERTY_TBL.HUB_ID INNER JOIN
INSPECTION_TBL ON INSPECTION_TBL.PROP_ID = PROPERTY_TBL.PROP_ID JOIN
(select
INSPECTION_ASSESSMENT.INSPECTION_ID,
ASSESSMENT_TBL.FISC_YR
from INSPECTION_ASSESSMENT JOIN
ASSESSMENT_TBL on INSPECTION_ASSESSMENT.ASMT_ID = ASSESSMENT_TBL.ASMT_ID
and INSPECTION_ASSESSMENT.GRP_ID = ASSESSMENT_TBL.GRP_ID
and INSPECTION_ASSESSMENT.VER_ID = ASSESSMENT_TBL.VER_ID
where ASSESSMENT_TBL.FISC_YR <> . ) ASSESSMENTS on ASSESSMENTS.INSPECTION_ID=INSPECTION_TBL.INSPECTION_ID
where INSPECTION_TBL.INSPECTION_ID >=500000 and INSPECTION_TBL.INSPECTION_ID <1000000
and INSPECTION_TBL.INSPECTION_CD in('RTN','IRA', 'VUR', 'AWP', 'VUI')
and INSPECTION_TBL.PROP_ID is not null
and (INSPECTION_TBL.INSPECTION_SCORE is not null OR INSPECTION_TBL.INSPECTION_CD in( 'VUI'))
and PROPERTY_TBL.PROPERTY_UNIT_CNT > 0
group by
INSPECTION_TBL.INSPECTION_ID,
ASSESSMENTS.FISC_YR
having case INSPECTION_TBL.INSPECTION_CD
when 'RTN' then INSPECTION_TBL.INSPECTION_ID+1E8
else INSPECTION_TBL.INSPECTION_ID
end
=max(case INSPECTION_TBL.INSPECTION_CD
when 'RTN' then INSPECTION_TBL.INSPECTION_ID+1E8
else INSPECTION_TBL.INSPECTION_ID
end)
ORDER BY
HUB_TBL.HUB_Name,
PROPERTY_TBL.PROP_ID,
PROPERTY_TBL.PROP_NAME,
INSPECTION_TBL.INSPN_FISC_YR,
ASSESSMENT_TBL.FISC_YR ;
quit;
It may be better, both logically and performance-wise, to calculate the criterion (and everything else) in a subquery, and then apply the GROUP BY, HAVING and ORDER BY to the results of that query. But I leave that as an exercise for the reader. And you may have to add some of the other columns to your GROUP BY.
A 5 table join is a lot to look at with out any data.
The first angle of attach is probably to change the join to table
INNER JOIN
INSPECTION_TBL c
to a join to subselect
INNER JOIN
( SELECT ... stuff ...
from INSPECTION_TBL
where ... correlation criteria ...
group by ...
having ... criteria for selecting desired row ...
) as c
Complex ordering and selection rules may require the subselect have additional subselects itself.
The first thing I do when I see code like yours is to make it readable by
After that, your code looks like this:
proc sql;
create table INSPN_FY as select distinct
HUB_TBL.HUB_Name,
PROPERTY_TBL.PROP_ID,
PROPERTY_TBL.PROP_NAME,
INSPECTION_TBL.INSPN_FISC_YR,
ASSESSMENT_TBL.FISC_YR as ASMT_FISC_YR,
INSPECTION_TBL.INSPECTION_ID,
INSPECTION_TBL.INSPECTION_CD,
INSPECTION_TBL.INSPECTION_SCORE
from HUB_TBL INNER JOIN
PROPERTY_TBL ON HUB_TBL.HUB_ID = PROPERTY_TBL.HUB_ID INNER JOIN
INSPECTION_TBL ON INSPECTION_TBL.PROP_ID = PROPERTY_TBL.PROP_ID LEFT JOIN
INSPECTION_ASSESSMENT ON INSPECTION_TBL.INSPECTION_ID = INSPECTION_ASSESSMENT.INSPECTION_ID LEFT JOIN
ASSESSMENT_TBL on INSPECTION_ASSESSMENT.ASMT_ID = ASSESSMENT_TBL.ASMT_ID
and INSPECTION_ASSESSMENT.GRP_ID = ASSESSMENT_TBL.GRP_ID
and INSPECTION_ASSESSMENT.VER_ID = ASSESSMENT_TBL.VER_ID
where ASSESSMENT_TBL.FISC_YR <> .
and INSPECTION_TBL.INSPECTION_ID >=500000 and INSPECTION_TBL.INSPECTION_ID <1000000
and INSPECTION_TBL.INSPECTION_CD in('RTN','IRA', 'VUR', 'AWP', 'VUI')
and INSPECTION_TBL.PROP_ID is not null
and (INSPECTION_TBL.INSPECTION_SCORE is not null OR INSPECTION_TBL.INSPECTION_CD in( 'VUI'))
and PROPERTY_TBL.PROPERTY_UNIT_CNT > 0
ORDER BY
HUB_TBL.HUB_Name,
PROPERTY_TBL.PROP_ID,
PROPERTY_TBL.PROP_NAME,
INSPECTION_TBL.INSPN_FISC_YR,
ASSESSMENT_TBL.FISC_YR ;
quit;
The first thing that becomes obvious is that the LEFT joins are silly, they suggest that you may want to have stuff that is not in the ASSESMENT_TBL, but you explicitly make a WHERE clause specifying that assessment fiscal year cannot be missing. So really, it is all inner join.
As you are not using any data from the link table, my next suggestion would be to put the assessment stuff into a subquery:
proc sql;
create table INSPN_FY as select distinct
HUB_TBL.HUB_Name,
PROPERTY_TBL.PROP_ID,
PROPERTY_TBL.PROP_NAME,
INSPECTION_TBL.INSPN_FISC_YR,
ASSESSMENTS.FISC_YR as ASMT_FISC_YR,
INSPECTION_TBL.INSPECTION_ID,
INSPECTION_TBL.INSPECTION_CD,
INSPECTION_TBL.INSPECTION_SCORE
from HUB_TBL INNER JOIN
PROPERTY_TBL ON HUB_TBL.HUB_ID = PROPERTY_TBL.HUB_ID INNER JOIN
INSPECTION_TBL ON INSPECTION_TBL.PROP_ID = PROPERTY_TBL.PROP_ID JOIN
(select
INSPECTION_ASSESSMENT.INSPECTION_ID,
ASSESSMENT_TBL.FISC_YR
from INSPECTION_ASSESSMENT JOIN
ASSESSMENT_TBL on INSPECTION_ASSESSMENT.ASMT_ID = ASSESSMENT_TBL.ASMT_ID
and INSPECTION_ASSESSMENT.GRP_ID = ASSESSMENT_TBL.GRP_ID
and INSPECTION_ASSESSMENT.VER_ID = ASSESSMENT_TBL.VER_ID
where ASSESSMENT_TBL.FISC_YR <> . ) ASSESSMENTS on ASSESSMENTS.INSPECTION_ID=INSPECTION_TBL.INSPECTION_ID
where INSPECTION_TBL.INSPECTION_ID >=500000 and INSPECTION_TBL.INSPECTION_ID <1000000
and INSPECTION_TBL.INSPECTION_CD in('RTN','IRA', 'VUR', 'AWP', 'VUI')
and INSPECTION_TBL.PROP_ID is not null
and (INSPECTION_TBL.INSPECTION_SCORE is not null OR INSPECTION_TBL.INSPECTION_CD in( 'VUI'))
and PROPERTY_TBL.PROPERTY_UNIT_CNT > 0
ORDER BY
HUB_TBL.HUB_Name,
PROPERTY_TBL.PROP_ID,
PROPERTY_TBL.PROP_NAME,
INSPECTION_TBL.INSPN_FISC_YR,
ASSESSMENT_TBL.FISC_YR ;
quit;
Now, what you want is the maximum PROPERTY_ID for each ASMT_FISC_YR, except that INSPECTION_CODE='RTN' takes precedence. If you want to calculate that in a single query, the easiest way is probably to create a dummy variable, which is the property id plus a large number for 'RTN', and just the id for the others:
case INSPECTION_TBL.INSPECTION_CD
when 'RTN' then INSPECTION_TBL.INSPECTION_ID+1E8
else INSPECTION_TBL.INSPECTION_ID
end
You can then use that as your criterion for selecting, e.g.:
create table INSPN_FY as select distinct
HUB_TBL.HUB_Name,
PROPERTY_TBL.PROP_ID,
PROPERTY_TBL.PROP_NAME,
INSPECTION_TBL.INSPN_FISC_YR,
ASSESSMENTS.FISC_YR as ASMT_FISC_YR,
INSPECTION_TBL.INSPECTION_ID,
INSPECTION_TBL.INSPECTION_CD,
INSPECTION_TBL.INSPECTION_SCORE
from HUB_TBL INNER JOIN
PROPERTY_TBL ON HUB_TBL.HUB_ID = PROPERTY_TBL.HUB_ID INNER JOIN
INSPECTION_TBL ON INSPECTION_TBL.PROP_ID = PROPERTY_TBL.PROP_ID JOIN
(select
INSPECTION_ASSESSMENT.INSPECTION_ID,
ASSESSMENT_TBL.FISC_YR
from INSPECTION_ASSESSMENT JOIN
ASSESSMENT_TBL on INSPECTION_ASSESSMENT.ASMT_ID = ASSESSMENT_TBL.ASMT_ID
and INSPECTION_ASSESSMENT.GRP_ID = ASSESSMENT_TBL.GRP_ID
and INSPECTION_ASSESSMENT.VER_ID = ASSESSMENT_TBL.VER_ID
where ASSESSMENT_TBL.FISC_YR <> . ) ASSESSMENTS on ASSESSMENTS.INSPECTION_ID=INSPECTION_TBL.INSPECTION_ID
where INSPECTION_TBL.INSPECTION_ID >=500000 and INSPECTION_TBL.INSPECTION_ID <1000000
and INSPECTION_TBL.INSPECTION_CD in('RTN','IRA', 'VUR', 'AWP', 'VUI')
and INSPECTION_TBL.PROP_ID is not null
and (INSPECTION_TBL.INSPECTION_SCORE is not null OR INSPECTION_TBL.INSPECTION_CD in( 'VUI'))
and PROPERTY_TBL.PROPERTY_UNIT_CNT > 0
group by
INSPECTION_TBL.INSPECTION_ID,
ASSESSMENTS.FISC_YR
having case INSPECTION_TBL.INSPECTION_CD
when 'RTN' then INSPECTION_TBL.INSPECTION_ID+1E8
else INSPECTION_TBL.INSPECTION_ID
end
=max(case INSPECTION_TBL.INSPECTION_CD
when 'RTN' then INSPECTION_TBL.INSPECTION_ID+1E8
else INSPECTION_TBL.INSPECTION_ID
end)
ORDER BY
HUB_TBL.HUB_Name,
PROPERTY_TBL.PROP_ID,
PROPERTY_TBL.PROP_NAME,
INSPECTION_TBL.INSPN_FISC_YR,
ASSESSMENT_TBL.FISC_YR ;
quit;
It may be better, both logically and performance-wise, to calculate the criterion (and everything else) in a subquery, and then apply the GROUP BY, HAVING and ORDER BY to the results of that query. But I leave that as an exercise for the reader. And you may have to add some of the other columns to your GROUP BY.
Thank you. This code has gone through numerous changes based on feedback from various SMEs, particularly with regard to the WHERE statement. I tested your solution, and with some minor tweaks, I was able to get it to provide the results I needed. I had never performed joins on subsets, so this was something I would never have thought of on my own.
Catch the best of SAS Innovate 2025 — anytime, anywhere. Stream powerful keynotes, real-world demos, and game-changing insights from the world’s leading data and AI minds.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.