BookmarkSubscribeRSS Feed

When running a group by, one typically includes any variables in the GROUP BY that aren't summarized in the SELECT clause. However, if one field is unique (or a primary key), it can be costly to add min/max/other summary to ensure the query is only grouped on that unique variable.

 

any_value exists as a solution to this in other SQL variants: MySQL, Snowflake SQL, Google BigQuery, etc.

6 Comments
Tom
Super User
Super User

If the value is a constant then why not add it to the GROUP BY? If not then what is the point of using ANY_VALUE(X) instead of MIN(X) or MAX(X)?  

 

Also note that PROC SQL does not support "windowing" functions at all. It just has normal aggregate functions that operate over the full group of records.  The OVER or PARTITION BY keywords are not supported at all.

paulkaefer
Lapis Lazuli | Level 10

@Tom: the Snowflake example explains this well: min() or max() can be costly.

 

Sure, for many cases, this is not an issue, or performance is not noticeably impacted. But it's a feature suggestion as the function exists in these other SQL variants.

Quentin
Super User

Interesting.  Non-deterministic results from a query don't seem especially appealing to me.  Typically, even if I "know" a column is unique, I'm happy to include it in the group by clause so that I can prove/assert that it is unique.  But I can see how there is an efficiency trade-off.

ChrisNZ
Tourmaline | Level 20

1. I generally don't trust the data enough to trust such a function to always give me the expected result. I'd much rather use GROUP BY and be able to detect data issues later on. The DQ of most companies is just not there. For this use case, GROUP BY allows me to find DQ issues as part of the process. That's an overhead I am happy to pay.

 

2. I'd much rather have SQL window functions, or string summary functions (or even row-number) added, if SAS were ever to read and use the ballot pages for allocating R&D resources, which I think will never happen. The any_value function would be much less useful imho.

 

3. As least as useful as having this function would be that SAS use the table metadata and issues the "At least one nonessential grouping column reference has been removed" message when indexes or ICs with the UNIQUE constraint exist on the GROUP BY column.

 

paulkaefer
Lapis Lazuli | Level 10

1. great. This function might not be of use to you.

2. yep, window functions are super useful & I upvoted that. This suggestion is different in a nuanced way. Still used with a GROUP BY, but essentially like a neutral version of min/max. Honestly, min/max would be appropriate most of the time, and probably the dataset folks are using 90% of the time won't see a performance hit for min/max.

3. This function shouldn't remove any columns, but sure, SAS could log a message as it is wont to do. What are ICs? Just indexed columns?

ChrisNZ
Tourmaline | Level 20

> What are ICs?

Sorry. Integrity Constraints.

What I worded very badly and confusingly (now that I read it again) was: If a variable containing verified unique values appears in a GROUP BY clause, SAS should take that information into account. Not by dropping a variable from the GROUP BY clause as I stated, but by dropping any grouping at all.

The only vote this idea has is mine.  🙂