06-20-2016 05:16 AM
This has been bothering me for sometime. I am pretty good at writing SQL and have been doing so for a very long time. An ANSI compliant sql query with a select statement without a summary function but with group by statement removes duplicates from the recordset. Using SAS and the pass through facility does exactly that as it uses the native SQL of the database (which normally follows ANSI SQLrules). But I write a SQL query of local dataset using a Proc SQL, it does not remove duplicates and I get a log comment that the group by clause has been converted to an order by clause as the select or the optional having clause is not associated with a summaray function. Please note however, that using the DISTNCT word in the select statement removes the duplcates, that is ol!
Does that mean that Proc SQL is not ANSI compliant? I thought that SAS Proc SQL was fully compliant with ANSI rules and offers additional functionality over and above what ANSI requires for example 'remerging' facility where all of the select statement variables not associated with a summary function do not appear in the group by statement of the query.
Could someone please clarify this for me? I am using SAS 9.1 and 9.3
proc sql; create table Test.CalTable4SmartCast4 as Select a.CAL_DATE As CalDate , (Case When WeekDay(a.CAL_DATE) = 7 Then 'SATD' Else a.day_type End) As Day_Type , a.Region_Code From Test.CAL_TRADING_HH_V a Where Region_Code = 'VIC1' Group by a.CAL_DATE , (Case When WeekDay(a.CAL_DATE) = 7 Then 'SATD' Else a.day_type End) , a.Region_Code ; quit;
This produces an output with 192864 rows with duplicates - it should have produced an output with 4018 rows. Each row has been repeated 48 times as each date has 48 different interval for each half hour.
06-20-2016 05:39 AM
Add keyword DISTINCT into SELECT statement.
proc sql; select distinct sex, case when age<15 then 'Y' else 'N' end as dummy from sashelp.class group by sex,calculated dummy; quit;
06-20-2016 05:52 AM
Thanks for your response. That is not my query. I know that uisng the keyword distinct will do it.
What, in effect, I am asking is whether or not SAS SQL is fully compliant with ANSI SQL and additionally offers other functionality not offered by ANSI SQL (I gave an example in my initial posting).
ANSI SQL will remove duplicates if there is a group by statement and there are no variables in the select statement associated with any summary function. SAS SQL using Proc SQL when used on local datasets (not as SQL query using pass through facility) doesn't seem to do that. The question is : WHY?
Is this a bug is SAS SQL or is it by intent (which I am having difficulty accepting)?
06-20-2016 05:57 AM
Yes. It think SAS SQL is fully compliant with ANSI SQL. if you want this, you need a summary statistic variable: select sex, case ... end as dummy , sum(weight) as sum /*<--------*/
06-20-2016 06:16 AM
To be honest I have never come across anyone using group by to remove duplicates, so I am pretty much geussing here. Firstly, I couldn't find anywhere which states that ANSI SQL group by removes duplicates, in fact the only really interesting post I found was a guy stating that this wasn't its function:
So it sounds very much like its database specific. Another thing that comes to mind is that SAS datasets have automatic variables defined behind the scenes, such as _n_, which would make them distinct if you don't specify what to distinct select.
At the end of the day, its the programmers responsibility to exactly define the code (same goes for use of * and such like). You know your data, and your code, make it do what you want, and don't rely only quirks of the system, or guessing.
06-20-2016 06:25 AM
Where do you get the notion that ANSI SQL removes duplicates with a group by?
Please supply the relevant part of the ANSI SQL documentation.
06-20-2016 05:21 PM
Based on the other responses, whilst I do not the ANSI rules documentation, this is what I have been able to verify from at least 2 database applications in the last couple of days viz SQL Server 2008 R2 and Oracle 11g :-
If you have all the variable that appear in the Select list also appearing in the Group by list without any variable in the Select list being associated with any summary function, you will definitely remove the duplicates - there is no question about it. If my memory serves me right - this goes back a few years - this was also how the SQL worked in DB2 database application.
And this is exactly what I am trying to find out - if SAS SQL is different in its implementation to the ANSI SQL implementation because as I have said earlier in each of the above database application a query like so :-
Select Region, MonthlyDate, CountByDayType From Table1 Group by Region, MonthlyDate, CountByDayType
will remove duplicates by default and in Proc SQl on local datasets, it won't unless accompanied by Distinct keyword.
I am not looking for a solution to any problem, I am only looking for a confirmation if the implemenation of SAS SQL is different to ANSI SQL.
I got this emall from someone last night in response but now it says that the post I am looking for has been permanently deleted. Here is the text from that email :-
In most DB SQL implementations, if you have a GROUP BY, all variables listed in the select clause must be included in the GROUP BY or be part of a summary function.
This is not the case in SAS. If you include other variables that are not in the GROUP BY statement then you will not have distinct values, since the other variable will be also selected. This can be a useful feature but can also be unexpected if you're using to working with DB implementations of SQL.
Your particular query doesn't include any summary functions so I'm not sure what the GROUP BY is intended to achieve in this particular case. It seems like a implicit way to specify distinct?
Whosoever sent me this, please be it known that I agree with that too. I also understand how SAS SQL works with 'remerging back with the original dataset' when variables in select list do not all appear in the group by listng - no problem. I am fully aware that my query doesn't include any summary function and that is by intent to specifically determine if SAS and ANSI SQL (in some of the more common database applications) work differently!