BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
agarwaldvk
Calcite | Level 5

Hi Everybody

 

 

This has been bothering me for sometime. I am pretty good at writing SQL and have been doing so for a very long time. An ANSI compliant sql query with a select statement without a summary function but with group by statement removes duplicates from the recordset. Using SAS and the pass through facility does exactly that as it uses the native SQL of the database (which normally follows ANSI SQLrules). But I write a SQL query of local dataset using a Proc SQL, it does not remove duplicates and I get a log comment that the group by clause has been converted to an order by clause as the select or the optional having clause is not associated with a summaray function. Please note however, that using the DISTNCT word in the select statement removes the duplcates, that is ol!

 

Does that mean that Proc SQL is not ANSI compliant? I thought that SAS Proc SQL was fully compliant with ANSI rules and offers additional functionality over and above what ANSI requires for example 'remerging' facility where all of the select statement variables not associated with a summary function do not appear in the group by statement of the query.

 

Could someone please clarify this for me? I am using SAS 9.1 and 9.3

 

 

 

proc sql;
create table Test.CalTable4SmartCast4 as
Select	a.CAL_DATE As CalDate
	, (Case	When	WeekDay(a.CAL_DATE) = 7 Then
		        'SATD'
		Else	a.day_type
	  End) As Day_Type
	,	a.Region_Code
From	Test.CAL_TRADING_HH_V a
Where	Region_Code = 'VIC1'
Group by	a.CAL_DATE
		, (Case	When	WeekDay(a.CAL_DATE) = 7 Then
				'SATD'
			Else	a.day_type
		End)
		,	a.Region_Code	
;
quit;

 

 

This produces an output with 192864 rows with duplicates - it should have produced an output with 4018 rows. Each row has been repeated 48 times as each date has 48 different interval for each half hour.

 

Best regards

 

 

Deepak Agarwal

1 ACCEPTED SOLUTION

Accepted Solutions
kelpet
Calcite | Level 5

I'm a bit late to the party, but the behaviour you are seeing is in part a consequence of PROC SQL remerging by default.

You can overcome this globally by setting the option:

OPTIONS NOSQLREMERGE;

Or, on for an individual PROC SQL step:

PROC SQL NOREMERGE;

This combined with what @EtoUspeh mentioned (effectively, use the CALCULATED clause on the output fields in the SELECT statement).

 

For those that are from an SQL background, the SQLREMERGE option means that you can use aggregate functions without a GROUP BY clause. For example, this is valid in SAS, but not in ANSI SQL:

PROC SQL REMERGE;
SELECT id, COUNT(*) FROM tmp;
QUIT;

 

Editor's note: added the doc links for completeness.

For performance tuning, the SAS doc provides guidance on when you might want to disable REMERGE:

 

Disabling the Remerging of Data When Using Summary Functions

When you use a summary function in a SELECT clause or a HAVING clause, PROC SQL might remerge the data. Remerging the data involves two passes through the data. If you set the PROC SQL NOREMERGE option or the NOSQLREMERGE system option, PROC SQL will not process the remerging of data. When referencing database tables, performance is enhanced because it enables more of the query to be passed down to the database.

 

View solution in original post

9 REPLIES 9
Ksharp
Super User

Add keyword  DISTINCT into SELECT statement.

 

proc sql;
select distinct sex,
       case when age<15 then 'Y' 
        else 'N' end as dummy 
  from sashelp.class 
   group by sex,calculated dummy;
quit;
 
agarwaldvk
Calcite | Level 5

Hi Ksharp

 

 

Thanks for your response. That is not my query. I know that uisng the keyword distinct will do it.

 

What, in effect, I am asking is whether or not SAS SQL is fully compliant with ANSI SQL and additionally offers other functionality not offered by ANSI SQL (I gave an example in my initial posting).

 

ANSI SQL will remove duplicates if there is a group by statement and there are no variables in the select statement associated with any summary function. SAS SQL using Proc SQL when used on local datasets (not as SQL query using pass through facility) doesn't seem to do that. The question is : WHY?

 

Is this a bug is SAS SQL or is it by intent (which I am having difficulty accepting)?

 

 

Best regards

 

 

Deepak

Ksharp
Super User
Yes. It think SAS SQL is fully compliant with ANSI SQL.

if you want this, you need a summary statistic variable: 

select sex, case ... end as dummy ,  sum(weight) as sum   /*<--------*/


RW9
Diamond | Level 26 RW9
Diamond | Level 26

To be honest I have never come across anyone using group by to remove duplicates, so I am pretty much geussing here.  Firstly, I couldn't find anywhere which states that ANSI SQL group by removes duplicates, in fact the only really interesting post I found was a guy stating that this wasn't its function:

http://stackoverflow.com/questions/7152333/group-by-does-not-remove-duplicates


So it sounds very much like its database specific.  Another thing that comes to mind is that SAS datasets have automatic variables defined behind the scenes, such as _n_, which would make them distinct if you don't specify what to distinct select.  

 

At the end of the day, its the programmers responsibility to exactly define the code (same goes for use of * and such like).  You know your data, and your code, make it do what you want, and don't rely only quirks of the system, or guessing.

agarwaldvk
Calcite | Level 5

Hi Everyone

 

 

 

Based on the other responses, whilst I do not the ANSI rules documentation, this is what I have been able to verify from at least 2 database applications in the last couple of days viz SQL Server 2008 R2 and Oracle 11g :-

 

If you have all the variable that appear in the Select list also appearing in the Group by list without any variable in the Select list being associated with any summary function, you will definitely remove the duplicates - there is no question about it. If my memory serves me right - this goes back a few years - this was also how the SQL worked in DB2 database application.

 

And this is exactly what I am trying to find out - if SAS SQL is different in its implementation to the ANSI SQL implementation because as I have said earlier in each of the above database application a query like so :-

 

 

Select Region, MonthlyDate, CountByDayType
From Table1
Group by Region, MonthlyDate, CountByDayType

will remove duplicates by default and in Proc SQl on local datasets, it won't unless accompanied by Distinct keyword.

 

I am not looking for a solution to any problem, I am only looking for a confirmation if the implemenation of SAS SQL is different to ANSI SQL.

 

 

 

P.S.

 

I got this emall from someone last night in response but now it says that the post I am looking for has been permanently deleted. Here is the text from that email :-

 

Quote

In most DB SQL implementations, if you have a GROUP BY, all variables listed in the select clause must be included in the GROUP BY or be part of a summary function. 

 

This is not the case in SAS. If you include other variables that are not in the GROUP BY statement then you will not have distinct values, since the other variable will be also selected. This can be a useful feature but can also be unexpected if you're using to working with DB implementations of SQL. 

 

Your particular query doesn't include any summary functions so I'm not sure what the GROUP BY is intended to achieve in this particular case. It seems like a implicit way to specify distinct?

Unquote

 

Whosoever sent me this, please be it known that I agree with that too. I also understand how SAS SQL works with 'remerging back with the original dataset' when variables in select list do not all appear in the group by listng - no problem.  I am fully aware that my query doesn't include any summary function and that is by intent to specifically determine if SAS and ANSI SQL (in some of the more common database applications) work differently!

 

 

Best regards

 

 

 

Deepak

Reeza
Super User

Proc fedsql is ANSI compliant, proc SQL is not AFAIK. 

EtoUspeh
Fluorite | Level 6

By trial and error I found out that SAS does not understand that the same math expression in select and group by is actually the same variable. If you use var alias (text afer as in select) in group by than sas will remove duplicates

kelpet
Calcite | Level 5

I'm a bit late to the party, but the behaviour you are seeing is in part a consequence of PROC SQL remerging by default.

You can overcome this globally by setting the option:

OPTIONS NOSQLREMERGE;

Or, on for an individual PROC SQL step:

PROC SQL NOREMERGE;

This combined with what @EtoUspeh mentioned (effectively, use the CALCULATED clause on the output fields in the SELECT statement).

 

For those that are from an SQL background, the SQLREMERGE option means that you can use aggregate functions without a GROUP BY clause. For example, this is valid in SAS, but not in ANSI SQL:

PROC SQL REMERGE;
SELECT id, COUNT(*) FROM tmp;
QUIT;

 

Editor's note: added the doc links for completeness.

For performance tuning, the SAS doc provides guidance on when you might want to disable REMERGE:

 

Disabling the Remerging of Data When Using Summary Functions

When you use a summary function in a SELECT clause or a HAVING clause, PROC SQL might remerge the data. Remerging the data involves two passes through the data. If you set the PROC SQL NOREMERGE option or the NOSQLREMERGE system option, PROC SQL will not process the remerging of data. When referencing database tables, performance is enhanced because it enables more of the query to be passed down to the database.

 

sas-innovate-2024.png

Don't miss out on SAS Innovate - Register now for the FREE Livestream!

Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.

 

Register now!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 9 replies
  • 42237 views
  • 7 likes
  • 7 in conversation