Solved: Distinct with group by in PROC SQL

goutamgm92 · Posted 05-12-2017 01:13 PM

Hi,

I am using PROC SQL, and I get a different number of rows from the following two queries in SAS.

proc sql; 
create table flash.test1 as 
select distinct profile_number, sas_id, min(flash_signup_date) as mindate format=Date9.  
from one.subscriptions where is_flash EQ '1'
and signup_date GT canceled_date
group by profile_number;
quit;



proc sql; 
create table flash.test1 as 
select profile_number, sas_id, min(flash_signup_date) as mindate format=Date9.  
from one.subscriptions where is_flash EQ '1'
and signup_date GT canceled_date
group by profile_number;
quit;

I understand that if the column sas_id were not present, DISTINCT would be redundant with a GROUP BY clause. But can someone clarify how DISTINCT in the first query changes the result when an additional, non-grouped column is selected?

Fun fact: Any combination of distinct and group by throws an error in other DBMS (tested on a couple)

Tom · Posted 05-12-2017 01:33 PM

Most databases will not accept a query that uses GROUP BY and includes columns in the result set that are neither grouping variables nor summary statistics.

But SAS will happily remerge the summary statistics back onto all detail rows for you.

If that produces duplicate rows, those duplicates appear in the result as well. But if you add the DISTINCT keyword, the duplicate result rows are removed.

data have;
  input grp $ extra $ val ;
cards;
A X 2
A X 3
A Y 4
;

proc sql ;
 create table test1 as 
   select grp,extra,sum(val) as tot_val
   from have
   group by 1
   order by 1,2,3
 ;
 create table test2 as 
   select distinct grp,extra,sum(val) as tot_val
   from have
   group by 1
   order by 1,2,3
 ;
quit;
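
As a rough sketch of what the remerge amounts to, the same result can be written out by hand as a summary query joined back to the detail rows. The table names totals, remerged and remerged_distinct are made up for this illustration, and this is only an equivalent way to express the result, not a description of how SAS runs the query internally.

```sas
data have;
  input grp $ extra $ val;
cards;
A X 2
A X 3
A Y 4
;

proc sql;
  /* Step 1: one summary row per group (A -> 9) */
  create table totals as
    select grp, sum(val) as tot_val
    from have
    group by grp;

  /* Step 2: put the summary back on every detail row.       */
  /* Gives 3 rows: (A X 9), (A X 9), (A Y 9), same as test1. */
  create table remerged as
    select h.grp, h.extra, t.tot_val
    from have as h inner join totals as t
      on h.grp = t.grp
    order by 1, 2, 3;

  /* Step 3: DISTINCT drops the repeated (A X 9) row.  */
  /* Gives 2 rows: (A X 9), (A Y 9), same as test2.    */
  create table remerged_distinct as
    select distinct h.grp, h.extra, t.tot_val
    from have as h inner join totals as t
      on h.grp = t.grp
    order by 1, 2, 3;
quit;
```

Written this way, the query uses only constructs that stricter databases also accept.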

goutamgm92 · Posted 05-12-2017 01:42 PM

Awesome, thank you! One more observation: irrespective of the number of rows in the output, sum(val) was 9 in both scenarios.
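
The total is the same in both because the summary is computed over all detail rows first, and DISTINCT only removes duplicate rows from the finished result. A small sketch makes the order visible. The dataset have2 is hypothetical (it is not from the thread) and contains one truly duplicated detail row; the output table names are also made up.

```sas
/* Hypothetical data: the first two rows are exact duplicates */
data have2;
  input grp $ extra $ val;
cards;
A X 2
A X 2
A Y 4
;

proc sql;
  /* Sum first, then DISTINCT: the total uses all three rows. */
  /* Result: (A X 8), (A Y 8)                                 */
  create table sum_then_distinct as
    select distinct grp, extra, sum(val) as tot_val
    from have2
    group by grp;

  /* De-duplicate first, then sum: the total uses two rows. */
  /* Result: (A X 6), (A Y 6)                               */
  create table distinct_then_sum as
    select grp, extra, sum(val) as tot_val
    from (select distinct grp, extra, val from have2) as d
    group by grp;
quit;
```

So if the duplicates should not count toward the statistic, they have to be removed before the summary, for example in an inline view as in the second query.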

kiranv_ · Posted 05-12-2017 01:43 PM

Let us try to understand this with an example similar to yours.

data a;
input profile_number sas_id date:date9.;
format date date9.;
datalines;
123 345 23SEP1974
123 228 24SEP1974
123 607 25SEP1974
224 601 22SEP1974
224 601 30SEP1974
;

After creating the dataset, run the query below:

proc sql;
create table abcs as
select profile_number, sas_id, min(date) as mindate format=Date9.
from a
group by profile_number;
quit;

This computes the minimum date within each group and remerges it onto every row of that group, as the following note in the log shows:

NOTE: The query requires remerging summary statistics back with the original data.

The resulting dataset is below (the minimum date for profile_number 123 is 23SEP1974, and it is repeated on every row of that group):

profile_number sas_id mindate
123 607 23SEP1974
123 345 23SEP1974
123 228 23SEP1974
224 601 22SEP1974
224 601 22SEP1974

As you can see, the last two rows above are identical, and when you apply DISTINCT only one of them is kept. DISTINCT is applied after the GROUP BY and the remerge, and gives the result shown below.

profile_number sas_id mindate
123 228 23SEP1974
123 345 23SEP1974
123 607 23SEP1974
224 601 22SEP1974

The query with DISTINCT is below:

proc sql;
create table abc as
select distinct profile_number, sas_id, min(date) as mindate format=Date9.
from a
group by profile_number;
quit;

Hope this example makes it clear.
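
If you would rather have PROC SQL reject such a query, as stricter databases do, a minimal sketch is to turn remerging off with the NOREMERGE option on the PROC SQL statement. To my knowledge this makes the query stop with an error instead of writing the remerging NOTE, but check the behavior on your own SAS release.

```sas
data a;
  input profile_number sas_id date :date9.;
  format date date9.;
datalines;
123 345 23SEP1974
123 228 24SEP1974
123 607 25SEP1974
224 601 22SEP1974
224 601 30SEP1974
;

/* NOREMERGE: do not allow summary statistics to be remerged. */
/* sas_id is neither grouped nor summarized, so this query is */
/* expected to fail rather than return one row per detail row. */
proc sql noremerge;
  select profile_number, sas_id, min(date) as mindate format=date9.
  from a
  group by profile_number;
quit;
```

This can be a handy way to catch accidental remerges in code that is meant to be portable to other databases.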

jaskeerat_singh · Posted 07-07-2018 01:19 PM

Hi Kiran,

Although this may not be directly related to the topic, your example is what I was looking for. If you run the same query on an SQL server, the result would be

123 345 23SEP1974
224 601 22SEP1974

This is a major difference I have seen when trying to filter on the max/min within a group. In SAS you would need the following code:

proc sql;
create table abcs as
select distinct profile_number, sas_id, date as mindate format=Date9.
from a
group by profile_number
having date = min(date)
;
quit;
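
The HAVING query above still relies on SAS remerging min(date) onto the detail rows. A sketch that avoids remerging, and that uses only constructs other SQL implementations also accept, is to join the table to a per-group summary. The table name first_rows and the aliases t and m are made up for this illustration.

```sas
data a;
  input profile_number sas_id date :date9.;
  format date date9.;
datalines;
123 345 23SEP1974
123 228 24SEP1974
123 607 25SEP1974
224 601 22SEP1974
224 601 30SEP1974
;

proc sql;
  create table first_rows as
    select t.profile_number, t.sas_id, t.date as mindate format=date9.
    from a as t
      inner join
      /* one row per profile_number with its earliest date */
      (select profile_number, min(date) as mindate
         from a
         group by profile_number) as m
      on t.profile_number = m.profile_number
     and t.date = m.mindate
    order by t.profile_number;
quit;

/* Result: 123 345 23SEP1974   */
/*         224 601 22SEP1974   */
```

If several rows in a group share the minimum date, this returns all of them, so add DISTINCT or a tie-breaking rule if exactly one row per group is required.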

I don't know how many SQL dialects there are, but I tried a few online SQL sites and they return the filtered rows without a HAVING clause.

I just wanted to confirm whether this is a documented difference between standard SQL and PROC SQL, or whether other SQL implementations (MySQL, Oracle) also behave differently from one another.

Regards

Jaskeerat

Tom · Posted 07-07-2018 01:32 PM

Please post your question as a new topic. You might want to include a complete example of the input and the required output.
