Help using Base SAS procedures

Distinct with group by in PROC SQL

Accepted Solution Solved
Reply
New Contributor
Posts: 2
Accepted Solution

Distinct with group by in PROC SQL

Hi,

 

I am using a proc SQL procedure, and I find a difference in the number of rows between the following queries in SAS.

 

 

proc sql; 
create table flash.test1 as 
select distinct profile_number, sas_id, min(flash_signup_date) as mindate format=Date9.  
from one.subscriptions where is_flash EQ '1'
and signup_date GT canceled_date
group by profile_number;
quit;



proc sql; 
create table flash.test1 as 
select profile_number, sas_id, min(flash_signup_date) as mindate format=Date9.  
from one.subscriptions where is_flash EQ '1'
and signup_date GT canceled_date
group by profile_number;
quit;

 

I understand if the column sas_id was not present, there would be no use of the distinct statement with a group by clause. But can someone clarify how the distinct in the above SQL statements changes the grouping when used with another additional column? 

 

Fun fact: Any combination of distinct and group by throws an error in other DBMS (tested on a couple)


Accepted Solutions
Solution
‎05-12-2017 01:41 PM
Super User
Super User
Posts: 6,498

Re: Distinct with group by in PROC SQL

Most databases will not accept a query that uses GROUP BY and includes columns in the result set which are neither one of the grouping variables or a summary statistic.

 

But SAS will happly remerge the summary statistics back onto all detail rows for you.

 

And if the there are duplicate rows then those are generated also.  But if you add the DISTINCT keyword then the duplicate result rows are removed.

 

data have;
  input grp $ extra $ val ;
cards;
A X 2
A X 3
A Y 4
;

proc sql ;
 create table test1 as 
   select grp,extra,sum(val) as tot_val
   from have
   group by 1
   order by 1,2,3
 ;
 create table test2 as 
   select distinct grp,extra,sum(val) as tot_val
   from have
   group by 1
   order by 1,2,3
 ;
quit;

View solution in original post


All Replies
Solution
‎05-12-2017 01:41 PM
Super User
Super User
Posts: 6,498

Re: Distinct with group by in PROC SQL

Most databases will not accept a query that uses GROUP BY and includes columns in the result set which are neither one of the grouping variables or a summary statistic.

 

But SAS will happly remerge the summary statistics back onto all detail rows for you.

 

And if the there are duplicate rows then those are generated also.  But if you add the DISTINCT keyword then the duplicate result rows are removed.

 

data have;
  input grp $ extra $ val ;
cards;
A X 2
A X 3
A Y 4
;

proc sql ;
 create table test1 as 
   select grp,extra,sum(val) as tot_val
   from have
   group by 1
   order by 1,2,3
 ;
 create table test2 as 
   select distinct grp,extra,sum(val) as tot_val
   from have
   group by 1
   order by 1,2,3
 ;
quit;
New Contributor
Posts: 2

Re: Distinct with group by in PROC SQL

Awesome, thank you! One more observation irrespective of the number of rows in the output, the sum
(val) contained 9 for both scenarios.
PROC Star
Posts: 252

Re: Distinct with group by in PROC SQL

[ Edited ]

Let us try to understand by an example similar to yours

 

data a;

input profile_number sas_id date:date9.;

format date date9.;

datalines;

123 345 23SEP1974

123 228 24SEP1974

123 607 25SEP1974

224 601 22SEP1974

224 601 30SEP1974

;

 

After creating dataset you run the below query

 

proc sql;

create table abcs as

select profile_number, sas_id, min(date) as mindate format=Date9.

from a

group by profile_number;

quit;

 

what this does is create min date with each group and remerges within that group as you see this note

 

NOTE: The query requires remerging summary statistics back with the original data.

resultant dataset is below (min date for 123 profile_id is 23sep1974 and is repeated within that group)

profile_number sas_id mindate
123 607 23SEP1974
123 345 23SEP1974
123 228 23SEP1974
224 601 22SEP1974
224 601 22SEP1974

 

if you see above last 2 row are same and when you apply distinct only one will be left out. In this step distinct is applied after  group by and gives result as shown below.

 

profile_number sas_id mindate
123 228 23SEP1974
123 345 23SEP1974
123 607 23SEP1974
224 601 22SEP1974

 

query for distinct is below

proc sql;

create table abc as

select distinct profile_number, sas_id, min(date) as mindate format=Date9.

from a

group by profile_number;

quit;

 

 

 

hope this example makes it clear.

 

☑ This topic is SOLVED.

Need further help from the community? Please ask a new question.

Discussion stats
  • 3 replies
  • 401 views
  • 0 likes
  • 3 in conversation