BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
goutamgm92
Calcite | Level 5

Hi,

 

I am using a proc SQL procedure, and I find a difference in the number of rows between the following queries in SAS.

 

 

proc sql; 
create table flash.test1 as 
select distinct profile_number, sas_id, min(flash_signup_date) as mindate format=Date9.  
from one.subscriptions where is_flash EQ '1'
and signup_date GT canceled_date
group by profile_number;
quit;



proc sql; 
create table flash.test1 as 
select profile_number, sas_id, min(flash_signup_date) as mindate format=Date9.  
from one.subscriptions where is_flash EQ '1'
and signup_date GT canceled_date
group by profile_number;
quit;

 

I understand if the column sas_id was not present, there would be no use of the distinct statement with a group by clause. But can someone clarify how the distinct in the above SQL statements changes the grouping when used with another additional column? 

 

Fun fact: Any combination of distinct and group by throws an error in other DBMS (tested on a couple)

1 ACCEPTED SOLUTION

Accepted Solutions
Tom
Super User Tom
Super User

Most databases will not accept a query that uses GROUP BY and includes columns in the result set which are neither one of the grouping variables or a summary statistic.

 

But SAS will happly remerge the summary statistics back onto all detail rows for you.

 

And if the there are duplicate rows then those are generated also.  But if you add the DISTINCT keyword then the duplicate result rows are removed.

 

data have;
  input grp $ extra $ val ;
cards;
A X 2
A X 3
A Y 4
;

proc sql ;
 create table test1 as 
   select grp,extra,sum(val) as tot_val
   from have
   group by 1
   order by 1,2,3
 ;
 create table test2 as 
   select distinct grp,extra,sum(val) as tot_val
   from have
   group by 1
   order by 1,2,3
 ;
quit;

View solution in original post

5 REPLIES 5
Tom
Super User Tom
Super User

Most databases will not accept a query that uses GROUP BY and includes columns in the result set which are neither one of the grouping variables or a summary statistic.

 

But SAS will happly remerge the summary statistics back onto all detail rows for you.

 

And if the there are duplicate rows then those are generated also.  But if you add the DISTINCT keyword then the duplicate result rows are removed.

 

data have;
  input grp $ extra $ val ;
cards;
A X 2
A X 3
A Y 4
;

proc sql ;
 create table test1 as 
   select grp,extra,sum(val) as tot_val
   from have
   group by 1
   order by 1,2,3
 ;
 create table test2 as 
   select distinct grp,extra,sum(val) as tot_val
   from have
   group by 1
   order by 1,2,3
 ;
quit;
goutamgm92
Calcite | Level 5
Awesome, thank you! One more observation irrespective of the number of rows in the output, the sum
(val) contained 9 for both scenarios.
kiranv_
Rhodochrosite | Level 12

Let us try to understand by an example similar to yours

 

data a;

input profile_number sas_id date:date9.;

format date date9.;

datalines;

123 345 23SEP1974

123 228 24SEP1974

123 607 25SEP1974

224 601 22SEP1974

224 601 30SEP1974

;

 

After creating dataset you run the below query

 

proc sql;

create table abcs as

select profile_number, sas_id, min(date) as mindate format=Date9.

from a

group by profile_number;

quit;

 

what this does is create min date with each group and remerges within that group as you see this note

 

NOTE: The query requires remerging summary statistics back with the original data.

resultant dataset is below (min date for 123 profile_id is 23sep1974 and is repeated within that group)

profile_number sas_id mindate
123 607 23SEP1974
123 345 23SEP1974
123 228 23SEP1974
224 601 22SEP1974
224 601 22SEP1974

 

if you see above last 2 row are same and when you apply distinct only one will be left out. In this step distinct is applied after  group by and gives result as shown below.

 

profile_number sas_id mindate
123 228 23SEP1974
123 345 23SEP1974
123 607 23SEP1974
224 601 22SEP1974

 

query for distinct is below

proc sql;

create table abc as

select distinct profile_number, sas_id, min(date) as mindate format=Date9.

from a

group by profile_number;

quit;

 

 

 

hope this example makes it clear.

 

jaskeerat_singh
Calcite | Level 5

Hi Kiran,

 

Although may not be related to topic but your example is what I was looking for.  If you do the same syntax in an sql server the 

result would be

 

123 345 23SEP1974 

224 601 22SEP1974

 

It is major difference I have seen while trying to filter based on max/min group by. In SAS you would need following code

 

proc sql;

create table abcs as

select distinct profile_number, sas_id, date as mindate format=Date9.

from a

group by profile_number

having date = min(date)

;

quit;

 

I don't know how many types of SQLs are there but I tried on some online websites and they give filtering without using having.

 

Just wanted to confirm if this is a documented difference with SQL and proc SQL or different SQLs ( MYSQL oracle) also behave differently.

 

Regards

Jaskeerat

Tom
Super User Tom
Super User

Please post your question as a new topic.  You might want to put in a complete example input and required results out.

SAS Innovate 2025: Call for Content

Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!

Submit your idea!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 5 replies
  • 24462 views
  • 1 like
  • 4 in conversation