Writing some SQL code which gives results that does not make sense. Is this a bug or is it suppose to be like this? I have used several SQL engines and none gives this results. Two examples below:
SAS:
proc SQL;
create table test (
a varchar(8),
b varchar(8));
insert into test (a, b) values ('XX_ABC01', 'ABCDXX01');
insert into test (a, b) values ('XX_ABC01', 'ABCDXX01');
insert into test (a, b) values ('XX_ABC02', 'ABCDXX02');
insert into test (a, b) values ('XY_DCY02', 'ABCDXX02');
;quit;
proc SQL;
select
substr(a, 1, 6) as a,
substr(b, 1, 6) as b,
count(*) as n
from test
group by
substr(a, 1, 6),
substr(b, 1, 6)
;quit;
result:
XX_ABC | ABCDXX | 3 |
XX_ABC | ABCDXX | 3 |
XY_DCY | ABCDXX | 1 |
SQL SERVER (See SQL fiddle for working example)
create table test (
a varchar(8),
b varchar(8));
insert into test (a, b) values ('XX_ABC01', 'ABCDXX01');
insert into test (a, b) values ('XX_ABC01', 'ABCDXX01');
insert into test (a, b) values ('XX_ABC02', 'ABCDXX02');
insert into test (a, b) values ('XY_DCY02', 'ABCDXX02');
select
substring(a, 1, 6) as a,
substring(b, 1, 6) as b,
count(*) as n
from test
group by
substring(a, 1, 6),
substring(b, 1, 6) ;
RESULT (correct in my opinion)
XX_ABC | ABCDXX | 3 |
XY_DCY | ABCDXX | 1 |
Since substr(a,1,6) (used in the group by) is a different object than new variable a from the select, SAS sees the need to remerge the summary statistics (see the NOTE in the log).
Use this instead:
proc SQL;
create table test (
a char(8),
b char(8)
);
insert into test (a, b) values ('XX_ABC01', 'ABCDXX01');
insert into test (a, b) values ('XX_ABC01', 'ABCDXX01');
insert into test (a, b) values ('XX_ABC02', 'ABCDXX02');
insert into test (a, b) values ('XY_DCY02', 'ABCDXX02');
quit;
proc SQL;
select
substr(a, 1, 6) as a,
substr(b, 1, 6) as b,
count(*) as n
from test
group by calculated a, calculated b
;
quit;
Note that SAS has no datatype varchar, only char.
Logically, it's the same, but technically it's just the result of a function (in the group by of the OP). 1,2 adresses the newly created variables, not the expressions, and is therefore equivalent to my usage of the calculated variables.
The SQL compiler is obviously not intelligent enough to check if the expressions in the group by are the same as those in the select. Maxim 31 in action.
LOL #31
All other SQL parsers manage the feat, so while maybe not a defect per se (and while I much prefer the calculated or 1,2 syntaxes anyway) I'd suggest this is at least a gap or a weakness.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.