I am using a count + group by to flag rows that are complete duplicates. The result removes the extra rows, while I prefer to keep the original dataset, with just an added column. I don't use a select distinct so can't understand why rows are being deleted. Any clue why it happens and how to avoid it?
data t;
input ID$ name$;
cards;
a010 Steve
a010 James
a011 Harvey
a012 Carl
a012 Carl;
run;
proc sql;
create table dup_flag as
select *, count(*) as n
from t
group by ID, name;
quit;
Hi @SarahDew, when you use GROUP BY, it's going to collapse those columns into a single record based on ID and NAME. If you want to keep the original structure, I would just join the original table to your original query:
data t;
input ID$ name :$6.;
cards;
a010 Steve
a010 James
a011 Harvey
a012 Carl
a012 Carl
;
run;
proc sql;
create table dup_flag as
select
t.*
, a.n
from
t
left join
(
select
id
, name
, count(*) as n
from
t
group by
id
, name
) a
on t.id = a.id
and t.name = a.name
;
quit;
ID name n a010 James 1 a010 Steve 1 a011 Harvey 1 a012 Carl 2 a012 Carl 2
I think that's what you're trying to get at, but I may not have understood the question. Let me know - happy to help.
Hi @SarahDew, when you use GROUP BY, it's going to collapse those columns into a single record based on ID and NAME. If you want to keep the original structure, I would just join the original table to your original query:
data t;
input ID$ name :$6.;
cards;
a010 Steve
a010 James
a011 Harvey
a012 Carl
a012 Carl
;
run;
proc sql;
create table dup_flag as
select
t.*
, a.n
from
t
left join
(
select
id
, name
, count(*) as n
from
t
group by
id
, name
) a
on t.id = a.id
and t.name = a.name
;
quit;
ID name n a010 James 1 a010 Steve 1 a011 Harvey 1 a012 Carl 2 a012 Carl 2
I think that's what you're trying to get at, but I may not have understood the question. Let me know - happy to help.
Hello @SarahDew,
Alternatively, you can group by a unique key derived from ID and name (instead of by ID, name). This will trigger automatic remerging (see the note in the log "The query requires remerging summary statistics back with the original data.") and thus prevent the unwanted aggregation. For your example data (and in most other cases) a simple concatenation works as the unique key:
group by ID||name;
... Or you can simply fool SAS/SQL into auto-remerge by pretending to perform an operation on one of the GROUP BY columns, for example:
proc sql;
/*create table dup_flag as */
select *, count(*) as n
from t
group by ID, trim(name);
quit;
If all your variables are in the GROUP BY clause, you must request the remerge explicitly, for example:
proc sql;
/*create table dup_flag as */
select *
from t natural join (
select *, count(*) as n
from t
group by ID, name);
quit;
The error is in your data step. The last step has a semi colon is at the incorrect location just after Carl
The corrected code is shown below. I moved the semicolon to the next line.
data t;
input ID$ name$;
cards;
a010 Steve
a010 James
a011 Harvey
a012 Carl
a012 Carl
;
run;
proc sql;
select ID, NAME, count(*) as n
from t
group by id, name;
quit;
The result would be as expected. Nothing is deleted
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.