topic Re: Group by and count removes duplicate rows in SAS Programming

Group by and count removes duplicate rows

SarahDew — Fri, 03 Mar 2023 13:02:20 GMT

I am using a count + group by to flag rows that are complete duplicates. The result removes the extra rows, while I prefer to keep the original dataset, with just an added column. I don't use a select distinct so can't understand why rows are being deleted. Any clue why it happens and how to avoid it?

data t;
input ID$ name$;
cards;
a010 Steve
a010 James
a011 Harvey
a012 Carl
a012 Carl;
run;

proc sql;
create table dup_flag as
select *, count(*) as n
from t
group by ID, name;
quit;

Re: Group by and count removes duplicate rows

maguiremq — Fri, 03 Mar 2023 13:20:05 GMT

Hi @SarahDew, when you use GROUP BY, the rows are collapsed into a single record for each combination of ID and NAME. If you want to keep the original structure, I would just join the original table back to the grouped query:

data t;
input ID$ name :$6.;
cards;
a010 Steve
a010 James
a011 Harvey
a012 Carl
a012 Carl
;
run;

proc sql;
	create table dup_flag as
		select 
			t.*
			, a.n
		from 
			t
		left join
			(
				select
					id
					, name
					, count(*) as n
				from
					t
				group by
					id
					, name
			) a
			on t.id = a.id
			and t.name = a.name
	; 
quit;

ID	name	n
a010	James	1
a010	Steve	1
a011	Harvey	1
a012	Carl	2
a012	Carl	2

I think that's what you're trying to get at, but I may not have understood the question. Let me know - happy to help.

Re: Group by and count removes duplicate rows

FreelanceReinh — Fri, 03 Mar 2023 15:26:21 GMT

Hello @SarahDew,

Alternatively, you can group by a unique key derived from ID and name (instead of by ID, name). This will trigger automatic remerging (see the note in the log "The query requires remerging summary statistics back with the original data.") and thus prevent the unwanted aggregation. For your example data (and in most other cases) a simple concatenation works as the unique key:

group by ID||name;
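
For completeness, here is a sketch of how that clause might fit into the full query. It assumes the table t and the output name dup_flag from the original post, and is meant as an illustration rather than the only way to write it:

proc sql;
  /* ID and name are selected but are not themselves GROUP BY items */
  /* (only the expression ID||name is), so PROC SQL remerges the    */
  /* count back onto every original row instead of collapsing them. */
  create table dup_flag as
  select *, count(*) as n
  from t
  group by ID||name;
quit;

The log should show the remerging note quoted above, and dup_flag should keep one row per input row. The row order of the result is not guaranteed unless you add an ORDER BY clause.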

Re: Group by and count removes duplicate rows

PGStats — Fri, 03 Mar 2023 19:03:10 GMT

If all your variables are in the GROUP BY clause, you must request the remerge explicitly, for example:

proc sql;
/*create table dup_flag as */
select *
from t natural join (
  select *, count(*) as n
  from t
  group by ID, name);
quit;

Re: Group by and count removes duplicate rows

PGStats — Fri, 03 Mar 2023 19:13:50 GMT

... Or you can simply fool SAS/SQL into auto-remerge by pretending to perform an operation on one of the GROUP BY columns, for example:

proc sql;
/*create table dup_flag as */
  select *, count(*) as n
  from t
  group by ID, trim(name);
quit;

Re: Group by and count removes duplicate rows

Sajid01 — Fri, 03 Mar 2023 23:34:08 GMT

The error is in your data step. The last data line has a semicolon in the wrong place, immediately after Carl.

The corrected code is shown below. I moved the semicolon to the next line.

data t;
input ID$ name$;
cards;
a010 Steve
a010 James
a011 Harvey
a012 Carl
a012 Carl
;
run;

proc sql;
select ID, NAME, count(*) as n
from t
group by id, name;
quit;

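With this fix the data step reads all five rows. The GROUP BY query itself still returns one row per ID and name pair (a012 Carl appears once, with n = 2).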