Solved: Group by and count removes duplicate rows

SarahDew · Posted 03-03-2023 08:02 AM

I am using a count + group by to flag rows that are complete duplicates. The result removes the extra rows, while I prefer to keep the original dataset, with just an added column. I don't use a select distinct so can't understand why rows are being deleted. Any clue why it happens and how to avoid it?

data t;
input ID$ name$;
cards;
a010 Steve
a010 James
a011 Harvey
a012 Carl
a012 Carl;
run;

proc sql;
create table dup_flag as
select *, count(*) as n
from t
group by ID, name;
quit;

maguiremq · Posted 03-03-2023 08:20 AM

Hi @SarahDew, when you use GROUP BY, it's going to collapse those columns into a single record based on ID and NAME. If you want to keep the original structure, I would just join the original table to your original query:

data t;
input ID$ name :$6.;
cards;
a010 Steve
a010 James
a011 Harvey
a012 Carl
a012 Carl
;
run;

proc sql;
	create table dup_flag as
		select 
			t.*
			, a.n
		from 
			t
		left join
			(
				select
					id
					, name
					, count(*) as n
				from
					t
				group by
					id
					, name
			) a
			on t.id = a.id
			and t.name = a.name
	; 
quit;

ID	name	n
a010	James	1
a010	Steve	1
a011	Harvey	1
a012	Carl	2
a012	Carl	2

I think that's what you're trying to get at, but I may not have understood the question. Let me know - happy to help.

View solution in original post

maguiremq · Posted 03-03-2023 08:20 AM

Hi @SarahDew, when you use GROUP BY, it's going to collapse those columns into a single record based on ID and NAME. If you want to keep the original structure, I would just join the original table to your original query:

data t;
input ID$ name :$6.;
cards;
a010 Steve
a010 James
a011 Harvey
a012 Carl
a012 Carl
;
run;

proc sql;
	create table dup_flag as
		select 
			t.*
			, a.n
		from 
			t
		left join
			(
				select
					id
					, name
					, count(*) as n
				from
					t
				group by
					id
					, name
			) a
			on t.id = a.id
			and t.name = a.name
	; 
quit;

ID	name	n
a010	James	1
a010	Steve	1
a011	Harvey	1
a012	Carl	2
a012	Carl	2

I think that's what you're trying to get at, but I may not have understood the question. Let me know - happy to help.

FreelanceReinh · Posted 03-03-2023 10:26 AM

Hello @SarahDew,

Alternatively, you can group by a unique key derived from ID and name (instead of by ID, name). This will trigger automatic remerging (see the note in the log "The query requires remerging summary statistics back with the original data.") and thus prevent the unwanted aggregation. For your example data (and in most other cases) a simple concatenation works as the unique key:

group by ID||name;

PGStats · Posted 03-03-2023 02:13 PM

... Or you can simply fool SAS/SQL into auto-remerge by pretending to perform an operation on one of the GROUP BY columns, for example:

proc sql;
/*create table dup_flag as */
  select *, count(*) as n
  from t
  group by ID, trim(name);
quit;

PG

PGStats · Posted 03-03-2023 02:03 PM

If all your variables are in the GROUP BY clause, you must request the remerge explicitly, for example:

proc sql;
/*create table dup_flag as */
select *
from t natural join (
  select *, count(*) as n
  from t
  group by ID, name);
quit;

PG

Sajid01 · Posted 03-03-2023 06:34 PM

The error is in your data step. The last step has a semi colon is at the incorrect location just after Carl

The corrected code is shown below. I moved the semicolon to the next line.

data t;
input ID$ name$;
cards;
a010 Steve
a010 James
a011 Harvey
a012 Carl
a012 Carl
;
run;

proc sql;
select ID, NAME, count(*) as n
from t
group by id, name;
quit;

The result would be as expected. Nothing is deleted

Group by and count removes duplicate rows

Re: Group by and count removes duplicate rows

Re: Group by and count removes duplicate rows

Re: Group by and count removes duplicate rows

Re: Group by and count removes duplicate rows

Re: Group by and count removes duplicate rows

Re: Group by and count removes duplicate rows

Group by and count removes duplicate rows

Re: Group by and count removes duplicate rows

Re: Group by and count removes duplicate rows

Re: Group by and count removes duplicate rows

Re: Group by and count removes duplicate rows

Re: Group by and count removes duplicate rows

Re: Group by and count removes duplicate rows

Registration is open

SAS Training: Just a Click Away