topic Re: Count Consecutive Duplicates in SAS Enterprise Guide

Count Consecutive Duplicates

Amfam1 — Wed, 21 Nov 2007 16:17:26 GMT

I have found some code to split duplicate records into a seperate query, but I'm looking for a way to count the consecutive duplicate rows in a query without splitting them into seperate datasets. For example, I'm looking for an output like the "Dupicate_Count" column below:

Name Duplicate_Count
Mary 0
Mary 1
Mary 2
Bob 0
Bob 1
Bob 2
Bob 3
Fred 0
Fred 1
Fred 2

Re: Count Consecutive Duplicates

Cynthia_sas — Wed, 21 Nov 2007 17:04:18 GMT

Hi:
I'm sure there is a way to do this in SQL, however, I can't think of it. I do have a DATA step solution to calculate a counter the way you want. First, I made some test data (called WORK.TESTDATA):
[pre]
Obs Name Age

1 Alfred 14
2 Alfred 19
3 Alfred 29
4 Alice 13
5 Alice 23
6 Barbara 13
7 Barbara 23
8 Carol 14
9 Carol 19
10 Carol 29
11 Henry 14
12 Henry 19
13 Henry 29
14 James 12
15 James 22
16 Jane 12
17 Jane 22
18 Janet 15
19 Janet 20
20 Janet 30
[/pre]

Next, I used this program to make the counter (called DUPCNT) and a copy of the original dataset with the DUPCNT variable added (data set is WORK.NEWTEST):
[pre]
data newtest;
retain dupcnt;
set testdata;
by name;
** set counter to 0 for first name;
if first.name then dupcnt = 0;

** output the record;
output;

** increment the counter -- if next name is the same;
** the new counter value will be used and output. Otherwise,;
** if diff name, the counter value restarts at 0 for every new name;
** the RETAIN statement keeps DUPCNT as a "running" total or cumulative counter;
dupcnt + 1;
run;

proc print data=newtest;
title 'After creating Duplicate counter';
var name age dupcnt;
run;

[/pre]

And the output looks like this:
[pre]
After creating Duplicate counter

Obs Name Age dupcnt

1 Alfred 14 0
2 Alfred 19 1
3 Alfred 29 2
4 Alice 13 0
5 Alice 23 1
6 Barbara 13 0
7 Barbara 23 1
8 Carol 14 0
9 Carol 19 1
10 Carol 29 2
11 Henry 14 0
12 Henry 19 1
13 Henry 29 2
14 James 12 0
15 James 22 1
16 Jane 12 0
17 Jane 22 1
18 Janet 15 0
19 Janet 20 1
20 Janet 30 2
[/pre]

If you need help with the DATA step program, you might consider contacting Tech Support. To use the program in EG, you'd have to put the code in a code node.

cynthia

Re: Count Consecutive Duplicates

Doc_Duke — Thu, 22 Nov 2007 05:09:56 GMT

One thing to always be aware of in removing or counting "duplicates" is the definition of "duplicate" and the process of obtaining it. Your example and Cynthia's response will look at what are effectively duplicate KEYS, not duplicate RECORDS. To look at duplicate RECORDS, you have to sort on all variables in the dataset, and compare on all values. Recall this note from the PROC SORT documentation

"Because NODUPRECS checks only consecutive observations, some nonconsecutive duplicate observations might remain in the output data set. You can remove all duplicates with this option by sorting on all variables."

I believe SAS has had this "feature" since at least version 5.

Doc Muhlbaier
Duke

Re: Count Consecutive Duplicates

deleted_user — Thu, 22 Nov 2007 16:02:37 GMT

as a comparative exercise here is some SQL
It is a little different in the result which is discussed below.
Method: join/merge in summary counts
something like
[pre]proc sql;
create table ages as
select name, age, ages
from testdata
join ( select name as namect, count( distinct age) as ages
from testdata
group by name )
on name= namect
;
quit;[/pre]

of course, not really having "row order" information in sql, it seems a bit more complex to replicate the increasing counter within name.
It returns the rows of result in the order of the rows of testdata, so it might be useful that it doesn't need to be in any special order.
Drop the "distinct" word if all within name should be counted instead of counting unique ages .

PeterC

Re: Count Consecutive Duplicates

Cynthia_sas — Thu, 22 Nov 2007 17:52:23 GMT

I was hoping somebody would raise that issue (duplicate "records" versus duplicate counter). That's why I showed the differing values for AGE. What I showed was how to create the cumulative counter. Since the request was for the counter (and NOT the removal of the duplicated rows), I didn't compare anything except the NAME column or mention using PROC SORT.

My solution is only correct if you want a dup counter on duplicates for 1 variable only, but as Doc pointed out, if you want to set the counter for duplicate "records" then you would need to compare every field and set the counter accordingly. If I had taken NAME and AGE into account, then every row would have been unique.

cynthia