DATA Step, Macro, Functions and more

Checking dupes in earliest row for an ID

Frequent Contributor
Posts: 138

Checking dupes in earliest row for an ID



I have data at the person-month level, that is, each row is one person in one month. There are dupes. The variables I have are ID, month/year, and two categorical vars. 


ID         mth_year     categ_1     categ_2

1           200901            abc          xyz

1           200901            abc          pqr

1           200902            abc          pqr

2           201001            def           xyz

2           201001            ghi           xyz

2           201002            ghi           xyz

3           200907            abc          pqr

3           200908            abc          pqr


What I want is to do a data check: I want to know how many people have duplicate rows (as in, same ID and mth_year--not same categ_1 and categ_2) in their earliest mth_year. Then within that group, I want to know how many of them have different values of 1. categ_1 and 2. categ_2 across their dupe rows.


Any help is much appreciated. 

Super User
Posts: 5,075

Re: Checking dupes in earliest row for an ID

[ Edited ]

The first question is easy to address.  Here's one way:


proc sort data=have;

by id mth_year;


data want;

set have;

by id mth_year;

if and last.mth_year=0;



This gives you a data set with all the offenders within (one observation apiece).  If you only want to know the count, the log will tell you how many are in the data set.


Many posters could have gone this far.  I think the reason you haven't gotten responses has to do with the second question.  There are some ambiguities about what you are asking for. 


Are different values limited to the first month only, or could they appear in later months?


Are different values of categ_1 to be counted without inspecting categ_2?


With a little more clarification, I think that SQL could compare the original HAVE data set with the WANT data set above ... easily enough for a SQL programmer (though not for me).  I could do it in a DATA step if I had to, but it's a bit clumsy.  For example:


proc sort data=have;

by id mth_year categ_1;


data want_categ1;

merge want (in=keepme) have;

by id mth_year;

if keepme;

if first.mth_year then counter = 0;

if first.categ_1 then counter + 1;

if last.mth_year and counter > 1;



Then similarly for categ_2.

Trusted Advisor
Posts: 1,115

Re: Checking dupes in earliest row for an ID

Hi @Walternate,


This should answer your questions:


proc sql;
create table earliest as
select *
from have
group by id
having mth_year=min(mth_year);

create table dupes as
select id, mth_year,
       count(*) as n,                    /* # obs. in a duplicate group */
      (count(distinct categ_1)>1) as d1, /* 0/1 flag "different CATEG_1 values" */
      (count(distinct categ_2)>1) as d2  /* 0/1 flag "different CATEG_2 values" */
from earliest
group by id, mth_year
having n>1;

proc freq data=dupes;
tables d1*d2;

Dataset EARLIEST contains the observations of the earliest MTH_YEAR of each person from your existing dataset (called HAVE). For your sample data it would comprise 5 observations.



In dataset DUPES duplicates are aggregated to person level with three aggregated measures:

  • n = number of obs. in the duplicate group
  • d1 = 1 if there are at least two different non-missing values of CATEG_1 in the duplicate group, otherwise 0.
  • d2 = 1 if there are at least two different non-missing values of CATEG_2 in the duplicate group, otherwise 0.

IDs with only one observation in dataset EARLIEST are not contained in DUPES. Hence, the number of observations in DUPES (=2 for your sample data) is the number of people who have "duplicate rows."


The PROC FREQ step determines the number (and percentage) of people with different (non-missing!) CATEG_1 values, ditto for CATEG_2 and also for each of the four possible combinations of 0/1 flags d1 and d2.

Ask a Question
Discussion stats
  • 2 replies
  • 3 in conversation