Re: Checking dupes in earliest row for an ID

Walternate · Posted 04-25-2016 03:51 PM

Hi,

I have data at the person-month level, that is, each row is one person in one month. There are dupes. The variables I have are ID, month/year, and two categorical vars.

ID mth_year categ_1 categ_2

1 200901 abc xyz

1 200901 abc pqr

1 200902 abc pqr

2 201001 def xyz

2 201001 ghi xyz

2 201002 ghi xyz

3 200907 abc pqr

3 200908 abc pqr

What I want is to do a data check: I want to know how many people have duplicate rows (as in, same ID and mth_year--not same categ_1 and categ_2) in their earliest mth_year. Then within that group, I want to know how many of them have different values of 1. categ_1 and 2. categ_2 across their dupe rows.

Any help is much appreciated.

Astounding · Posted 04-25-2016 05:20 PM

The first question is easy to address. Here's one way:

proc sort data=have;

by id mth_year;

run;

data want;

set have;

by id mth_year;

if first.id and last.mth_year=0;

run;

This gives you a data set with all the offenders within (one observation apiece). If you only want to know the count, the log will tell you how many are in the data set.

Many posters could have gone this far. I think the reason you haven't gotten responses has to do with the second question. There are some ambiguities about what you are asking for.

Are different values limited to the first month only, or could they appear in later months?

Are different values of categ_1 to be counted without inspecting categ_2?

With a little more clarification, I think that SQL could compare the original HAVE data set with the WANT data set above ... easily enough for a SQL programmer (though not for me). I could do it in a DATA step if I had to, but it's a bit clumsy. For example:

proc sort data=have;

by id mth_year categ_1;

run;

data want_categ1;

merge want (in=keepme) have;

by id mth_year;

if keepme;

if first.mth_year then counter = 0;

if first.categ_1 then counter + 1;

if last.mth_year and counter > 1;

run;

Then similarly for categ_2.

FreelanceReinh · Posted 04-26-2016 07:41 AM

Hi @Walternate,

This should answer your questions:

proc sql;
create table earliest as
select *
from have
group by id
having mth_year=min(mth_year);

create table dupes as
select id, mth_year,
       count(*) as n,                    /* # obs. in a duplicate group */
      (count(distinct categ_1)>1) as d1, /* 0/1 flag "different CATEG_1 values" */
      (count(distinct categ_2)>1) as d2  /* 0/1 flag "different CATEG_2 values" */
from earliest
group by id, mth_year
having n>1;
quit;

proc freq data=dupes;
tables d1*d2;
run;

Dataset EARLIEST contains the observations of the earliest MTH_YEAR of each person from your existing dataset (called HAVE). For your sample data it would comprise 5 observations.

In dataset DUPES duplicates are aggregated to person level with three aggregated measures:

n = number of obs. in the duplicate group
d1 = 1 if there are at least two different non-missing values of CATEG_1 in the duplicate group, otherwise 0.
d2 = 1 if there are at least two different non-missing values of CATEG_2 in the duplicate group, otherwise 0.

IDs with only one observation in dataset EARLIEST are not contained in DUPES. Hence, the number of observations in DUPES (=2 for your sample data) is the number of people who have "duplicate rows."

The PROC FREQ step determines the number (and percentage) of people with different (non-missing!) CATEG_1 values, ditto for CATEG_2 and also for each of the four possible combinations of 0/1 flags d1 and d2.

Checking dupes in earliest row for an ID

Re: Checking dupes in earliest row for an ID

Re: Checking dupes in earliest row for an ID

Click image to register for webinar

Classroom Training Available!