BookmarkSubscribeRSS Feed
Walternate
Obsidian | Level 7

Hi,

 

I have data at the person-month level, that is, each row is one person in one month. There are dupes. The variables I have are ID, month/year, and two categorical vars. 

 

ID         mth_year     categ_1     categ_2

1           200901            abc          xyz

1           200901            abc          pqr

1           200902            abc          pqr

2           201001            def           xyz

2           201001            ghi           xyz

2           201002            ghi           xyz

3           200907            abc          pqr

3           200908            abc          pqr

 

What I want is to do a data check: I want to know how many people have duplicate rows (as in, same ID and mth_year--not same categ_1 and categ_2) in their earliest mth_year. Then within that group, I want to know how many of them have different values of 1. categ_1 and 2. categ_2 across their dupe rows.

 

Any help is much appreciated. 

2 REPLIES 2
Astounding
PROC Star

The first question is easy to address.  Here's one way:

 

proc sort data=have;

by id mth_year;

run;

data want;

set have;

by id mth_year;

if first.id and last.mth_year=0;

run;

 

This gives you a data set with all the offenders within (one observation apiece).  If you only want to know the count, the log will tell you how many are in the data set.

 

Many posters could have gone this far.  I think the reason you haven't gotten responses has to do with the second question.  There are some ambiguities about what you are asking for. 

 

Are different values limited to the first month only, or could they appear in later months?

 

Are different values of categ_1 to be counted without inspecting categ_2?

 

With a little more clarification, I think that SQL could compare the original HAVE data set with the WANT data set above ... easily enough for a SQL programmer (though not for me).  I could do it in a DATA step if I had to, but it's a bit clumsy.  For example:

 

proc sort data=have;

by id mth_year categ_1;

run;

data want_categ1;

merge want (in=keepme) have;

by id mth_year;

if keepme;

if first.mth_year then counter = 0;

if first.categ_1 then counter + 1;

if last.mth_year and counter > 1;

run;

 

Then similarly for categ_2.

FreelanceReinh
Jade | Level 19

Hi @Walternate,

 

This should answer your questions:

 

proc sql;
create table earliest as
select *
from have
group by id
having mth_year=min(mth_year);

create table dupes as
select id, mth_year,
       count(*) as n,                    /* # obs. in a duplicate group */
      (count(distinct categ_1)>1) as d1, /* 0/1 flag "different CATEG_1 values" */
      (count(distinct categ_2)>1) as d2  /* 0/1 flag "different CATEG_2 values" */
from earliest
group by id, mth_year
having n>1;
quit;

proc freq data=dupes;
tables d1*d2;
run;

Dataset EARLIEST contains the observations of the earliest MTH_YEAR of each person from your existing dataset (called HAVE). For your sample data it would comprise 5 observations.

 

 

In dataset DUPES duplicates are aggregated to person level with three aggregated measures:

  • n = number of obs. in the duplicate group
  • d1 = 1 if there are at least two different non-missing values of CATEG_1 in the duplicate group, otherwise 0.
  • d2 = 1 if there are at least two different non-missing values of CATEG_2 in the duplicate group, otherwise 0.

IDs with only one observation in dataset EARLIEST are not contained in DUPES. Hence, the number of observations in DUPES (=2 for your sample data) is the number of people who have "duplicate rows."

 

The PROC FREQ step determines the number (and percentage) of people with different (non-missing!) CATEG_1 values, ditto for CATEG_2 and also for each of the four possible combinations of 0/1 flags d1 and d2.

sas-innovate-2024.png

Don't miss out on SAS Innovate - Register now for the FREE Livestream!

Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 2 replies
  • 760 views
  • 0 likes
  • 3 in conversation