## Dealing with duplicate sets of rows

Frequent Contributor
Posts: 138

# Dealing with duplicate sets of rows

Hi,

I have a dataset in which each row is a person's score for a given event on a given date. People have 5 events in each day, as well as a total score which incorporates the 5 event scores but is not a straight sum. People can be in the dataset on multiple days.

ID      Date        Event      Score

1    10/1/2012     1                5

1    10/1/2012     2                3

1    10/1/2012     3                7

1    10/1/2012     4                2

1    10/1/2012     5                3

1    10/1/2012    tot              17

1    10/1/2012    tot              14

2    9/7/2014       1                3

2    9/7/2014       1             null

2    9/7/2014       2               7

2    9/7/2014       2             null

I'm trying to clean the data but it has 3 problems:

1. One person will have multiple sets of events on one day, in which one set has values and one set is null/missing

2. One person will have multiple sets of events on one day, in which both sets have identical values

3. One person will have multiple sets of events in one day, in which both sets have different values

This is my solution to 1 and 2:

Proc sort data=have;

By ID date event descending score; Run;

Data want;

Set have;

By ID date event descending score;

If first.event output want;

Run;

That should output the first of two identical rows or a non-missing row if one set of responses has values and the other set is missing. However, it does not address the third problem. I know I can't get rid of a row at random so if I could maybe create a flag or some sort of indicator that would show that the person had two distinct scores for that event, that would be helpful.

Additionally, when I was going through the data, the ones that I noticed people having multiple scores on the same day for were the total scores. If event=tot were the only ones for which a person had two distinct values of score, I could just delete the event=tot rows altogether, transpose the data, then calculate the total scores myself. However, I would need to be able to verify that it was only the event=tot rows that had two distinct responses, and I'm not sure how to do that.

Any help is much appreciated.

Super User
Posts: 5,878

## Re: Dealing with duplicate sets of rows

Are you saying that the stored total that is not a straight sum is wrong, and your calculation would be better?

First, why store totals, if they are, totals. That's the job for the reporting procedure.

Second, go back to your source to get better data, or at least more details, Maybe ther's a timestamp for each record, and you could use that for a cleansing rule.

Other than that, there is no silver bullet - you need to create a rule, and communicate that very clear to any person looking at this data/report. If you chose the first/last/medium, or just set it to missing/null - pretty much up to business requirements.

Data never sleeps
Contributor
Posts: 66

## Re: Dealing with duplicate sets of rows

This should give you what you need ...

``````DATA have;
input ID Date :MMDDYY10. Event \$ Score;
format date DATE9.;
datalines;
1 10/1/2012 1    5
1 10/1/2012 2    3
1 10/1/2012 3    7
1 10/1/2012 4    2
1 10/1/2012 5    3
1 10/1/2012 tot 17
1 10/1/2012 tot 14
2 9/7/2014  1    3
2 9/7/2014  1    .
2 9/7/2014  2    7
2 9/7/2014  2    .
;
RUN;

PROC SQL;
CREATE TABLE want AS
SELECT ID
, Date
, Event
, MAX(Score)            AS Max_Score
/* other helper variables are listed below ... */
, NMISS(Score)          AS Num_Missing_Score
, COUNT(DISTINCT Score) AS Num_Distinct_Score
FROM have
GROUP BY ID, Date, Event;
QUIT;``````

Here is the output:

Discussion stats
• 2 replies
• 208 views
• 0 likes
• 3 in conversation