Solved: Checking for duplicates

Maisha_Huq · Posted 12-08-2015 09:12 AM

Hi there,

Whatr are possibilities to check for duplicates in a dataset where the expectation is that the duplicate entries for the same, say, patient will have differences in data entry for the key identifying variables (i.e. patient ID, date of visit, program name)? The rest of the variables, say, v1 - v20, however, may be the same.

Normally i may have : (i) used the nodup key, or (ii) created a concatenation of key identifying variables and singled out non-unique records for manual review.

In this case, would a sort of probabilistic data matching work better? what are ways to do this in sas?

Let me know if I can clarify -thank you!

Steelers_In_DC · Posted 12-08-2015 11:04 AM

Without an example dataset to work from I am guessing and making this up as I go along. This will give you each individual ID number and how many columns they have that are matching. It does not show what the differences are, only the amount of columns that are the same between two datasets:

data one;
input id$ var1 var2 var3;
cards;
100 1 1 1
200 2 2 2
300 3 3 3
;

data two;
input id$ var1 var2 var3;
cards;
100 1 0 0
200 1 1 1
300 3 3 3
;

proc transpose data=one out=tranone;by id;

proc transpose data=two out=trantwo(rename=(col1 = COL2));by id;

data prep;
merge tranone(in=a)
trantwo(in=b);
by id;
if a and b and col1 = col2;
count + 1;
if first.id then count = 1;
run;

data want;
set prep;
by id;
if last.id then output;
keep id count;
run;

View solution in original post

Steelers_In_DC · Posted 12-08-2015 09:23 AM

Can you give an example dataset, where you have some entries that you want to be captured and others that you do not?

RW9 · Posted 12-08-2015 09:26 AM

Hi,

So how does your data look. The reason I ask is because you mention there are data entry differences in Patient ID. Now dates I could understand, but Patient ID is a fixed Identification number (hence the name), if that is different then I see no way of acurately identifying tht subjects data, nor would any process using that data be within compliance. An ID is specifically that, and should be set by a pre-generated randomisation list, and only created as the subject folder. There should be no differences, unless the subject gets re-randomised and gets another number.

Maisha_Huq · Posted 12-08-2015 10:45 AM

Thanks Steelers and RW9!

Is there a way to identify, for each record in a dataset, the degree to which all the variable reponses (or a subset of variable responses) matches the the variable reponses (or a subset of variable responses) for each other record in the dataset?

I could then hand review only those records with a very high "matching score" to identify duplicates.

For context: I'm asking if there s way to do something like this in SAS because although the programs where the data is collect from assign Patient IDs uniquely, the dataset is gathered from data entry done from paper forms; sometimes a form is data entered twice by two diff people who enter different patient ID s for the same person.

Steelers_In_DC · Posted 12-08-2015 11:04 AM

Without an example dataset to work from I am guessing and making this up as I go along. This will give you each individual ID number and how many columns they have that are matching. It does not show what the differences are, only the amount of columns that are the same between two datasets:

data one;
input id$ var1 var2 var3;
cards;
100 1 1 1
200 2 2 2
300 3 3 3
;

data two;
input id$ var1 var2 var3;
cards;
100 1 0 0
200 1 1 1
300 3 3 3
;

proc transpose data=one out=tranone;by id;

proc transpose data=two out=trantwo(rename=(col1 = COL2));by id;

data prep;
merge tranone(in=a)
trantwo(in=b);
by id;
if a and b and col1 = col2;
count + 1;
if first.id then count = 1;
run;

data want;
set prep;
by id;
if last.id then output;
keep id count;
run;

morglum · Posted 12-08-2015 09:51 AM

Hi Maisha,

If you want to check which duplicates you have, you could do something like this:

I am assuming that a patientID never changes, but the programname and date for that person can change (multiple visits).

proc sql;

create table dupes as

select distinct PatientID, Date, ProgramName, Count(*) as Count

from input_base

group by PatientID, Date, ProgramName

having count>=2;

quit;

Reeza · Posted 12-08-2015 11:20 AM

Let's talk about the scope of such an endeavor, which is not to say it's not impossible. First you have multiple records for each person. To determine if someone else is the same 'person' based solely on answers wouldn't you need to compare their answers to every other users answers. And if you expect the answers to be the same, why are you taking multiple measurements, or is safe to assume that the person would respond on the same date.

This type of analysis is called collusion analysis and would be similar to the algorithms that check for someone cheating on an exam. Your multiple answer scenario complicates it a bit, or make it more simple if you can consider the multiple responses and better indicators of a match.

That being said...I would have no idea how to implement such a method in SAS 😞

Checking for duplicates

Re: Checking for duplicates

Re: Checking for duplicates

Re: Checking for duplicates

Re: Checking for duplicates

Re: Checking for duplicates

Re: Checking for duplicates

Re: Checking for duplicates

Classroom Training Available!