topic Re: Checking for duplicates in SAS Programming

Checking for duplicates

Maisha_Huq — Tue, 08 Dec 2015 14:12:51 GMT

Hi there,

Whatr are possibilities to check for duplicates in a dataset where the expectation is that the duplicate entries for the same, say, patient will have differences in data entry for the key identifying variables (i.e. patient ID, date of visit, program name)? The rest of the variables, say, v1 - v20, however, may be the same.

Normally i may have : (i) used the nodup key, or (ii) created a concatenation of key identifying variables and singled out non-unique records for manual review.

In this case, would a sort of probabilistic data matching work better? what are ways to do this in sas?

Let me know if I can clarify -thank you!

Re: Checking for duplicates

Steelers_In_DC — Tue, 08 Dec 2015 14:23:49 GMT

Can you give an example dataset, where you have some entries that you want to be captured and others that you do not?

Re: Checking for duplicates

RW9 — Tue, 08 Dec 2015 14:26:30 GMT

Hi,

So how does your data look. The reason I ask is because you mention there are data entry differences in Patient ID. Now dates I could understand, but Patient ID is a fixed Identification number (hence the name), if that is different then I see no way of acurately identifying tht subjects data, nor would any process using that data be within compliance. An ID is specifically that, and should be set by a pre-generated randomisation list, and only created as the subject folder. There should be no differences, unless the subject gets re-randomised and gets another number.

Re: Checking for duplicates

morglum — Tue, 08 Dec 2015 14:52:38 GMT

Hi Maisha,

If you want to check which duplicates you have, you could do something like this:

I am assuming that a patientID never changes, but the programname and date for that person can change (multiple visits).

proc sql;

create table dupes as

select distinct PatientID, Date, ProgramName, Count(*) as Count

from input_base

group by PatientID, Date, ProgramName

having count>=2;

quit;

Re: Checking for duplicates

Maisha_Huq — Tue, 08 Dec 2015 15:45:05 GMT

Thanks Steelers and RW9!

Is there a way to identify, for each record in a dataset, the degree to which all the variable reponses (or a subset of variable responses) matches the the variable reponses (or a subset of variable responses) for each other record in the dataset?

I could then hand review only those records with a very high "matching score" to identify duplicates.

For context: I'm asking if there s way to do something like this in SAS because although the programs where the data is collect from assign Patient IDs uniquely, the dataset is gathered from data entry done from paper forms; sometimes a form is data entered twice by two diff people who enter different patient ID s for the same person.

Re: Checking for duplicates

Steelers_In_DC — Tue, 08 Dec 2015 16:04:08 GMT

Without an example dataset to work from I am guessing and making this up as I go along. This will give you each individual ID number and how many columns they have that are matching. It does not show what the differences are, only the amount of columns that are the same between two datasets:

data one;
input id$ var1 var2 var3;
cards;
100 1 1 1
200 2 2 2
300 3 3 3
;

data two;
input id$ var1 var2 var3;
cards;
100 1 0 0
200 1 1 1
300 3 3 3
;

proc transpose data=one out=tranone;by id;

proc transpose data=two out=trantwo(rename=(col1 = COL2));by id;

data prep;
merge tranone(in=a)
trantwo(in=b);
by id;
if a and b and col1 = col2;
count + 1;
if first.id then count = 1;
run;

data want;
set prep;
by id;
if last.id then output;
keep id count;
run;

Re: Checking for duplicates

Reeza — Tue, 08 Dec 2015 16:20:39 GMT

Let's talk about the scope of such an endeavor, which is not to say it's not impossible. First you have multiple records for each person. To determine if someone else is the same 'person' based solely on answers wouldn't you need to compare their answers to every other users answers. And if you expect the answers to be the same, why are you taking multiple measurements, or is safe to assume that the person would respond on the same date.

This type of analysis is called collusion analysis and would be similar to the algorithms that check for someone cheating on an exam. Your multiple answer scenario complicates it a bit, or make it more simple if you can consider the multiple responses and better indicators of a match.

That being said...I would have no idea how to implement such a method in SAS 😞