Hi there,
Whatr are possibilities to check for duplicates in a dataset where the expectation is that the duplicate entries for the same, say, patient will have differences in data entry for the key identifying variables (i.e. patient ID, date of visit, program name)? The rest of the variables, say, v1 - v20, however, may be the same.
Normally i may have : (i) used the nodup key, or (ii) created a concatenation of key identifying variables and singled out non-unique records for manual review.
In this case, would a sort of probabilistic data matching work better? what are ways to do this in sas?
Let me know if I can clarify -thank you!
Without an example dataset to work from I am guessing and making this up as I go along. This will give you each individual ID number and how many columns they have that are matching. It does not show what the differences are, only the amount of columns that are the same between two datasets:
data one;
input id$ var1 var2 var3;
cards;
100 1 1 1
200 2 2 2
300 3 3 3
;
data two;
input id$ var1 var2 var3;
cards;
100 1 0 0
200 1 1 1
300 3 3 3
;
proc transpose data=one out=tranone;by id;
proc transpose data=two out=trantwo(rename=(col1 = COL2));by id;
data prep;
merge tranone(in=a)
trantwo(in=b);
by id;
if a and b and col1 = col2;
count + 1;
if first.id then count = 1;
run;
data want;
set prep;
by id;
if last.id then output;
keep id count;
run;
Can you give an example dataset, where you have some entries that you want to be captured and others that you do not?
Hi,
So how does your data look. The reason I ask is because you mention there are data entry differences in Patient ID. Now dates I could understand, but Patient ID is a fixed Identification number (hence the name), if that is different then I see no way of acurately identifying tht subjects data, nor would any process using that data be within compliance. An ID is specifically that, and should be set by a pre-generated randomisation list, and only created as the subject folder. There should be no differences, unless the subject gets re-randomised and gets another number.
Thanks Steelers and RW9!
Is there a way to identify, for each record in a dataset, the degree to which all the variable reponses (or a subset of variable responses) matches the the variable reponses (or a subset of variable responses) for each other record in the dataset?
I could then hand review only those records with a very high "matching score" to identify duplicates.
For context: I'm asking if there s way to do something like this in SAS because although the programs where the data is collect from assign Patient IDs uniquely, the dataset is gathered from data entry done from paper forms; sometimes a form is data entered twice by two diff people who enter different patient ID s for the same person.
Without an example dataset to work from I am guessing and making this up as I go along. This will give you each individual ID number and how many columns they have that are matching. It does not show what the differences are, only the amount of columns that are the same between two datasets:
data one;
input id$ var1 var2 var3;
cards;
100 1 1 1
200 2 2 2
300 3 3 3
;
data two;
input id$ var1 var2 var3;
cards;
100 1 0 0
200 1 1 1
300 3 3 3
;
proc transpose data=one out=tranone;by id;
proc transpose data=two out=trantwo(rename=(col1 = COL2));by id;
data prep;
merge tranone(in=a)
trantwo(in=b);
by id;
if a and b and col1 = col2;
count + 1;
if first.id then count = 1;
run;
data want;
set prep;
by id;
if last.id then output;
keep id count;
run;
Hi Maisha,
If you want to check which duplicates you have, you could do something like this:
I am assuming that a patientID never changes, but the programname and date for that person can change (multiple visits).
proc sql;
create table dupes as
select distinct PatientID, Date, ProgramName, Count(*) as Count
from input_base
group by PatientID, Date, ProgramName
having count>=2;
quit;
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.