BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Maisha_Huq
Quartz | Level 8

Hi there,


Whatr are possibilities to check for duplicates in a dataset where the expectation is that the duplicate entries for the same, say, patient will have differences in data entry for the key identifying variables (i.e. patient ID, date of visit, program name)?  The rest of the variables, say, v1 - v20, however, may be the same.

 

Normally i may have : (i) used the nodup key, or (ii) created a concatenation of key identifying variables and singled out non-unique records for manual review. 

 

In this case, would a sort of probabilistic data matching work better? what are ways to do this in sas?

Let me know if I can clarify -thank you!

1 ACCEPTED SOLUTION

Accepted Solutions
Steelers_In_DC
Barite | Level 11

Without an example dataset to work from I am guessing and making this up as I go along.  This will give you each individual ID number and how many columns they have that are matching.  It does not show what the differences are, only the amount of columns that are the same between two datasets:

 

data one;
input id$ var1 var2 var3;
cards;
100 1 1 1
200 2 2 2
300 3 3 3
;

data two;
input id$ var1 var2 var3;
cards;
100 1 0 0
200 1 1 1
300 3 3 3
;

proc transpose data=one out=tranone;by id;

proc transpose data=two out=trantwo(rename=(col1 = COL2));by id;

data prep;
merge tranone(in=a)
      trantwo(in=b);
by id;
if a and b and col1 = col2;
count + 1;
if first.id then count = 1;
run;

data want;
set prep;
by id;
if last.id then output;
keep id count;
run;

 

 

 

 

View solution in original post

6 REPLIES 6
Steelers_In_DC
Barite | Level 11

Can you give an example dataset, where you have some entries that you want to be captured and others that you do not?

RW9
Diamond | Level 26 RW9
Diamond | Level 26

Hi,

 

So how does your data look.  The reason I ask is because you mention there are data entry differences in Patient ID.  Now dates I could understand, but Patient ID is a fixed Identification number (hence the name), if that is different then I see no way of acurately identifying tht subjects data, nor would any process using that data be within compliance.  An ID is specifically that, and should be set by a pre-generated randomisation list, and only created as the subject folder.  There should be no differences, unless the subject gets re-randomised and gets another number.

Maisha_Huq
Quartz | Level 8

Thanks Steelers and RW9!

 

Is there a way to identify, for each record in a dataset, the degree to which all the variable reponses (or a subset of variable responses) matches the the variable reponses (or a subset of variable responses)  for each other record in the dataset?


I could then hand review only those records with a very high "matching score" to identify duplicates.

 

For context: I'm asking if there s way to do something like this in SAS because although the programs where the data is collect from assign Patient IDs uniquely, the dataset is gathered from data entry done from paper forms; sometimes a form is data entered twice by two diff people who enter different patient ID s for the same person. 

 

 

Steelers_In_DC
Barite | Level 11

Without an example dataset to work from I am guessing and making this up as I go along.  This will give you each individual ID number and how many columns they have that are matching.  It does not show what the differences are, only the amount of columns that are the same between two datasets:

 

data one;
input id$ var1 var2 var3;
cards;
100 1 1 1
200 2 2 2
300 3 3 3
;

data two;
input id$ var1 var2 var3;
cards;
100 1 0 0
200 1 1 1
300 3 3 3
;

proc transpose data=one out=tranone;by id;

proc transpose data=two out=trantwo(rename=(col1 = COL2));by id;

data prep;
merge tranone(in=a)
      trantwo(in=b);
by id;
if a and b and col1 = col2;
count + 1;
if first.id then count = 1;
run;

data want;
set prep;
by id;
if last.id then output;
keep id count;
run;

 

 

 

 

morglum
Quartz | Level 8

 Hi Maisha,

 

If you want to check which duplicates you have, you could do something like this:

 

I am assuming that a patientID never changes, but the programname and date for that person can change (multiple visits).

 

proc sql;

create table dupes as

select distinct PatientID, Date, ProgramName, Count(*) as Count

from input_base

group by PatientID, Date, ProgramName

having count>=2;

quit;

 

 

Reeza
Super User
Let's talk about the scope of such an endeavor, which is not to say it's not impossible. First you have multiple records for each person. To determine if someone else is the same 'person' based solely on answers wouldn't you need to compare their answers to every other users answers. And if you expect the answers to be the same, why are you taking multiple measurements, or is safe to assume that the person would respond on the same date.

This type of analysis is called collusion analysis and would be similar to the algorithms that check for someone cheating on an exam. Your multiple answer scenario complicates it a bit, or make it more simple if you can consider the multiple responses and better indicators of a match.

That being said...I would have no idea how to implement such a method in SAS 😞

sas-innovate-2024.png

Don't miss out on SAS Innovate - Register now for the FREE Livestream!

Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 6 replies
  • 2719 views
  • 3 likes
  • 5 in conversation