BookmarkSubscribeRSS Feed
mayasak
Quartz | Level 8

I am attempting to "fuzzy match" first names and last names of individual patients with similar first/last names in a single dataset to remove duplicates and keep unique patients with similar "specimen_source".  I tried the following code but I did not get the result needed.

 

data want;
set Disease_Inc;
tmp1=soundex(first_name);
do i=1 to nobs;
set Disease_Inc (rename=(first_name=first_name2)) point=i nobs=nobs;
tmp2=soundex(first_name2);
dif=compged(tmp1,tmp2);
if dif>=70 then do;
possible_match='Yes';
drop i tmp1 tmp2 fname2;
output;
end;
end;
run;

 

This is the data: 

 

data Disease_Inc;
input Person_Id Incident_ID MRN $ Last_Name $ First_Name $ specimen_source $ Gender $ Race $;
datalines;
122345 56788 M776 Doe John Blood M W
34567 12344 L87765 Mouse Mickey Stool M B
90876 36476 009875 Mouse Minnie Blood F A
777654 227364 26376 Duck Louis Saliva M W
34567 12344 L87765 Moose Mickey Stool M B
90876 36476 009875 Mouse Minie Blood F A
777654 227364 26376 Duck Lewis blood M W
56761 23676 M765 Pippin Abel . F W
56761 23676 M765 Pipin Able Blood F W
667785 3624554 L97960 Smith Oliver . M I
345564 34567 9432 Johns Karen Stool F W
79908 123540 M78765 Doe Kevin Blood M W
677654 378754 L8544 Wayne Bruce Blood M B
run;

 

This is the intended result:

122345 56788 M776 Doe John Blood M W
34567 12344 L87765 Mouse Mickey Stool M B
90876 36476 009875 Mouse Minnie Blood F A
777654 227364 26376 Duck Louis Saliva M W
34567 12344 L87765 Moose Mickey Stool M B
90876 36476 009875 Mouse Minie Blood F B
777654 227364 26376 Duck Lewis blood M W
56761 23676 M765 Pippin Abel . F W
56761 23676 M765 Pipin Able Blood F W
667785 3624554 L97960 Smith Oliver . M I
345564 34567 9432 Johns Karen Stool F W
79908 123540 M78765 Doe Kevin Blood M W
677654 378754 L8544 Wayne Bruce Blood M B
run;

I would appreciate any and all suggestions or ideas and thank you in advance for your assistance.

Thank you

5 REPLIES 5
ballardw
Super User

Question: what external rule are you applying that says you keep Mickey Mouse and not Mickey Moose? and Minie Mouse vs Minnie Mouse?

 

There or just keep one of the two and you don't really care which?

 

Since Pippin Abel with missing specimen_source technically has a "different" specimen that is not the same as the one with Blood so should be kept, OR it is inconsistent with the rules as stated that do not remove one of the "Duck Lewis" or "Duck Louis".

 

An important question: You ask about names but apparently the Person_Id, Incident_ID and MRN should match from the examples you provide. If that is NOT the actual case then you should provide a more representative data set as those variables from the example are the obvious start to filter likely matches.

 

 

mayasak
Quartz | Level 8

Hi ballardw,

 

Thank you for your reply. 

  • There is no rule on which one to keep.
  • The data source is merged, so each of the observations have different variables that are not available in the second one. 
  •  Is there a rule to remove the one that have missing specimen_source such as Duck Lewis?
  • For the rest of the variables such as the mrn, incident_id and person_id, it doesn't matter if they are the same or different.

Thank you

mkeintz
PROC Star

@mayasak wrote:

Hi ballardw,

 

Thank you for your reply. 

  • There is no rule on which one to keep.

 


If there is no rule, then how can we provide a program to produce the results you want?

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
mayasak
Quartz | Level 8

I meant it doesn't matter which one to keep.

mkeintz
PROC Star
@mayasak wrote:

Hi ballardw,

 

Thank you for your reply. 

  • There is no rule on which one to keep.

 

If there is no rule, then how can we provide a program to produce the results you want?


Then @mayasak wrote:

I meant it doesn't matter which one to keep.


OK.  Then what is your rule to determine that both should not be kept?  I.e. how do you decide whether you have "similar first/last names"?    Is mouse similar to moose?  to mount?  to maus? to muose?

 

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

SAS Innovate 2025: Call for Content

Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!

Submit your idea!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 5 replies
  • 682 views
  • 0 likes
  • 3 in conversation