Solved: Re: same person with different name

dapenDaniel · Posted 11-27-2019 01:08 AM

Hi SAS experts!

I have a dataset as below.

FirmID Name

10001 Smith, Bob Alpha

10001 Smith, Bob A.

10002 Bloomberg, Jack Beta

10002 Bloomberg, Jack B.

The first two names belong to the same person, and so do the last two. The logic is: if one name (like Bloomberg, Jack B.) is contained in a fuller name (like Bloomberg, Jack Beta) within the same company, then delete the short name and keep the complete name. Is there any way to get a dataset like the one below?

FirmID Name

10001 Smith, Bob Alpha

10002 Bloomberg, Jack Beta

Thanks!

sustagens · Posted 11-27-2019 07:49 PM

If all the shortened names have a period in your data and your actual scenario is as simple as your sample, then you can just eliminate all rows that contain '.'

PROC SQL;
   CREATE TABLE WANT AS 
   SELECT FirmID, 
          Name
      FROM HAVE
      WHERE (Name NOT CONTAINS '.');
QUIT;


PeterClemmensen · Posted 11-27-2019 01:22 AM

What is the logic here? If a name is contained in another name within the same FirmID, should the two be treated as the same name?


dapenDaniel · Posted 11-27-2019 11:11 AM

Hi @PeterClemmensen

Thanks for your reply. You are correct. I have also revised my question. The logic is: if one name (like Bloomberg, Jack B.) is contained in a fuller name (like Bloomberg, Jack Beta) within the same company, then delete the short name and keep the complete name.

andreas_lds · Posted 11-27-2019 01:23 AM

If you can define the rules for when to keep which observation, I am sure that code can be written to do so. From the data you have posted, I don't see a rule explaining why the first name is taken for FirmID 10001 and the second name for the other FirmID.

dapenDaniel · Posted 11-27-2019 11:10 AM

Hi @andreas_lds

Thanks for your reply. I have revised my question. The logic is: if one name (like Bloomberg, Jack B.) is contained in a fuller name (like Bloomberg, Jack Beta) within the same company, then delete the short name and keep the complete name.

andreas_lds · Posted 11-28-2019 01:32 AM

@dapenDaniel wrote:

Hi @andreas_lds

Thanks for your reply. I have revised my question. The logic is: if one name (like Bloomberg, Jack B.) is contained in a fuller name (like Bloomberg, Jack Beta) within the same company, then delete the short name and keep the complete name.

Maybe this is a language problem, but

Bloomberg, Jack B.

is not part of

Bloomberg, Jack Beta

At least not unless punctuation marks are removed before the comparison.
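
The point about punctuation can be checked with a minimal sketch like the one below. It assumes the only punctuation that matters is the trailing period; the variable names (short, full, short_clean, raw_hit, clean_hit) are made up for illustration and are not from the thread.

```sas
/* Sketch: compare the two names from the thread with and without the period */
data _null_;
   length short full short_clean $40;
   short = 'Bloomberg, Jack B.';
   full  = 'Bloomberg, Jack Beta';

   /* Raw test: the period in the short name prevents a substring match */
   raw_hit = index(full, strip(short)) > 0;

   /* COMPRESS removes every '.' from the short name before the test */
   short_clean = compress(short, '.');
   clean_hit = index(full, strip(short_clean)) > 0;

   /* Writes both flags to the log */
   put raw_hit= clean_hit=;
run;
```

With these two values the log should show raw_hit=0 and clean_hit=1, which is why the period has to be dealt with before any "contains" test.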

Ksharp · Posted 11-27-2019 06:53 AM

There are many things you need to consider.

This could get you started.

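/* Sample data from the question */
data have;
input FirmID Name $40.;
cards;
10001           Smith, Bob Alpha
10001           Smith, Bob A.
10002           Bloomberg, Jack Beta
10002           Bloomberg, Jack B.
;
/* Self-join within each firm: keep a name when it contains another name
   from the same firm with that name's last character (the period) removed.
   A firm with only one name has no partner row and will not be listed. */
proc sql;
select distinct a.*
 from have as a, have as b
  where a.FirmID=b.FirmID and a.Name ne b.Name and
   a.Name contains strip(substr(b.Name,1,length(b.Name)-1));
quit;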
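
A possible variation on the query above, not Ksharp's method: instead of selecting the long names, drop each name that is a shortened form of a longer name in the same firm. That way a firm with a single name is kept. This is only a sketch; it assumes short forms differ from the full name by an abbreviation ending in a period, and the output table WANT and the extra firm 10003 are hypothetical.

```sas
/* Sample rows from the thread, plus one hypothetical firm (10003)
   that has only a single name */
data have;
   infile datalines dlm='|' truncover;
   input FirmID Name :$40.;
   datalines;
10001|Smith, Bob Alpha
10001|Smith, Bob A.
10002|Bloomberg, Jack Beta
10002|Bloomberg, Jack B.
10003|Jones, Ann
;

proc sql;
   create table want as
   select a.FirmID, a.Name
   from have as a
   /* Drop a row when a longer name in the same firm contains
      this name with its periods removed */
   where not exists (
      select *
      from have as b
      where b.FirmID = a.FirmID
        and b.Name ne a.Name
        and length(b.Name) > length(a.Name)
        and b.Name contains strip(compress(a.Name, '.'))
   );
quit;
```

On this sample, WANT should hold the two full names plus the single name for firm 10003. Names abbreviated in the middle (for example a shortened first name followed by a full middle name) would not be caught by this test.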

mkeintz · Posted 11-27-2019 03:36 PM

Can we presume that last names will always match exactly for any given person? So we only need to do "contains" tests for the rest of each name?

And in every "contains" case, does the shorter name exactly match the beginning of the longer name? Or can you have a pair like this:

Smith, B. James

Smith, Brian James
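
If pairs like the two names above do occur, a plain substring test will miss them. One possible approach, sketched here under the assumption that both names have the same number of blank-separated words and that abbreviations are marked with a period, is to compare the names word by word. The variable names (short, full, s_tok, f_tok, same) are hypothetical.

```sas
/* Sketch: word-by-word prefix comparison of two names */
data _null_;
   length short full s_tok f_tok $40;
   short = 'Smith, B. James';
   full  = 'Smith, Brian James';

   /* Both names must have the same number of blank-separated words */
   same = (countw(short, ' ') = countw(full, ' '));

   do i = 1 to countw(short, ' ') while (same);
      /* Take the i-th word of each name; drop periods from the short one */
      s_tok = compress(scan(short, i, ' '), '.');
      f_tok = scan(full, i, ' ');
      /* INDEX returns 1 only when the short word starts the full word */
      same = (index(f_tok, strip(s_tok)) = 1);
   end;

   /* Writes the result flag to the log */
   put same=;
run;
```

For this pair the log should show same=1. Whether such a loose match is acceptable depends on the data; two different people such as "Smith, Brian James" and "Smith, Bob James" would both match "Smith, B. James".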


