BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
dapenDaniel
Obsidian | Level 7

Hi SAS experts!

 

I have a dataset as below.

 

FirmID                 Name

10001           Smith, Bob Alpha

10001           Smith, Bob A.

10002           Bloomberg, Jack Beta

10002           Bloomberg, Jack B.

 

The first two names are actually for the same person and the last two names are also for the same person. The logic is if one name (like Bloomgerg, Jack B.) is contained in the full name (like Bloomberg, Jack Beta) in the same company, then delete the short name and keep the completed name. Is there any way that I can get the dataset like below?

 

FirmID                 Name

10001           Smith, Bob Alpha

10002           Bloomberg, Jack Beta

 

Thanks!

 

1 ACCEPTED SOLUTION

Accepted Solutions
sustagens
Pyrite | Level 9

If all the shortened names have a period in your data and your actual scenario is as simple as your sample, then you can just eliminate all rows that contain '.'

 

PROC SQL;
   CREATE TABLE WANT AS 
   SELECT FirmID, 
          Name
      FROM HAVE
      WHERE (Name NOT CONTAINS '.');
QUIT;

 

View solution in original post

8 REPLIES 8
PeterClemmensen
Tourmaline | Level 20

What is the logic here? If a name is contained in another name within the same FirmID, then consider as the same Name?

dapenDaniel
Obsidian | Level 7

Hi @PeterClemmensen 

 

Thanks for your reply. You are correct. I also revised my question further. The logic is if one name (like Bloomgerg, Jack B.) is contained in the full name (like Bloomberg, Jack Beta) in the same company, then delete the short name and keep the completed name.

andreas_lds
Jade | Level 19

If you can define the rules when to keep which observation, i am sure that code can be written to do so. From the data you have posted, i don't see a rule explaining why for FirmID 10001 the first name is taken, and for the other FirmID the second name.

dapenDaniel
Obsidian | Level 7

Hi @andreas_lds

 

Thanks for your reply. I have revised my question. The logic is if one name (like Bloomgerg, Jack B.) is contained in the full name (like Bloomberg, Jack Beta) in the same company, then delete the short name and keep the completed name.

andreas_lds
Jade | Level 19

@dapenDaniel wrote:

Hi @andreas_lds

 

Thanks for your reply. I have revised my question. The logic is if one name (like Bloomgerg, Jack B.) is contained in the full name (like Bloomberg, Jack Beta) in the same company, then delete the short name and keep the completed name.


Maybe this is a language problem, but

Bloomberg, Jack B.

is not part of

Bloomberg, Jack Beta

At least not, if punctuation marks are not removed before comparison.

Ksharp
Super User

There are too many things you need to consider about. 

Here could get you a start .

 

data have;
input FirmID                 Name $40.;
cards;
10001           Smith, Bob Alpha
10001           Smith, Bob A.
10002           Bloomberg, Jack Beta
10002           Bloomberg, Jack B.
;
proc sql;
select distinct a.*
 from have as a,have as b
  where a.FirmID=b.FirmID and a.Name ne b.Name and
   a.Name contains strip(substr(b.Name,1,length(b.name)-1));
quit;
 
mkeintz
PROC Star

Can we presume that last names will always match exactly for any given person? So we only need to do "contains" tests for the rest of each name?

 

And are all the realizations of a "contain" situations in which the shorter name exactly matches the first part to the remaining name?  Or can you have

   Smith, B. James

   Smith, Brian James

 

 

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
sustagens
Pyrite | Level 9

If all the shortened names have a period in your data and your actual scenario is as simple as your sample, then you can just eliminate all rows that contain '.'

 

PROC SQL;
   CREATE TABLE WANT AS 
   SELECT FirmID, 
          Name
      FROM HAVE
      WHERE (Name NOT CONTAINS '.');
QUIT;

 

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

Mastering the WHERE Clause in PROC SQL

SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 8 replies
  • 698 views
  • 0 likes
  • 6 in conversation