Hi SAS experts!
I have a dataset as below.
FirmID Name
10001 Smith, Bob Alpha
10001 Smith, Bob A.
10002 Bloomberg, Jack Beta
10002 Bloomberg, Jack B.
The first two names are actually for the same person and the last two names are also for the same person. The logic is if one name (like Bloomgerg, Jack B.) is contained in the full name (like Bloomberg, Jack Beta) in the same company, then delete the short name and keep the completed name. Is there any way that I can get the dataset like below?
FirmID Name
10001 Smith, Bob Alpha
10002 Bloomberg, Jack Beta
Thanks!
If all the shortened names have a period in your data and your actual scenario is as simple as your sample, then you can just eliminate all rows that contain '.'
PROC SQL;
CREATE TABLE WANT AS
SELECT FirmID,
Name
FROM HAVE
WHERE (Name NOT CONTAINS '.');
QUIT;
What is the logic here? If a name is contained in another name within the same FirmID, then consider as the same Name?
Thanks for your reply. You are correct. I also revised my question further. The logic is if one name (like Bloomgerg, Jack B.) is contained in the full name (like Bloomberg, Jack Beta) in the same company, then delete the short name and keep the completed name.
If you can define the rules when to keep which observation, i am sure that code can be written to do so. From the data you have posted, i don't see a rule explaining why for FirmID 10001 the first name is taken, and for the other FirmID the second name.
Hi @andreas_lds
Thanks for your reply. I have revised my question. The logic is if one name (like Bloomgerg, Jack B.) is contained in the full name (like Bloomberg, Jack Beta) in the same company, then delete the short name and keep the completed name.
@dapenDaniel wrote:
Hi @andreas_lds
Thanks for your reply. I have revised my question. The logic is if one name (like Bloomgerg, Jack B.) is contained in the full name (like Bloomberg, Jack Beta) in the same company, then delete the short name and keep the completed name.
Maybe this is a language problem, but
Bloomberg, Jack B.
is not part of
Bloomberg, Jack Beta
At least not, if punctuation marks are not removed before comparison.
There are too many things you need to consider about.
Here could get you a start .
data have;
input FirmID Name $40.;
cards;
10001 Smith, Bob Alpha
10001 Smith, Bob A.
10002 Bloomberg, Jack Beta
10002 Bloomberg, Jack B.
;
proc sql;
select distinct a.*
from have as a,have as b
where a.FirmID=b.FirmID and a.Name ne b.Name and
a.Name contains strip(substr(b.Name,1,length(b.name)-1));
quit;
Can we presume that last names will always match exactly for any given person? So we only need to do "contains" tests for the rest of each name?
And are all the realizations of a "contain" situations in which the shorter name exactly matches the first part to the remaining name? Or can you have
Smith, B. James
Smith, Brian James
If all the shortened names have a period in your data and your actual scenario is as simple as your sample, then you can just eliminate all rows that contain '.'
PROC SQL;
CREATE TABLE WANT AS
SELECT FirmID,
Name
FROM HAVE
WHERE (Name NOT CONTAINS '.');
QUIT;
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.
Find more tutorials on the SAS Users YouTube channel.