DATA Step, Macro, Functions and more

Unify the name variation in a data set

Accepted Solution Solved
Reply
Occasional Contributor
Posts: 5
Accepted Solution

Unify the name variation in a data set

[ Edited ]

I have a column of firm names, where each firm has several departments (for example different series of one investment fund) and each of those department has a separate string of names. The names of one company's departments only differ lightly in the end, with completely random lengths and structure.

 

I would like to merge all those observations into one representative observation for the whole company. Precisely, I'd like to remove the words like "*something* series" or "series *something*" (please see the sample data I included following). Maybe some way to remove the last 2 whole words in the name.

 

While I suppose I can make a new variable that consists a string of the first several characters of the name and take 'last.name', the names won't be of correct length.

 

Here is the sample of my data. I have more than 150,000 lines like this so brute force method is a very willing NO.

 

I would really appreciate any suggestions, as I have been squeezing my brain 2 3 days now. Thank you very much in advance.

 

 

Data I have:

AGF Dividend Income Fund MF Series
AGF Dividend Income Fund Series D
AGF Dividend Income Fund Series F
AGF Dividend Income Fund Series V
Anchor Managed High Income Fund Class A 220
Anchor Managed High Income Fund Class F 221
BMO Dividend Class Advisor Series
BMO Dividend Class Series A
BMO Dividend Class Series H
BMO Dividend Fund Advisor Series
BMO Dividend Fund Series A
BMO Dividend Fund Series D
BMO Dividend Fund Series F
BMO Dividend Fund Series F6
BMO Dividend Fund Series T5
BMO Enhanced Equity Income Fund Advisors Series
BMO Enhanced Equity Income Fund Series A
BMO Enhanced Equity Income Fund Series D
BMO Enhanced Equity Income Fund Series F
BMO GDN Dividend Growth Fund Class F5
BMO GDN Dividend Growth Fund Class T5
BMO Growth and Income Fund Advisor Series
BMO Growth and Income Fund Classic Series
BMO Growth and Income Fund Series F
BMO Growth and Income Fund Series T5
BMO Growth and Income Fund Series T8
BMO Monthly High Income Fund II Advisor Series
BMO Monthly High Income Fund II Series A
BMO Monthly High Income Fund II Series D
BMO Monthly High Income Fund II Series F
BMO Monthly High Income Fund II Series T5
BMO Monthly High Income Fund II Series T8
Beutel Goodman Canadian Dividend Fund Class B
Beutel Goodman Canadian Dividend Fund Class D
Beutel Goodman Canadian Dividend Fund Class F
Beutel Goodman Canadian Dividend Fund Class I

 

Data I want:

AGF Dividend Income Fund
Anchor Managed High Income Fund

BMO Dividend Class

BMO Dividend Fund 

BMO Enhanced Equity Income Fund

BMO GDN Dividend Growth Fund

BMO Growth and Income Fund

BMO Monthly High Income Fund II

Beutel Goodman Canadian Dividend Fund

 


Accepted Solutions
Solution
‎12-11-2017 08:46 PM
Super User
Posts: 6,543

Re: Merge observations by name in groups

Removing the last two words isn't that difficult.  For example:

 

if countw(oldvar)  > 3 then newvar = substr(oldvar, 1, length(oldvar) - 2 - length(scan(oldvar, -1)) - length(scan(oldvar, -2)));

View solution in original post


All Replies
PROC Star
Posts: 1,351

Re: Merge observations by name in groups

Please take a look at PRX functions and apply regex. That should help I think.

Occasional Contributor
Posts: 5

Re: Merge observations by name in groups

Posted in reply to novinosrin
Thank you for your reply. I'll look into those functions. As I am new to SAS, may I ask if PRX and regex are newbie-friendly or advanced knowledge?
PROC Star
Posts: 1,351

Re: Merge observations by name in groups

Definitely not newbie, but definitely not hard for a newbie to learn either. I could write the code for you, but I am about to go home. I am pretty certain one of super users will write that for you. Just hang in there and wait for responses. PRX makes my eyes and sinuses hurt looking at them though lol

Occasional Contributor
Posts: 5

Re: Merge observations by name in groups

[ Edited ]
Posted in reply to novinosrin

Thank you for your kind and helpful reply. I'd appreciate it a lot. I'll try to see if I can make anything of use from PRX too.

Super User
Posts: 22,874

Re: Merge observations by name in groups

PRX is basically Perl Regular expressions. You can find a lot of tutorials on how to build it online. 

 

Basically you're not doing an exact match so that makes your logic difficult to implement. You can look at COMPGED/COMPLEV as well for distance calculations but fuzzy matching is time intensive work in general. 

Solution
‎12-11-2017 08:46 PM
Super User
Posts: 6,543

Re: Merge observations by name in groups

Removing the last two words isn't that difficult.  For example:

 

if countw(oldvar)  > 3 then newvar = substr(oldvar, 1, length(oldvar) - 2 - length(scan(oldvar, -1)) - length(scan(oldvar, -2)));

Occasional Contributor
Posts: 5

Re: Merge observations by name in groups

Posted in reply to Astounding

That worked out very well. I wasn't familiar with some of those functions before, so it would help a ton if you could briefly explain the logic behind your codes.

 

Sadly my data name variable does not have any more reliable structure (2 words, 3 words, mixed up order etc.), so after removing 2 last words I'll still have to manually check for exceptions.

 

But still, the codes cut down probably hours of work for dummy me Smiley Very Happy Appreciate it.

Super User
Posts: 6,543

Re: Merge observations by name in groups

The functions ...

 

COUNTW counts the number of words in a string.  You can control the delimiters used (not necessary here).

 

SCAN retrieves a specific word from a list of words.  The second argument of -1 or -2 means start at the right and move from right to left (instead of moving from left to right).

 

SUBSTR retrieves a portion of a character string.  Here, it begins at character #1, and takes the number of characters indicated by the third parameter (total length, minus 2, minus length of the last two words).

 

LENGTH is obvious except for one quirk that doesn't apply here.  LENGTH never returns a zero.  If the incoming string is blank, it still returns a 1.

Occasional Contributor
Posts: 5

Re: Merge observations by name in groups

Posted in reply to Astounding

Thank you for your insightful reply. You're a life saver Smiley Very Happy

☑ This topic is solved.

Need further help from the community? Please ask a new question.

Discussion stats
  • 9 replies
  • 260 views
  • 7 likes
  • 4 in conversation