topic Re: Merge observations by name in groups in SAS Programming

Unify the name variation in a data set

KrisD — Tue, 12 Dec 2017 01:47:24 GMT

I have a column of firm names, where each firm has several departments (for example different series of one investment fund) and each of those department has a separate string of names. The names of one company's departments only differ lightly in the end, with completely random lengths and structure.

I would like to merge all those observations into one representative observation for the whole company. Precisely, I'd like to remove the words like "*something* series" or "series *something*" (please see the sample data I included following). Maybe some way to remove the last 2 whole words in the name.

While I suppose I can make a new variable that consists a string of the first several characters of the name and take 'last.name', the names won't be of correct length.

Here is the sample of my data. I have more than 150,000 lines like this so brute force method is a very willing NO.

I would really appreciate any suggestions, as I have been squeezing my brain 2 3 days now. Thank you very much in advance.

Data I have:

AGF Dividend Income Fund MF Series
AGF Dividend Income Fund Series D
AGF Dividend Income Fund Series F
AGF Dividend Income Fund Series V
Anchor Managed High Income Fund Class A 220
Anchor Managed High Income Fund Class F 221
BMO Dividend Class Advisor Series
BMO Dividend Class Series A
BMO Dividend Class Series H
BMO Dividend Fund Advisor Series
BMO Dividend Fund Series A
BMO Dividend Fund Series D
BMO Dividend Fund Series F
BMO Dividend Fund Series F6
BMO Dividend Fund Series T5
BMO Enhanced Equity Income Fund Advisors Series
BMO Enhanced Equity Income Fund Series A
BMO Enhanced Equity Income Fund Series D
BMO Enhanced Equity Income Fund Series F
BMO GDN Dividend Growth Fund Class F5
BMO GDN Dividend Growth Fund Class T5
BMO Growth and Income Fund Advisor Series
BMO Growth and Income Fund Classic Series
BMO Growth and Income Fund Series F
BMO Growth and Income Fund Series T5
BMO Growth and Income Fund Series T8
BMO Monthly High Income Fund II Advisor Series
BMO Monthly High Income Fund II Series A
BMO Monthly High Income Fund II Series D
BMO Monthly High Income Fund II Series F
BMO Monthly High Income Fund II Series T5
BMO Monthly High Income Fund II Series T8
Beutel Goodman Canadian Dividend Fund Class B
Beutel Goodman Canadian Dividend Fund Class D
Beutel Goodman Canadian Dividend Fund Class F
Beutel Goodman Canadian Dividend Fund Class I

Data I want:

AGF Dividend Income Fund
Anchor Managed High Income Fund

BMO Dividend Class

BMO Dividend Fund

BMO Enhanced Equity Income Fund

BMO GDN Dividend Growth Fund

BMO Growth and Income Fund

BMO Monthly High Income Fund II

Beutel Goodman Canadian Dividend Fund

Re: Merge observations by name in groups

novinosrin — Mon, 11 Dec 2017 21:56:38 GMT

Please take a look at PRX functions and apply regex. That should help I think.

Re: Merge observations by name in groups

KrisD — Mon, 11 Dec 2017 22:00:47 GMT

Thank you for your reply. I'll look into those functions. As I am new to SAS, may I ask if PRX and regex are newbie-friendly or advanced knowledge?

Re: Merge observations by name in groups

novinosrin — Mon, 11 Dec 2017 22:03:49 GMT

Definitely not newbie, but definitely not hard for a newbie to learn either. I could write the code for you, but I am about to go home. I am pretty certain one of super users will write that for you. Just hang in there and wait for responses. PRX makes my eyes and sinuses hurt looking at them though lol

Re: Merge observations by name in groups

KrisD — Mon, 11 Dec 2017 22:08:05 GMT

Thank you for your kind and helpful reply. I'd appreciate it a lot. I'll try to see if I can make anything of use from PRX too.

Re: Merge observations by name in groups

Reeza — Mon, 11 Dec 2017 22:39:21 GMT

PRX is basically Perl Regular expressions. You can find a lot of tutorials on how to build it online.

Basically you're not doing an exact match so that makes your logic difficult to implement. You can look at COMPGED/COMPLEV as well for distance calculations but fuzzy matching is time intensive work in general.

Re: Merge observations by name in groups

Astounding — Tue, 12 Dec 2017 00:10:35 GMT

Removing the last two words isn't that difficult. For example:

if countw(oldvar) > 3 then newvar = substr(oldvar, 1, length(oldvar) - 2 - length(scan(oldvar, -1)) - length(scan(oldvar, -2)));

Re: Merge observations by name in groups

KrisD — Tue, 12 Dec 2017 01:46:04 GMT

That worked out very well. I wasn't familiar with some of those functions before, so it would help a ton if you could briefly explain the logic behind your codes.

Sadly my data name variable does not have any more reliable structure (2 words, 3 words, mixed up order etc.), so after removing 2 last words I'll still have to manually check for exceptions.

But still, the codes cut down probably hours of work for dummy me 😄 Appreciate it.

Re: Merge observations by name in groups

Astounding — Tue, 12 Dec 2017 02:19:31 GMT

The functions ...

COUNTW counts the number of words in a string. You can control the delimiters used (not necessary here).

SCAN retrieves a specific word from a list of words. The second argument of -1 or -2 means start at the right and move from right to left (instead of moving from left to right).

SUBSTR retrieves a portion of a character string. Here, it begins at character #1, and takes the number of characters indicated by the third parameter (total length, minus 2, minus length of the last two words).

LENGTH is obvious except for one quirk that doesn't apply here. LENGTH never returns a zero. If the incoming string is blank, it still returns a 1.

Re: Merge observations by name in groups

KrisD — Tue, 12 Dec 2017 02:23:20 GMT

Thank you for your insightful reply. You're a life saver 😄