BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
KrisD
Fluorite | Level 6

I have a column of firm names, where each firm has several departments (for example different series of one investment fund) and each of those department has a separate string of names. The names of one company's departments only differ lightly in the end, with completely random lengths and structure.

 

I would like to merge all those observations into one representative observation for the whole company. Precisely, I'd like to remove the words like "*something* series" or "series *something*" (please see the sample data I included following). Maybe some way to remove the last 2 whole words in the name.

 

While I suppose I can make a new variable that consists a string of the first several characters of the name and take 'last.name', the names won't be of correct length.

 

Here is the sample of my data. I have more than 150,000 lines like this so brute force method is a very willing NO.

 

I would really appreciate any suggestions, as I have been squeezing my brain 2 3 days now. Thank you very much in advance.

 

 

Data I have:

AGF Dividend Income Fund MF Series
AGF Dividend Income Fund Series D
AGF Dividend Income Fund Series F
AGF Dividend Income Fund Series V
Anchor Managed High Income Fund Class A 220
Anchor Managed High Income Fund Class F 221
BMO Dividend Class Advisor Series
BMO Dividend Class Series A
BMO Dividend Class Series H
BMO Dividend Fund Advisor Series
BMO Dividend Fund Series A
BMO Dividend Fund Series D
BMO Dividend Fund Series F
BMO Dividend Fund Series F6
BMO Dividend Fund Series T5
BMO Enhanced Equity Income Fund Advisors Series
BMO Enhanced Equity Income Fund Series A
BMO Enhanced Equity Income Fund Series D
BMO Enhanced Equity Income Fund Series F
BMO GDN Dividend Growth Fund Class F5
BMO GDN Dividend Growth Fund Class T5
BMO Growth and Income Fund Advisor Series
BMO Growth and Income Fund Classic Series
BMO Growth and Income Fund Series F
BMO Growth and Income Fund Series T5
BMO Growth and Income Fund Series T8
BMO Monthly High Income Fund II Advisor Series
BMO Monthly High Income Fund II Series A
BMO Monthly High Income Fund II Series D
BMO Monthly High Income Fund II Series F
BMO Monthly High Income Fund II Series T5
BMO Monthly High Income Fund II Series T8
Beutel Goodman Canadian Dividend Fund Class B
Beutel Goodman Canadian Dividend Fund Class D
Beutel Goodman Canadian Dividend Fund Class F
Beutel Goodman Canadian Dividend Fund Class I

 

Data I want:

AGF Dividend Income Fund
Anchor Managed High Income Fund

BMO Dividend Class

BMO Dividend Fund 

BMO Enhanced Equity Income Fund

BMO GDN Dividend Growth Fund

BMO Growth and Income Fund

BMO Monthly High Income Fund II

Beutel Goodman Canadian Dividend Fund

 

1 ACCEPTED SOLUTION

Accepted Solutions
Astounding
PROC Star

Removing the last two words isn't that difficult.  For example:

 

if countw(oldvar)  > 3 then newvar = substr(oldvar, 1, length(oldvar) - 2 - length(scan(oldvar, -1)) - length(scan(oldvar, -2)));

View solution in original post

9 REPLIES 9
novinosrin
Tourmaline | Level 20

Please take a look at PRX functions and apply regex. That should help I think.

KrisD
Fluorite | Level 6
Thank you for your reply. I'll look into those functions. As I am new to SAS, may I ask if PRX and regex are newbie-friendly or advanced knowledge?
novinosrin
Tourmaline | Level 20

Definitely not newbie, but definitely not hard for a newbie to learn either. I could write the code for you, but I am about to go home. I am pretty certain one of super users will write that for you. Just hang in there and wait for responses. PRX makes my eyes and sinuses hurt looking at them though lol

KrisD
Fluorite | Level 6

Thank you for your kind and helpful reply. I'd appreciate it a lot. I'll try to see if I can make anything of use from PRX too.

Reeza
Super User

PRX is basically Perl Regular expressions. You can find a lot of tutorials on how to build it online. 

 

Basically you're not doing an exact match so that makes your logic difficult to implement. You can look at COMPGED/COMPLEV as well for distance calculations but fuzzy matching is time intensive work in general. 

Astounding
PROC Star

Removing the last two words isn't that difficult.  For example:

 

if countw(oldvar)  > 3 then newvar = substr(oldvar, 1, length(oldvar) - 2 - length(scan(oldvar, -1)) - length(scan(oldvar, -2)));

KrisD
Fluorite | Level 6

That worked out very well. I wasn't familiar with some of those functions before, so it would help a ton if you could briefly explain the logic behind your codes.

 

Sadly my data name variable does not have any more reliable structure (2 words, 3 words, mixed up order etc.), so after removing 2 last words I'll still have to manually check for exceptions.

 

But still, the codes cut down probably hours of work for dummy me 😄 Appreciate it.

Astounding
PROC Star

The functions ...

 

COUNTW counts the number of words in a string.  You can control the delimiters used (not necessary here).

 

SCAN retrieves a specific word from a list of words.  The second argument of -1 or -2 means start at the right and move from right to left (instead of moving from left to right).

 

SUBSTR retrieves a portion of a character string.  Here, it begins at character #1, and takes the number of characters indicated by the third parameter (total length, minus 2, minus length of the last two words).

 

LENGTH is obvious except for one quirk that doesn't apply here.  LENGTH never returns a zero.  If the incoming string is blank, it still returns a 1.

KrisD
Fluorite | Level 6

Thank you for your insightful reply. You're a life saver 😄

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 9 replies
  • 1123 views
  • 7 likes
  • 4 in conversation