BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
maxjiang6999
Calcite | Level 5

First I split the original company names by space and get var1-var16

 

data that have:

company names         var1              var2                   var3  .... var16

A & B                           A                    &                      B

Pear Inc.                     Pear               Inc.

Pears Corp.                Pears              Corp.

 

the ideal output:

company names         var1              var2                   var3 .... var16                   reduced_name

A & F                           A                    &                      F                                       A F

Pear Inc.                     Pear               Inc.                                                            Pear

Pears Corp.                Pears              Corp.                                                        Pears

 

The format of the company names is Char 30. (so there're different trailing blanks). So when I directly cat() or cats() them, I don't get what I want.

Can anyone help?

1 ACCEPTED SOLUTION

Accepted Solutions
PGStats
Opal | Level 21

You could use regular expressions:

 

data have;
length name $32;
input name $32.;
datalines;
A & B
Pear Inc.
Pears Corp. 
;

data want;
length cleanName $32;
if not prxID then 
    prxID + prxparse("s/\b(INC|LTD|HOLDINGS|CORP|CORPORATION|LLC|THE)\b\.?//io");
set have;
cleanName = compbl(prxChange(prxID, -1, translate(name," ", "&")));
drop prxID;
run;

proc print data=want noobs; run;
                                clean
                                Name     name

                                A B      A & B
                                Pear     Pear Inc.
                                Pears    Pears Corp.
PG

View solution in original post

13 REPLIES 13
Astounding
PROC Star

You don't show any of the rules for calculating the REDUCED_NAME, so I will leave that part to you.

 

The function you should be using is CATX, to leave a single blank between the pieces of the reduced name.  For example:

 

reduced_name = catx(' ', var1, var3);

maxjiang6999
Calcite | Level 5

Thanks for the quick reply. What I want to do for reduced_name is to exclude words like: "LTD","HOLDINGS","CORP","CORP.","CORPORATION","&","LLC","THE". I want to remove those "unimportant" words. Can you help on this?

Reeza
Super User

What about using TRANWRD() instead to replace the words you don't want?

If all the terms are in a separate file you can load the temporary array from the file. 

 

data clean;
set have;

array temp(7) $ _temporary_ ('INC', 'INC.', '&', 'CO', 'CO.', 'LTD.', 'LTD');

cleanVar = oldVar;

do i=1 to dim(temp);
cleanVar = tranwrd(cleanVar, temp(i), "");
end;

run;
maxjiang6999
Calcite | Level 5

I tried this code but it didn't work out.

The log only said: numeric values have been converted to character values at he places give by XXX. Variable oldvar is uninitialized. 

There is no error. But in my result, I only see missing values.

Reeza
Super User

@maxjiang6999 wrote:

I tried this code but it didn't work out.

The log only said: numeric values have been converted to character values at he places give by XXX. Variable oldvar is uninitialized. 

There is no error. But in my result, I only see missing values.


Did you change the variable names to reflect your variable names? 'didn't work out' isn't very informative, it doesn't convey any information about what you did, what happened and what the errors were. Including the code you ran and log is much more helpful. 

 

I tested it and it works partly for me, this was intended to get you started, by showing the process. You'll need to determine the rules and details because cleaning data is a finicky task and tedious. The part not working is the & and I did change it to make everything uppercase because INC does not match inc. You should get the new variable as an uppercase variable regardless. 

 

 78         data clean;
 79         set have;
 80         
 81         array temp(8) $ _temporary_ ('INC', 'INC.', '&', 'CO', 'CO.', 'LTD.', 'LTD', 'CORP');
 82         
 83         cleanVar = upcase(oldVar);
 84         
 85         do i=1 to dim(temp);
 86         cleanVar = tranwrd(cleanVar, temp(i), "");
 87         end;
 88         
 89         run;
 
 NOTE: There were 3 observations read from the data set WORK.HAVE.
 NOTE: The data set WORK.CLEAN has 3 observations and 3 variables.
 NOTE: DATA statement used (Total process time):
       real time           0.00 seconds
       cpu time            0.00 seconds
       

 

Full code:

data have;
input oldVar $20.;
cards;
A & B
Pears Inc.
Pears Corp
;;;;
run;

data clean;
set have;

array temp(8) $ _temporary_ ('INC', 'INC.', '&', 'CO', 'CO.', 'LTD.', 'LTD', 'CORP');

cleanVar = upcase(oldVar);

do i=1 to dim(temp);
    cleanVar = tranwrd(cleanVar, temp(i), "");
end;

run;

@maxjiang6999 wrote:

I tried this code but it didn't work out.

The log only said: numeric values have been converted to character values at he places give by XXX. Variable oldvar is uninitialized. 

There is no error. But in my result, I only see missing values.


 

maxjiang6999
Calcite | Level 5

Hi I re-tried the code and it did work out partially. What I did:

 

data test; set peers4;
array temp(8) $_temporary_ ('INC.','COMPANY','INC','GROUP','CORPORATION','&','LLC','LTD','THE', 'LTD.', 'HOLDINGS', 'CORP', 'PLC', 'CORP.','CO');
cleanvar = upcase(peername);
do i = 1 to dim(temp);
cleanvar = tranwrd(cleanvar, temp(i),'');
end;
run;

 

Then the log said: too many values for initialization of the array temp. Excess values are ignored.

Can you help? Really appreciate it!

Reeza
Super User

array temp(8)

 

The 8 here indicates the number of elements in the temporary array. You need to increase it to account for the new words you've added.

I don't know if that can be dynamic.

PGStats
Opal | Level 21

You could use regular expressions:

 

data have;
length name $32;
input name $32.;
datalines;
A & B
Pear Inc.
Pears Corp. 
;

data want;
length cleanName $32;
if not prxID then 
    prxID + prxparse("s/\b(INC|LTD|HOLDINGS|CORP|CORPORATION|LLC|THE)\b\.?//io");
set have;
cleanName = compbl(prxChange(prxID, -1, translate(name," ", "&")));
drop prxID;
run;

proc print data=want noobs; run;
                                clean
                                Name     name

                                A B      A & B
                                Pear     Pear Inc.
                                Pears    Pears Corp.
PG
maxjiang6999
Calcite | Level 5

What I get is still all missing values. 

PGStats
Opal | Level 21

Can't guess what's wrong without seeing what you tried. My example up there DOES work.

PG
maxjiang6999
Calcite | Level 5

I re-tried the code and it works! Thank you very much. But I do have a following question. In my dataset, I have some companies named as INC XXX Inc.. If I applied the code to those, I will only get XXX. Is there any way for me to just get rid of the Inc. at the end?

 

Thanks again!

PGStats
Opal | Level 21

There is always a way. But the code I submitted was based on your stated requirement, it doesn't do that.The function prxchange finds matching patterns in the target string from left to right, so I think that you can't get it to change only the last match.

PG
maxjiang6999
Calcite | Level 5

Sorry that I was not very clear about my question at the first place.  Let me explain it in another way.

What I have: the original company name

What I want: a new column (reduced_name) = company name exclude ('INC.','COMPANY','INC','GROUP','CORPORATION','&','LLC','LTD','THE', 'LTD.', 'HOLDINGS', 'CORP', 'PLC', 'CORP.')

What I did: 1) separate the company name using SCAN() by space and get var1-var16 for each element/word of the name 2) concatenate var1-var16 using CATX(' ', OF var1-var16) based on so many IF conditions (what you see is just part). (please see the screenshot below)

 

Is there any way that's more convenient to conclude my IF statements?

Thanks for all the help from you guys. 

 

 

 

 

SAS Innovate 2025: Register Now

Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!

Mastering the WHERE Clause in PROC SQL

SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 13 replies
  • 1969 views
  • 0 likes
  • 4 in conversation