First I split the original company names by space and get var1-var16
data that have:
company names var1 var2 var3 .... var16
A & B A & B
Pear Inc. Pear Inc.
Pears Corp. Pears Corp.
the ideal output:
company names var1 var2 var3 .... var16 reduced_name
A & F A & F A F
Pear Inc. Pear Inc. Pear
Pears Corp. Pears Corp. Pears
The format of the company names is Char 30. (so there're different trailing blanks). So when I directly cat() or cats() them, I don't get what I want.
Can anyone help?
You could use regular expressions:
data have;
length name $32;
input name $32.;
datalines;
A & B
Pear Inc.
Pears Corp.
;
data want;
length cleanName $32;
if not prxID then
prxID + prxparse("s/\b(INC|LTD|HOLDINGS|CORP|CORPORATION|LLC|THE)\b\.?//io");
set have;
cleanName = compbl(prxChange(prxID, -1, translate(name," ", "&")));
drop prxID;
run;
proc print data=want noobs; run;
clean Name name A B A & B Pear Pear Inc. Pears Pears Corp.
You don't show any of the rules for calculating the REDUCED_NAME, so I will leave that part to you.
The function you should be using is CATX, to leave a single blank between the pieces of the reduced name. For example:
reduced_name = catx(' ', var1, var3);
Thanks for the quick reply. What I want to do for reduced_name is to exclude words like: "LTD","HOLDINGS","CORP","CORP.","CORPORATION","&","LLC","THE". I want to remove those "unimportant" words. Can you help on this?
What about using TRANWRD() instead to replace the words you don't want?
If all the terms are in a separate file you can load the temporary array from the file.
data clean;
set have;
array temp(7) $ _temporary_ ('INC', 'INC.', '&', 'CO', 'CO.', 'LTD.', 'LTD');
cleanVar = oldVar;
do i=1 to dim(temp);
cleanVar = tranwrd(cleanVar, temp(i), "");
end;
run;
I tried this code but it didn't work out.
The log only said: numeric values have been converted to character values at he places give by XXX. Variable oldvar is uninitialized.
There is no error. But in my result, I only see missing values.
@maxjiang6999 wrote:
I tried this code but it didn't work out.
The log only said: numeric values have been converted to character values at he places give by XXX. Variable oldvar is uninitialized.
There is no error. But in my result, I only see missing values.
Did you change the variable names to reflect your variable names? 'didn't work out' isn't very informative, it doesn't convey any information about what you did, what happened and what the errors were. Including the code you ran and log is much more helpful.
I tested it and it works partly for me, this was intended to get you started, by showing the process. You'll need to determine the rules and details because cleaning data is a finicky task and tedious. The part not working is the & and I did change it to make everything uppercase because INC does not match inc. You should get the new variable as an uppercase variable regardless.
78 data clean; 79 set have; 80 81 array temp(8) $ _temporary_ ('INC', 'INC.', '&', 'CO', 'CO.', 'LTD.', 'LTD', 'CORP'); 82 83 cleanVar = upcase(oldVar); 84 85 do i=1 to dim(temp); 86 cleanVar = tranwrd(cleanVar, temp(i), ""); 87 end; 88 89 run; NOTE: There were 3 observations read from the data set WORK.HAVE. NOTE: The data set WORK.CLEAN has 3 observations and 3 variables. NOTE: DATA statement used (Total process time): real time 0.00 seconds cpu time 0.00 seconds
Full code:
data have;
input oldVar $20.;
cards;
A & B
Pears Inc.
Pears Corp
;;;;
run;
data clean;
set have;
array temp(8) $ _temporary_ ('INC', 'INC.', '&', 'CO', 'CO.', 'LTD.', 'LTD', 'CORP');
cleanVar = upcase(oldVar);
do i=1 to dim(temp);
cleanVar = tranwrd(cleanVar, temp(i), "");
end;
run;
@maxjiang6999 wrote:
I tried this code but it didn't work out.
The log only said: numeric values have been converted to character values at he places give by XXX. Variable oldvar is uninitialized.
There is no error. But in my result, I only see missing values.
Hi I re-tried the code and it did work out partially. What I did:
data test; set peers4;
array temp(8) $_temporary_ ('INC.','COMPANY','INC','GROUP','CORPORATION','&','LLC','LTD','THE', 'LTD.', 'HOLDINGS', 'CORP', 'PLC', 'CORP.','CO');
cleanvar = upcase(peername);
do i = 1 to dim(temp);
cleanvar = tranwrd(cleanvar, temp(i),'');
end;
run;
Then the log said: too many values for initialization of the array temp. Excess values are ignored.
Can you help? Really appreciate it!
array temp(8)
The 8 here indicates the number of elements in the temporary array. You need to increase it to account for the new words you've added.
I don't know if that can be dynamic.
You could use regular expressions:
data have;
length name $32;
input name $32.;
datalines;
A & B
Pear Inc.
Pears Corp.
;
data want;
length cleanName $32;
if not prxID then
prxID + prxparse("s/\b(INC|LTD|HOLDINGS|CORP|CORPORATION|LLC|THE)\b\.?//io");
set have;
cleanName = compbl(prxChange(prxID, -1, translate(name," ", "&")));
drop prxID;
run;
proc print data=want noobs; run;
clean Name name A B A & B Pear Pear Inc. Pears Pears Corp.
What I get is still all missing values.
Can't guess what's wrong without seeing what you tried. My example up there DOES work.
I re-tried the code and it works! Thank you very much. But I do have a following question. In my dataset, I have some companies named as INC XXX Inc.. If I applied the code to those, I will only get XXX. Is there any way for me to just get rid of the Inc. at the end?
Thanks again!
There is always a way. But the code I submitted was based on your stated requirement, it doesn't do that.The function prxchange finds matching patterns in the target string from left to right, so I think that you can't get it to change only the last match.
Sorry that I was not very clear about my question at the first place. Let me explain it in another way.
What I have: the original company name
What I want: a new column (reduced_name) = company name exclude ('INC.','COMPANY','INC','GROUP','CORPORATION','&','LLC','LTD','THE', 'LTD.', 'HOLDINGS', 'CORP', 'PLC', 'CORP.')
What I did: 1) separate the company name using SCAN() by space and get var1-var16 for each element/word of the name 2) concatenate var1-var16 using CATX(' ', OF var1-var16) based on so many IF conditions (what you see is just part). (please see the screenshot below)
Is there any way that's more convenient to conclude my IF statements?
Thanks for all the help from you guys.
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.
Find more tutorials on the SAS Users YouTube channel.