Solved: Re: How to Concatenate multiple Colns to one var/col excluding unneces...

maxjiang6999 · Posted 07-28-2019 08:26 PM

First I split the original company names by space and get var1-var16

data that have:

company names var1 var2 var3 .... var16

A & B A & B

Pear Inc. Pear Inc.

Pears Corp. Pears Corp.

the ideal output:

company names var1 var2 var3 .... var16 reduced_name

A & F A & F A F

Pear Inc. Pear Inc. Pear

Pears Corp. Pears Corp. Pears

The format of the company names is Char 30. (so there're different trailing blanks). So when I directly cat() or cats() them, I don't get what I want.

Can anyone help?

PGStats · Posted 07-29-2019 12:20 AM

You could use regular expressions:

data have;
length name $32;
input name $32.;
datalines;
A & B
Pear Inc.
Pears Corp. 
;

data want;
length cleanName $32;
if not prxID then 
    prxID + prxparse("s/\b(INC|LTD|HOLDINGS|CORP|CORPORATION|LLC|THE)\b\.?//io");
set have;
cleanName = compbl(prxChange(prxID, -1, translate(name," ", "&")));
drop prxID;
run;

proc print data=want noobs; run;

                                clean
                                Name     name

                                A B      A & B
                                Pear     Pear Inc.
                                Pears    Pears Corp.

PG

View solution in original post

Astounding · Posted 07-28-2019 10:00 PM

You don't show any of the rules for calculating the REDUCED_NAME, so I will leave that part to you.

The function you should be using is CATX, to leave a single blank between the pieces of the reduced name. For example:

reduced_name = catx(' ', var1, var3);

maxjiang6999 · Posted 07-28-2019 10:34 PM

Thanks for the quick reply. What I want to do for reduced_name is to exclude words like: "LTD","HOLDINGS","CORP","CORP.","CORPORATION","&","LLC","THE". I want to remove those "unimportant" words. Can you help on this?

Reeza · Posted 07-28-2019 10:18 PM

What about using TRANWRD() instead to replace the words you don't want?

If all the terms are in a separate file you can load the temporary array from the file.

data clean;
set have;

array temp(7) $ _temporary_ ('INC', 'INC.', '&', 'CO', 'CO.', 'LTD.', 'LTD');

cleanVar = oldVar;

do i=1 to dim(temp);
cleanVar = tranwrd(cleanVar, temp(i), "");
end;

run;

maxjiang6999 · Posted 07-28-2019 10:36 PM

I tried this code but it didn't work out.

The log only said: numeric values have been converted to character values at he places give by XXX. Variable oldvar is uninitialized.

There is no error. But in my result, I only see missing values.

Reeza · Posted 07-28-2019 10:52 PM

@maxjiang6999 wrote:

I tried this code but it didn't work out.

The log only said: numeric values have been converted to character values at he places give by XXX. Variable oldvar is uninitialized.

There is no error. But in my result, I only see missing values.

Did you change the variable names to reflect your variable names? 'didn't work out' isn't very informative, it doesn't convey any information about what you did, what happened and what the errors were. Including the code you ran and log is much more helpful.

I tested it and it works partly for me, this was intended to get you started, by showing the process. You'll need to determine the rules and details because cleaning data is a finicky task and tedious. The part not working is the & and I did change it to make everything uppercase because INC does not match inc. You should get the new variable as an uppercase variable regardless.

 78         data clean;
 79         set have;
 80         
 81         array temp(8) $ _temporary_ ('INC', 'INC.', '&', 'CO', 'CO.', 'LTD.', 'LTD', 'CORP');
 82         
 83         cleanVar = upcase(oldVar);
 84         
 85         do i=1 to dim(temp);
 86         cleanVar = tranwrd(cleanVar, temp(i), "");
 87         end;
 88         
 89         run;
 
 NOTE: There were 3 observations read from the data set WORK.HAVE.
 NOTE: The data set WORK.CLEAN has 3 observations and 3 variables.
 NOTE: DATA statement used (Total process time):
       real time           0.00 seconds
       cpu time            0.00 seconds

Full code:

data have;
input oldVar $20.;
cards;
A & B
Pears Inc.
Pears Corp
;;;;
run;

data clean;
set have;

array temp(8) $ _temporary_ ('INC', 'INC.', '&', 'CO', 'CO.', 'LTD.', 'LTD', 'CORP');

cleanVar = upcase(oldVar);

do i=1 to dim(temp);
    cleanVar = tranwrd(cleanVar, temp(i), "");
end;

run;

@maxjiang6999 wrote:

I tried this code but it didn't work out.

The log only said: numeric values have been converted to character values at he places give by XXX. Variable oldvar is uninitialized.

There is no error. But in my result, I only see missing values.

maxjiang6999 · Posted 07-29-2019 01:30 PM

Hi I re-tried the code and it did work out partially. What I did:

data test; set peers4;
array temp(8) $_temporary_ ('INC.','COMPANY','INC','GROUP','CORPORATION','&','LLC','LTD','THE', 'LTD.', 'HOLDINGS', 'CORP', 'PLC', 'CORP.','CO');
cleanvar = upcase(peername);
do i = 1 to dim(temp);
cleanvar = tranwrd(cleanvar, temp(i),'');
end;
run;

Then the log said: too many values for initialization of the array temp. Excess values are ignored.

Can you help? Really appreciate it!

Reeza · Posted 07-29-2019 03:40 PM

array temp(8)

The 8 here indicates the number of elements in the temporary array. You need to increase it to account for the new words you've added.

I don't know if that can be dynamic.

PGStats · Posted 07-29-2019 12:20 AM

You could use regular expressions:

data have;
length name $32;
input name $32.;
datalines;
A & B
Pear Inc.
Pears Corp. 
;

data want;
length cleanName $32;
if not prxID then 
    prxID + prxparse("s/\b(INC|LTD|HOLDINGS|CORP|CORPORATION|LLC|THE)\b\.?//io");
set have;
cleanName = compbl(prxChange(prxID, -1, translate(name," ", "&")));
drop prxID;
run;

proc print data=want noobs; run;

                                clean
                                Name     name

                                A B      A & B
                                Pear     Pear Inc.
                                Pears    Pears Corp.

PG

maxjiang6999 · Posted 07-29-2019 01:12 PM

What I get is still all missing values.

PGStats · Posted 07-29-2019 01:17 PM

Can't guess what's wrong without seeing what you tried. My example up there DOES work.

PG

maxjiang6999 · Posted 07-29-2019 02:03 PM

I re-tried the code and it works! Thank you very much. But I do have a following question. In my dataset, I have some companies named as INC XXX Inc.. If I applied the code to those, I will only get XXX. Is there any way for me to just get rid of the Inc. at the end?

Thanks again!

PGStats · Posted 07-29-2019 03:04 PM

There is always a way. But the code I submitted was based on your stated requirement, it doesn't do that.The function prxchange finds matching patterns in the target string from left to right, so I think that you can't get it to change only the last match.

PG

maxjiang6999 · Posted 07-29-2019 01:12 PM

Sorry that I was not very clear about my question at the first place. Let me explain it in another way.

What I have: the original company name

What I want: a new column (reduced_name) = company name exclude ('INC.','COMPANY','INC','GROUP','CORPORATION','&','LLC','LTD','THE', 'LTD.', 'HOLDINGS', 'CORP', 'PLC', 'CORP.')

What I did: 1) separate the company name using SCAN() by space and get var1-var16 for each element/word of the name 2) concatenate var1-var16 using CATX(' ', OF var1-var16) based on so many IF conditions (what you see is just part). (please see the screenshot below)

Is there any way that's more convenient to conclude my IF statements?

Thanks for all the help from you guys.

How to Concatenate multiple Colns to one var/col excluding unnecessary words

Re: How to Concatenate multiple Colns to one var/col excluding unnecessary words

Re: How to Concatenate multiple Colns to one var/col excluding unnecessary words

Re: How to Concatenate multiple Colns to one var/col excluding unnecessary words

Re: How to Concatenate multiple Colns to one var/col excluding unnecessary words

Re: How to Concatenate multiple Colns to one var/col excluding unnecessary words

Re: How to Concatenate multiple Colns to one var/col excluding unnecessary words

Re: How to Concatenate multiple Colns to one var/col excluding unnecessary words

Re: How to Concatenate multiple Colns to one var/col excluding unnecessary words

Re: How to Concatenate multiple Colns to one var/col excluding unnecessary words

Re: How to Concatenate multiple Colns to one var/col excluding unnecessary words

Re: How to Concatenate multiple Colns to one var/col excluding unnecessary words

Re: How to Concatenate multiple Colns to one var/col excluding unnecessary words

Re: How to Concatenate multiple Colns to one var/col excluding unnecessary words

Re: How to Concatenate multiple Colns to one var/col excluding unnecessary words