clean the company suffix

Alexxxxxxx · Posted 04-09-2019 08:20 PM

Dear all,

I am trying to clean the company suffix.

I expect to 1.check the strings from the back to the front, 2.convert the first L.T.D' or 'L. T. D.' to 'LTD'. 3.(if possible) not convert the 'LTD' when they are the first part of the strings.

for example

data HAVE;
input NAME &:$100.;
infile datalines  missover;
datalines;
L.T.D. HOUSE L.T.D.
L. T. D. HOUSE L.T.D.
L. T. D. HOUSE L. T. D.
L.T.D. HOUSE L. T. D.
L.T.D. HOUSE Co
L. T. D. HOUSE Co
;
run;

I expect to get

L.T.D. HOUSE LTD
L. T. D. HOUSE LTD
L. T. D. HOUSE LTD
L.T.D. HOUSE LTD
L.T.D. HOUSE 
L. T. D. HOUSE

Could you please give me some suggestion about this?

Besides, Could you please explain the meaning of 'o'

rx=prxparse("s/(L\.T\.D\.)$|(L\. *T\. *D\.)$/LTD/o");

to me? and where I can find the explanation of all of this type of word?

thanks in advance.

ScottBass · Posted 04-10-2019 12:59 AM

This sniffs of Perl Regular Expressions.

Your last example is unclear to me: you ignore L.T.D. #1, change L.T.D. #2, ignore L.T.D #3, and remove "Co".

Regardless, the below seems to match your described problem, if not your "want" output. This may get you started:

data HAVE;
input NAME &:$100.;
infile datalines  missover;
datalines;
L.T.D. HOUSE L.T.D.
L. T. D. HOUSE L.T.D.
L. T. D. HOUSE L. T. D.
L.T.D. HOUSE L. T. D.L.T.D. HOUSE CoL. T. D. HOUSE Co
;
run;

data want;
   rx=prxparse("s/(L\.T\.D\.)$|(L\. *T\. *D\.)$/LTD/o");
   set have;
   newname=prxchange(rx,-1,strip(name));
run;

The regular expression says:

1) Substitute matches (the "s")

2) Match on L.T.D.<end of line> or L.<any number of spaces>T.<any number of spaces>D.<end of line>. Note that the period is a Perl metacharacter so I have escaped it via backslash. If you don't want <any number of spaces> remove the asterisks.

3) So, this will only match L.T.D. or L. T. D. (plus additional spaces) when it occurs at the end of line.

4) Make sure you strip "name" since the data value in the PDV will be padded with spaces to the length of the variable (and thus not match <end of line>

Google Perl Regular Expressions or the SAS documentation for more details.

P.S.: Thanks heaps for posting your problem using a working, self contained data step using datalines. It made "my life" much easier!

Please post your question as a self-contained data step in the form of "have" (source) and "want" (desired results).
I won't contribute to your post if I can't cut-and-paste your syntactically correct code into SAS.

Alexxxxxxx · Posted 04-10-2019 11:46 AM

Dear @ScottBass

I really appreciate your advice and apologise for my mistake.

please allow me to explain my question in detail.

I expect to 1.check the strings from the back to the front, 2.convert the first L.T.D' or 'L. T. D.' ( which I find beginning from the back of strings) 'LTD'. 3.(if possible) not convert the 'LTD' when they are the first part of the strings.

for example,

data HAVE;
input NAME &:$100.;
infile datalines  missover;
datalines;
L.T.D. HOUSE L.T.D.
L. T. D. HOUSE L.T.D.
L. T. D. HOUSE L. T. D.
L.T.D. HOUSE L. T. D.
L.T.D. HOUSE Co
L. T. D. HOUSE Co
L. T. D. HOUSE L.T.D. Co 
L. T. D. HOUSE L. T. D. Co
;
run;

I expect to get

L.T.D. HOUSE LTD
L. T. D. HOUSE LTD
L. T. D. HOUSE LTD
L.T.D. HOUSE LTD

L.T.D. HOUSE Co /the 'L.T.D.' should not be converted as it is at the beginning of the strings*/

L. T. D. HOUSE Co /the 'L.T.D.' should not be converted as it is at the beginning of the strings*/

L. T. D. HOUSE LTD Co /* the first 'LTD' from the back to the front of strings*/

L. T. D. HOUSE LTD Co /* the first 'LTD' from the back to the front of strings*/

Could you give me some suggestion about this?

ChrisNZ · Posted 04-10-2019 05:46 PM

@Alexxxxxxx Have you even tried the code provided by @ScottBass ?

It should give you the expected result.

A slightly different alternative that may (or not) suit you better:

data WANT;
  RX=prxparse('s/L\. ?T\. ?D\.$/LTD/o');
  set HAVE;
  NEWNAME=prxchange(RX, 1, strip(NAME));
run;

NEWNAME
L.T.D. HOUSE LTD
L. T. D. HOUSE LTD
L. T. D. HOUSE LTD
L.T.D. HOUSE L. T. D.L.T.D. HOUSE CoL. T. D. HOUSE Co

The regular expression reads:

1. Match this: L then dot then optional space then T then dot then optional space then D then dot then end of string

2. Replace the match with LTD

Then a request to do this once on the stripped string

Lots of RegEx tutorials of the web. That's how I learnt.

High-Performance SAS Coding - Third Edition

ChrisNZ · Posted 04-10-2019 05:55 PM

Now for the second set of data

data HAVE;
input NAME &:$100.;
infile datalines  missover;
datalines;
L.T.D. HOUSE L.T.D.
L. T. D. HOUSE L.T.D.
L. T. D. HOUSE L. T. D.
L.T.D. HOUSE L. T. D.
L.T.D. HOUSE Co
L. T. D. HOUSE Co
L. T. D. HOUSE L.T.D. Co 
L. T. D. HOUSE L. T. D. Co
run;

I would either :

- Reverse the string (using the function with that name) and reverse the regular expression (i.e. match DTL, not LTD)

or probably easier

- Ignore the first character of the string (remove it using function substr() and then add it back after the RegEx substitution is applied)

High-Performance SAS Coding - Third Edition

ChrisNZ · Posted 04-17-2019 05:52 PM

This works too if you only have 2 LTD groups (otherwise use reverse() as mentioned):

data HAVE;
input NAME &:$100.;
infile datalines  missover;
datalines;
L.T.D. HOUSE L.T.D.
L. T. D. HOUSE L.T.D.
L. T. D. HOUSE L. T. D.
L.T.D. HOUSE L. T. D.
L.T.D. HOUSE Co
L. T. D. HOUSE Co
L. T. D. HOUSE L.T.D. Co 
L. T. D. HOUSE L. T. D. Co
run;
data WANT;
  RX=prxparse('s/(.+)(L\. ?T\. ?D\.)(.*)/$1LTD$3/o');
  set HAVE;
  NEWNAME=prxchange(RX, 1, strip(NAME));
run;

NAME	NEWNAME
L.T.D. HOUSE L.T.D.	L.T.D. HOUSE LTD
L. T. D. HOUSE L.T.D.	L. T. D. HOUSE LTD
L. T. D. HOUSE L. T. D.	L. T. D. HOUSE LTD
L.T.D. HOUSE L. T. D.	L.T.D. HOUSE LTD
L.T.D. HOUSE Co	L.T.D. HOUSE Co
L. T. D. HOUSE Co	L. T. D. HOUSE Co
L. T. D. HOUSE L.T.D. Co	L. T. D. HOUSE LTD Co
L. T. D. HOUSE L. T. D. Co	L. T. D. HOUSE LTD Co

High-Performance SAS Coding - Third Edition

Alexxxxxxx · Posted 04-17-2019 07:48 PM

Dear @ChrisNZ,

thanks for your kind help. I review my method and data in the past few days, the result may not clear because of my old logic. So, I update the method and divide it into two steps, the first is to process the company suffix if they are at the ending of strings. (which is most common).

for example,

data HAVE;
input NAME &:$150.;
infile datalines missover;
datalines;
ALARKO GAYRIMENKUL YATIRIM ORTAKLIGI A.S 
HYDRO ALUMINIUM A. S. 
HYDRO ALUMINIUM A.S. 
HYDRO ALUMINIUM A.S 
;
run;

I can use the following code to unify them to 'AS'.

data want;
set have;
NAME_new=prxchange("s/(A\.S\.)$|(A\. *S\.)$|(A\.S)$/ AS /o",-1,strip(NAME));
run;

However, I worry about the coincidence that the 'AS' is not a company suffix but a part of a company name. By using the data HAVE as an example,

data HAVE2;
input NAME &:$150.;
infile datalines missover;
datalines;
ALARKO GAYRIMENKUL YATIRIM ORTAKLIGI X.A.S 
HYDRO ALUMINIUM X. A. S. 
HYDRO ALUMINIUM X.A.S. 
HYDRO ALUMINIUM X.A.S 
;
run;

the strings 'X.A.S' or 'X. A. S.' which writen in the same format, are assumed to be a word, so they should not proceed. however, I get the following result by code,

NAME	NAME6
ALARKO GAYRIMENKUL YATIRIM ORTAKLIGI X.A.S	ALARKO GAYRIMENKUL YATIRIM ORTAKLIGI X. AS
HYDRO ALUMINIUM X. A. S.	HYDRO ALUMINIUM X. AS
HYDRO ALUMINIUM X.A.S.	HYDRO ALUMINIUM X. AS
HYDRO ALUMINIUM X.A.S	HYDRO ALUMINIUM X. AS

How can I fix this problem? Could you please give me some suggestions about this? THX in advance.

ScottBass · Posted 04-17-2019 10:06 PM

https://www.google.com/search?q=online+regex+builder

Try a few, decide which one works best for you.

Please post your question as a self-contained data step in the form of "have" (source) and "want" (desired results).
I won't contribute to your post if I can't cut-and-paste your syntactically correct code into SAS.

Alexxxxxxx · Posted 04-18-2019 01:08 PM

hello @ScottBass,

thanks for your recommendation.

I have started to use this and tried a lot of times, but not work. Could you please give me some suggestions about the code?

ChrisNZ · Posted 04-18-2019 12:04 AM

From what you are now showing, you have to look at the preceding characters (and ensure they are not dot or dot-space).

This is done by using a negative look-behind assertion. The RegEx syntax is (?<! )

High-Performance SAS Coding - Third Edition

Alexxxxxxx · Posted 04-18-2019 01:07 PM

Hello @ChrisNZ

do you meaning

data want;
set have;
NAME_new=prxchange("s/(?<! )(A\.S\.)$|(?<! )(A\. *S\.)$|(?<! )(A\.S)$/ AS /o",-1,strip(NAME));
run;

?

But it does not work.

ChrisNZ · Posted 04-20-2019 12:03 AM

> But it does not work.

@Alexxxxxxx You have been supplied with many examples of regular expressions -and their explanation- as you kept updating the requirements for your data in successive posts. It is now time that you learn how to write your own.

Search the web for negative look-behind assertion to obtain countless examples and tutorials. The syntax you provide does not do what you need.

In my opinion, the step even before that would be to define exactly what you need.

High-Performance SAS Coding - Third Edition

ScottBass · Posted 04-11-2019 12:06 AM

I'm really busy with "stuff", so this will be my last reply. If this doesn't meet your needs, then Google and/or the SAS documentation are your friends.

I can only go by your sample data. If it is not representative of your actual and complete data, then this may not work.

In your sample data, you always want to convert some aberration of "L.T.D." to "LTD", after the text "HOUSE". If that is representative of your complete data, then this will work. It's not the most robust; if you need a more robust approach, see above, or keep posting and perhaps someone else will spoon feed you the complete solution.

data HAVE;
input NAME &:$100.;
infile datalines  missover;
datalines;
L.T.D. HOUSE L.T.D.
L. T. D. HOUSE L.T.D.
L. T. D. HOUSE L. T. D.
L.T.D. HOUSE L. T. D.
L.T.D. HOUSE Co
L. T. D. HOUSE Co
L. T. D. HOUSE L.T.D. Co 
L. T. D. HOUSE L. T. D. Co
;
run;

data want;
   rx=prxparse("s/(HOUSE) *(L\. *T\. *D\.)(.*)$/\1 LTD\3/o");
   set have;
   newname=prxchange(rx,-1,strip(name));
run;

Please post your question as a self-contained data step in the form of "have" (source) and "want" (desired results).
I won't contribute to your post if I can't cut-and-paste your syntactically correct code into SAS.

clean the company suffix

Re: clean the company suffix

Re: clean the company suffix

Re: clean the company suffix

Re: clean the company suffix

Re: clean the company suffix

Re: clean the company suffix

Re: clean the company suffix

Re: clean the company suffix

Re: clean the company suffix

Re: clean the company suffix

Re: clean the company suffix

Re: clean the company suffix

Registration is open

Registration is open

SAS Training: Just a Click Away