BookmarkSubscribeRSS Feed
Alexxxxxxx
Pyrite | Level 9

Dear all,

 

I am trying to clean the company suffix. 

 

I expect to 1.check the strings from the back to the front, 2.convert the first L.T.D' or 'L. T. D.' to 'LTD'. 3.(if possible) not convert the 'LTD' when they are the first part of the strings.

for example

data HAVE;
input NAME &:$100.;
infile datalines  missover;
datalines;
L.T.D. HOUSE L.T.D.
L. T. D. HOUSE L.T.D.
L. T. D. HOUSE L. T. D.
L.T.D. HOUSE L. T. D.
L.T.D. HOUSE Co
L. T. D. HOUSE Co ; run;

I expect to get 

 

L.T.D. HOUSE LTD
L. T. D. HOUSE LTD
L. T. D. HOUSE LTD
L.T.D. HOUSE LTD
L.T.D. HOUSE
L. T. D. HOUSE

Could you please give me some suggestion about this?

 

Besides, Could you please explain the meaning of 'o'

rx=prxparse("s/(L\.T\.D\.)$|(L\. *T\. *D\.)$/LTD/o");

to me? and where I can find the explanation of all of this type of word? 

thanks in advance.

 

12 REPLIES 12
ScottBass
Rhodochrosite | Level 12

This sniffs of Perl Regular Expressions.

 

Your last example is unclear to me:  you ignore L.T.D. #1, change L.T.D. #2, ignore L.T.D #3, and remove "Co".

 

Regardless, the below seems to match your described problem, if not your "want" output.  This may get you started:

 

data HAVE;
input NAME &:$100.;
infile datalines  missover;
datalines;
L.T.D. HOUSE L.T.D.
L. T. D. HOUSE L.T.D.
L. T. D. HOUSE L. T. D.
L.T.D. HOUSE L. T. D.L.T.D. HOUSE CoL. T. D. HOUSE Co
;
run;

data want;
   rx=prxparse("s/(L\.T\.D\.)$|(L\. *T\. *D\.)$/LTD/o");
   set have;
   newname=prxchange(rx,-1,strip(name));
run;

The regular expression says:

 

1) Substitute matches (the "s")

2) Match on L.T.D.<end of line> or L.<any number of spaces>T.<any number of spaces>D.<end of line>.  Note that the period is a Perl metacharacter so I have escaped it via backslash.  If you don't want <any number of spaces> remove the asterisks.

3) So, this will only match L.T.D. or L. T. D. (plus additional spaces) when it occurs at the end of line.

4) Make sure you strip "name" since the data value in the PDV will be padded with spaces to the length of the variable (and thus not match <end of line>

 

Google Perl Regular Expressions or the SAS documentation for more details.

 

P.S.:  Thanks heaps for posting your problem using a working, self contained data step using datalines.  It made "my life" much easier!


Please post your question as a self-contained data step in the form of "have" (source) and "want" (desired results).
I won't contribute to your post if I can't cut-and-paste your syntactically correct code into SAS.
Alexxxxxxx
Pyrite | Level 9

Dear @ScottBass 

 

I really appreciate your advice and apologise for my mistake.

please allow me to explain my question in detail.

 

I expect to  1.check the strings from the back to the front, 2.convert the first L.T.D' or 'L. T. D.' ( which I find beginning from the back of strings) 'LTD'. 3.(if possible) not convert the 'LTD' when they are the first part of the strings.

for example,

 

data HAVE;
input NAME &:$100.;
infile datalines  missover;
datalines;
L.T.D. HOUSE L.T.D.
L. T. D. HOUSE L.T.D.
L. T. D. HOUSE L. T. D.
L.T.D. HOUSE L. T. D.
L.T.D. HOUSE Co
L. T. D. HOUSE Co
L. T. D. HOUSE L.T.D. Co 
L. T. D. HOUSE L. T. D. Co
;
run;

I expect to get

L.T.D. HOUSE LTD
L. T. D. HOUSE LTD
L. T. D. HOUSE LTD
L.T.D. HOUSE LTD

L.T.D. HOUSE Co /the 'L.T.D.' should not be converted as it is at the beginning of the strings*/

L. T. D. HOUSE Co /the 'L.T.D.' should not be converted as it is at the beginning of the strings*/

L. T. D. HOUSE LTD Co /* the first 'LTD' from the back to the front of strings*/

L. T. D. HOUSE LTD Co /* the first 'LTD' from the back to the front of strings*/

Could you give me some suggestion about this?

 

ChrisNZ
Tourmaline | Level 20

@Alexxxxxxx Have you even tried the code provided by @ScottBass ?

It should give you the expected result.

A slightly different alternative that may (or not) suit you better:

data WANT;
  RX=prxparse('s/L\. ?T\. ?D\.$/LTD/o');
  set HAVE;
  NEWNAME=prxchange(RX, 1, strip(NAME));
run;      

NEWNAME
L.T.D. HOUSE LTD
L. T. D. HOUSE LTD
L. T. D. HOUSE LTD
L.T.D. HOUSE L. T. D.L.T.D. HOUSE CoL. T. D. HOUSE Co

 

The regular expression reads:

1. Match this:  L then dot then optional space then T then dot then optional space then D then dot then end of string

2. Replace the match with LTD

Then a request to do this once on the stripped string

 

Lots of RegEx tutorials of the web. That's how I learnt.

ChrisNZ
Tourmaline | Level 20

Now for the second set of data

data HAVE;
input NAME &:$100.;
infile datalines  missover;
datalines;
L.T.D. HOUSE L.T.D.
L. T. D. HOUSE L.T.D.
L. T. D. HOUSE L. T. D.
L.T.D. HOUSE L. T. D.
L.T.D. HOUSE Co
L. T. D. HOUSE Co
L. T. D. HOUSE L.T.D. Co 
L. T. D. HOUSE L. T. D. Co
run;

I would either :

- Reverse the string (using the function with that name) and reverse the regular expression (i.e. match DTL, not LTD)

or probably easier 

- Ignore the first character of the string (remove it using function substr() and then add it back after the RegEx substitution is applied)

ChrisNZ
Tourmaline | Level 20

This works too if you only have 2 LTD groups (otherwise use reverse() as mentioned):

data HAVE;
input NAME &:$100.;
infile datalines  missover;
datalines;
L.T.D. HOUSE L.T.D.
L. T. D. HOUSE L.T.D.
L. T. D. HOUSE L. T. D.
L.T.D. HOUSE L. T. D.
L.T.D. HOUSE Co
L. T. D. HOUSE Co
L. T. D. HOUSE L.T.D. Co 
L. T. D. HOUSE L. T. D. Co
run;
data WANT;
  RX=prxparse('s/(.+)(L\. ?T\. ?D\.)(.*)/$1LTD$3/o');
  set HAVE;
  NEWNAME=prxchange(RX, 1, strip(NAME));
run;         

NAME NEWNAME
L.T.D. HOUSE L.T.D. L.T.D. HOUSE LTD
L. T. D. HOUSE L.T.D. L. T. D. HOUSE LTD
L. T. D. HOUSE L. T. D. L. T. D. HOUSE LTD
L.T.D. HOUSE L. T. D. L.T.D. HOUSE LTD
L.T.D. HOUSE Co L.T.D. HOUSE Co
L. T. D. HOUSE Co L. T. D. HOUSE Co
L. T. D. HOUSE L.T.D. Co L. T. D. HOUSE LTD Co
L. T. D. HOUSE L. T. D. Co L. T. D. HOUSE LTD Co

 

 

 

Alexxxxxxx
Pyrite | Level 9

Dear @ChrisNZ,

 

thanks for your kind help. I review my method and data in the past few days, the result may not clear because of my old logic. So, I update the method and divide it into two steps, the first is to process the company suffix if they are at the ending of strings. (which is most common).

 

for example,

data HAVE;
input NAME &:$150.;
infile datalines missover;
datalines;
ALARKO GAYRIMENKUL YATIRIM ORTAKLIGI A.S 
HYDRO ALUMINIUM A. S. 
HYDRO ALUMINIUM A.S. 
HYDRO ALUMINIUM A.S 
;
run;

I can use the following code to unify them to 'AS'.

data want;
set have;
NAME_new=prxchange("s/(A\.S\.)$|(A\. *S\.)$|(A\.S)$/ AS /o",-1,strip(NAME));
run;

However, I worry about the coincidence that the 'AS' is not a company suffix but a part of a company name. By using the data HAVE as an example, 

data HAVE2;
input NAME &:$150.;
infile datalines missover;
datalines;
ALARKO GAYRIMENKUL YATIRIM ORTAKLIGI X.A.S 
HYDRO ALUMINIUM X. A. S. 
HYDRO ALUMINIUM X.A.S. 
HYDRO ALUMINIUM X.A.S 
;
run;

the strings 'X.A.S' or 'X. A. S.' which writen in the same format, are assumed to be a word, so they should not proceed. however, I get the following result by code,

NAMENAME6
ALARKO GAYRIMENKUL YATIRIM ORTAKLIGI X.A.SALARKO GAYRIMENKUL YATIRIM ORTAKLIGI X. AS
HYDRO ALUMINIUM X. A. S.HYDRO ALUMINIUM X. AS
HYDRO ALUMINIUM X.A.S.HYDRO ALUMINIUM X. AS
HYDRO ALUMINIUM X.A.SHYDRO ALUMINIUM X. AS

How can I fix this problem? Could you please give me some suggestions about this? THX  in advance.

ScottBass
Rhodochrosite | Level 12

https://www.google.com/search?q=online+regex+builder

 

Try a few, decide which one works best for you.


Please post your question as a self-contained data step in the form of "have" (source) and "want" (desired results).
I won't contribute to your post if I can't cut-and-paste your syntactically correct code into SAS.
Alexxxxxxx
Pyrite | Level 9

hello @ScottBass,

 

thanks for your recommendation.

 

I have started to use this and tried a lot of times, but not work. Could you please give me some suggestions about the code?

ChrisNZ
Tourmaline | Level 20

From what you are now showing, you have to look at the preceding characters (and ensure they are not dot or dot-space).

This is done by using a negative look-behind assertion. The RegEx syntax is (?<! )

Alexxxxxxx
Pyrite | Level 9

Hello @ChrisNZ 

 

do you meaning 

 

data want;
set have;
NAME_new=prxchange("s/(?<! )(A\.S\.)$|(?<! )(A\. *S\.)$|(?<! )(A\.S)$/ AS /o",-1,strip(NAME));
run;

?

 

But it does not work. 

 

ChrisNZ
Tourmaline | Level 20

> But it does not work.

@Alexxxxxxx You have been supplied with many examples of regular expressions -and their explanation-  as you kept updating the requirements for your data in successive posts. It is now time that you learn how to write your own.

Search the web for negative look-behind assertion to obtain countless examples and tutorials. The syntax you provide does not do what you need. 

In my opinion, the step even before that would be to define exactly what you need.

ScottBass
Rhodochrosite | Level 12

I'm really busy with "stuff", so this will be my last reply.  If this doesn't meet your needs, then Google and/or the SAS documentation are your friends.  

 

I can only go by your sample data.  If it is not representative of your actual and complete data, then this may not work.

 

In your sample data, you always want to convert some aberration of "L.T.D." to "LTD", after the text "HOUSE".  If that is representative of your complete data, then this will work.  It's not the most robust; if you need a more robust approach, see above, or keep posting and perhaps someone else will spoon feed you the complete solution.  

 

data HAVE;
input NAME &:$100.;
infile datalines  missover;
datalines;
L.T.D. HOUSE L.T.D.
L. T. D. HOUSE L.T.D.
L. T. D. HOUSE L. T. D.
L.T.D. HOUSE L. T. D.
L.T.D. HOUSE Co
L. T. D. HOUSE Co
L. T. D. HOUSE L.T.D. Co 
L. T. D. HOUSE L. T. D. Co
;
run;

data want;
   rx=prxparse("s/(HOUSE) *(L\. *T\. *D\.)(.*)$/\1 LTD\3/o");
   set have;
   newname=prxchange(rx,-1,strip(name));
run;

Please post your question as a self-contained data step in the form of "have" (source) and "want" (desired results).
I won't contribute to your post if I can't cut-and-paste your syntactically correct code into SAS.

hackathon24-white-horiz.png

2025 SAS Hackathon: There is still time!

Good news: We've extended SAS Hackathon registration until Sept. 12, so you still have time to be part of our biggest event yet – our five-year anniversary!

Register Now

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 12 replies
  • 2618 views
  • 7 likes
  • 3 in conversation