Dear all,
I am trying to clean the company suffix.
I expect to 1.check the strings from the back to the front, 2.convert the first L.T.D' or 'L. T. D.' to 'LTD'. 3.(if possible) not convert the 'LTD' when they are the first part of the strings.
for example
data HAVE;
input NAME &:$100.;
infile datalines missover;
datalines;
L.T.D. HOUSE L.T.D.
L. T. D. HOUSE L.T.D.
L. T. D. HOUSE L. T. D.
L.T.D. HOUSE L. T. D.
L.T.D. HOUSE Co
L. T. D. HOUSE Co
;
run;
I expect to get
L.T.D. HOUSE LTD
L. T. D. HOUSE LTD
L. T. D. HOUSE LTD
L.T.D. HOUSE LTD
L.T.D. HOUSE
L. T. D. HOUSE
Could you please give me some suggestion about this?
Besides, Could you please explain the meaning of 'o'
rx=prxparse("s/(L\.T\.D\.)$|(L\. *T\. *D\.)$/LTD/o");
to me? and where I can find the explanation of all of this type of word?
thanks in advance.
This sniffs of Perl Regular Expressions.
Your last example is unclear to me: you ignore L.T.D. #1, change L.T.D. #2, ignore L.T.D #3, and remove "Co".
Regardless, the below seems to match your described problem, if not your "want" output. This may get you started:
data HAVE;
input NAME &:$100.;
infile datalines missover;
datalines;
L.T.D. HOUSE L.T.D.
L. T. D. HOUSE L.T.D.
L. T. D. HOUSE L. T. D.
L.T.D. HOUSE L. T. D.L.T.D. HOUSE CoL. T. D. HOUSE Co
;
run;
data want;
rx=prxparse("s/(L\.T\.D\.)$|(L\. *T\. *D\.)$/LTD/o");
set have;
newname=prxchange(rx,-1,strip(name));
run;
The regular expression says:
1) Substitute matches (the "s")
2) Match on L.T.D.<end of line> or L.<any number of spaces>T.<any number of spaces>D.<end of line>. Note that the period is a Perl metacharacter so I have escaped it via backslash. If you don't want <any number of spaces> remove the asterisks.
3) So, this will only match L.T.D. or L. T. D. (plus additional spaces) when it occurs at the end of line.
4) Make sure you strip "name" since the data value in the PDV will be padded with spaces to the length of the variable (and thus not match <end of line>
Google Perl Regular Expressions or the SAS documentation for more details.
P.S.: Thanks heaps for posting your problem using a working, self contained data step using datalines. It made "my life" much easier!
Dear @ScottBass
I really appreciate your advice and apologise for my mistake.
please allow me to explain my question in detail.
I expect to 1.check the strings from the back to the front, 2.convert the first L.T.D' or 'L. T. D.' ( which I find beginning from the back of strings) 'LTD'. 3.(if possible) not convert the 'LTD' when they are the first part of the strings.
for example,
data HAVE;
input NAME &:$100.;
infile datalines missover;
datalines;
L.T.D. HOUSE L.T.D.
L. T. D. HOUSE L.T.D.
L. T. D. HOUSE L. T. D.
L.T.D. HOUSE L. T. D.
L.T.D. HOUSE Co
L. T. D. HOUSE Co
L. T. D. HOUSE L.T.D. Co
L. T. D. HOUSE L. T. D. Co
;
run;
I expect to get
L.T.D. HOUSE LTD
L. T. D. HOUSE LTD
L. T. D. HOUSE LTD
L.T.D. HOUSE LTD
L.T.D. HOUSE Co /the 'L.T.D.' should not be converted as it is at the beginning of the strings*/
L. T. D. HOUSE Co /the 'L.T.D.' should not be converted as it is at the beginning of the strings*/
L. T. D. HOUSE LTD Co /* the first 'LTD' from the back to the front of strings*/
L. T. D. HOUSE LTD Co /* the first 'LTD' from the back to the front of strings*/
Could you give me some suggestion about this?
@Alexxxxxxx Have you even tried the code provided by @ScottBass ?
It should give you the expected result.
A slightly different alternative that may (or not) suit you better:
data WANT;
RX=prxparse('s/L\. ?T\. ?D\.$/LTD/o');
set HAVE;
NEWNAME=prxchange(RX, 1, strip(NAME));
run;
NEWNAME |
---|
L.T.D. HOUSE LTD |
L. T. D. HOUSE LTD |
L. T. D. HOUSE LTD |
L.T.D. HOUSE L. T. D.L.T.D. HOUSE CoL. T. D. HOUSE Co |
The regular expression reads:
1. Match this: L then dot then optional space then T then dot then optional space then D then dot then end of string
2. Replace the match with LTD
Then a request to do this once on the stripped string
Lots of RegEx tutorials of the web. That's how I learnt.
Now for the second set of data
data HAVE;
input NAME &:$100.;
infile datalines missover;
datalines;
L.T.D. HOUSE L.T.D.
L. T. D. HOUSE L.T.D.
L. T. D. HOUSE L. T. D.
L.T.D. HOUSE L. T. D.
L.T.D. HOUSE Co
L. T. D. HOUSE Co
L. T. D. HOUSE L.T.D. Co
L. T. D. HOUSE L. T. D. Co
run;
I would either :
- Reverse the string (using the function with that name) and reverse the regular expression (i.e. match DTL, not LTD)
or probably easier
- Ignore the first character of the string (remove it using function substr() and then add it back after the RegEx substitution is applied)
This works too if you only have 2 LTD groups (otherwise use reverse() as mentioned):
data HAVE;
input NAME &:$100.;
infile datalines missover;
datalines;
L.T.D. HOUSE L.T.D.
L. T. D. HOUSE L.T.D.
L. T. D. HOUSE L. T. D.
L.T.D. HOUSE L. T. D.
L.T.D. HOUSE Co
L. T. D. HOUSE Co
L. T. D. HOUSE L.T.D. Co
L. T. D. HOUSE L. T. D. Co
run;
data WANT;
RX=prxparse('s/(.+)(L\. ?T\. ?D\.)(.*)/$1LTD$3/o');
set HAVE;
NEWNAME=prxchange(RX, 1, strip(NAME));
run;
NAME | NEWNAME |
---|---|
L.T.D. HOUSE L.T.D. | L.T.D. HOUSE LTD |
L. T. D. HOUSE L.T.D. | L. T. D. HOUSE LTD |
L. T. D. HOUSE L. T. D. | L. T. D. HOUSE LTD |
L.T.D. HOUSE L. T. D. | L.T.D. HOUSE LTD |
L.T.D. HOUSE Co | L.T.D. HOUSE Co |
L. T. D. HOUSE Co | L. T. D. HOUSE Co |
L. T. D. HOUSE L.T.D. Co | L. T. D. HOUSE LTD Co |
L. T. D. HOUSE L. T. D. Co | L. T. D. HOUSE LTD Co |
Dear @ChrisNZ,
thanks for your kind help. I review my method and data in the past few days, the result may not clear because of my old logic. So, I update the method and divide it into two steps, the first is to process the company suffix if they are at the ending of strings. (which is most common).
for example,
data HAVE;
input NAME &:$150.;
infile datalines missover;
datalines;
ALARKO GAYRIMENKUL YATIRIM ORTAKLIGI A.S
HYDRO ALUMINIUM A. S.
HYDRO ALUMINIUM A.S.
HYDRO ALUMINIUM A.S
;
run;
I can use the following code to unify them to 'AS'.
data want;
set have;
NAME_new=prxchange("s/(A\.S\.)$|(A\. *S\.)$|(A\.S)$/ AS /o",-1,strip(NAME));
run;
However, I worry about the coincidence that the 'AS' is not a company suffix but a part of a company name. By using the data HAVE as an example,
data HAVE2;
input NAME &:$150.;
infile datalines missover;
datalines;
ALARKO GAYRIMENKUL YATIRIM ORTAKLIGI X.A.S
HYDRO ALUMINIUM X. A. S.
HYDRO ALUMINIUM X.A.S.
HYDRO ALUMINIUM X.A.S
;
run;
the strings 'X.A.S' or 'X. A. S.' which writen in the same format, are assumed to be a word, so they should not proceed. however, I get the following result by code,
NAME | NAME6 |
ALARKO GAYRIMENKUL YATIRIM ORTAKLIGI X.A.S | ALARKO GAYRIMENKUL YATIRIM ORTAKLIGI X. AS |
HYDRO ALUMINIUM X. A. S. | HYDRO ALUMINIUM X. AS |
HYDRO ALUMINIUM X.A.S. | HYDRO ALUMINIUM X. AS |
HYDRO ALUMINIUM X.A.S | HYDRO ALUMINIUM X. AS |
How can I fix this problem? Could you please give me some suggestions about this? THX in advance.
https://www.google.com/search?q=online+regex+builder
Try a few, decide which one works best for you.
hello @ScottBass,
thanks for your recommendation.
I have started to use this and tried a lot of times, but not work. Could you please give me some suggestions about the code?
From what you are now showing, you have to look at the preceding characters (and ensure they are not dot or dot-space).
This is done by using a negative look-behind assertion. The RegEx syntax is (?<! )
Hello @ChrisNZ
do you meaning
data want;
set have;
NAME_new=prxchange("s/(?<! )(A\.S\.)$|(?<! )(A\. *S\.)$|(?<! )(A\.S)$/ AS /o",-1,strip(NAME));
run;
?
But it does not work.
> But it does not work.
@Alexxxxxxx You have been supplied with many examples of regular expressions -and their explanation- as you kept updating the requirements for your data in successive posts. It is now time that you learn how to write your own.
Search the web for negative look-behind assertion to obtain countless examples and tutorials. The syntax you provide does not do what you need.
In my opinion, the step even before that would be to define exactly what you need.
I'm really busy with "stuff", so this will be my last reply. If this doesn't meet your needs, then Google and/or the SAS documentation are your friends.
I can only go by your sample data. If it is not representative of your actual and complete data, then this may not work.
In your sample data, you always want to convert some aberration of "L.T.D." to "LTD", after the text "HOUSE". If that is representative of your complete data, then this will work. It's not the most robust; if you need a more robust approach, see above, or keep posting and perhaps someone else will spoon feed you the complete solution.
data HAVE;
input NAME &:$100.;
infile datalines missover;
datalines;
L.T.D. HOUSE L.T.D.
L. T. D. HOUSE L.T.D.
L. T. D. HOUSE L. T. D.
L.T.D. HOUSE L. T. D.
L.T.D. HOUSE Co
L. T. D. HOUSE Co
L. T. D. HOUSE L.T.D. Co
L. T. D. HOUSE L. T. D. Co
;
run;
data want;
rx=prxparse("s/(HOUSE) *(L\. *T\. *D\.)(.*)$/\1 LTD\3/o");
set have;
newname=prxchange(rx,-1,strip(name));
run;
Good news: We've extended SAS Hackathon registration until Sept. 12, so you still have time to be part of our biggest event yet – our five-year anniversary!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.