Hello,
I would like some help to add a new variable based on what I search for. the code I have does not do that. I used the scan function but it will only return based on the position of the word.
This is what my output gives me
name avg_amt home_name
D.R. Horton INC 250 D
Lennar Corp. LLC 280 Lennar
PulteGroup INC 300 PulteGroup
Meritage Homes LLC 320 Meritage
Century Communities LLC 500 Century
M.D.C. Holdings LLC 369 M
But this is what I want
name avg_amt home_name
D.R. Horton INC 250 D.R. HORTON
Lennar Corp. LLC 280 LENNAR
PulteGroup INC 300 PULTE
Meritage Homes LLC 320 MERITAGE HOMES
Century Communities LLC 500 CENTURY
M.D.C. Holdings LLC 369 M.D.C HOLDINGS
Here is my code:
Instead of the PRXMATCH function, you will probably be better off with the =: and/or the in: operator, which will allow you to compare two character values in a special way. Namely it determines whether all the characters of the shorter value match the leading characters of the longer value, as in ...
data homes;
length name $30;
infile datalines dlm=",";
input name $ avg_amt;
datalines;
D.R. Horton INC, 250
Lennar Corp. LLC , 280
PulteGroup INC, 300
NVR INC , 258
KB Home LLC ,987
Taylor Morrison INC,350
Meritage Homes LLC,320
Toll Brothers LLC , 200
Century Communities LLC, 500
LGI Homes INC ,253
M.D.C. Holdings LLC,369
;
data want;
set homes;
home_name=upcase(name);
if home_name in: ('D.R. HORTON','MERITAGE HOMES','LENNAR','CENTURY','M.D.C. HOLDINGS','PULTE');
home_name=tranwrd(home_name,scan(home_name,-1),' ');
run;
The subsetting IF above uses the IN: operator.
You also seem to want to remove the last word of the resulting name. You can do this by translating the value of the last word (extracted by the SCAN(home_name,-1) function to a blank. The works fine as long as you don't have an earlier word in home_name that is identical to the last word.
this does not give me just the text searched for. The values for Lennar, Pulte & Century are not correct. I need the field to be the exact string searched for.
name avg_amt home_name
D.R. Horton INC 250 D.R. HORTON
Lennar Corp. LLC 280 LENNAR CORP.
PulteGroup INC 300 PULTEGROUP
Meritage Homes LLC 320 MERITAGE HOMES
Century Communities LLC 500 CENTURY COMMUNITIES
M.D.C. Holdings LLC 369 M.D.C. HOLDINGS
@NewSASPerson wrote:
this does not give me just the text searched for. The values for Lennar, Pulte & Century are not correct. I need the field to be the exact string searched for.
name avg_amt home_name
D.R. Horton INC 250 D.R. HORTON
Lennar Corp. LLC 280 LENNAR CORP.
PulteGroup INC 300 PULTEGROUP
Meritage Homes LLC 320 MERITAGE HOMES
Century Communities LLC 500 CENTURY COMMUNITIES
M.D.C. Holdings LLC 369 M.D.C. HOLDINGS
The program I supplied generates a data set the looks, to me, exactly like what you show above. Here is the proc print result:
Obs name avg_amt home_name 1 D.R. Horton INC 250 D.R. HORTON 2 Lennar Corp. LLC 280 LENNAR CORP. 3 PulteGroup INC 300 PULTEGROUP 4 Meritage Homes LLC 320 MERITAGE HOMES 5 Century Communities LLC 500 CENTURY COMMUNITIES 6 M.D.C. Holdings LLC 369 M.D.C. HOLDINGS
If you don't get the above, please provide your SAS log, and the unexpected data that you got.
Yes that's what I get with your code. However for Home_Name it needs to say LENNAR but yours says LENNAR CORP. & PULTE but yours says PULTEGROUP
I hope that helps clarify the need.
Aha. Didn't notice that. Then your best bet is probably the solution offered by @s_lassen
It is somewhat variable what parts of a name you need. I would suggest using a lookup table and SQL:
data lookup;
infile datalines dlm=',';
length lookup_str home_name $30;
input lookup_str home_name;
cards;
D.R. HORTON %,D.R. HORTON
LENNAR %,LENNAR
PULTE%,PULTE
MERITAGE HOMES %,MERITAGE HOMES
CENTURY %,CENTURY
M.D.C HOLDINGS %,M.D.C HOLDINGS
;run;
proc sql;
create table want as select homes.*,lookup.home_name
from homes join lookup on upcase(homes.name) like trim(lookup.lookup_str);
quit;
You may have to check against your actual data and design your lookup strings carefully, but there is a chance that this will work.
Like this?
data want;
set homes;
if prxmatch("m/D.R. HORTON|MERITAGE HOMES|LENNAR|CENTURY|M.D.C. HOLDINGS|PULTE/oi",name) then
name2= prxchange("s/(D.R. HORTON|MERITAGE HOMES|LENNAR|CENTURY|M.D.C. HOLDINGS|PULTE)(.*)/\U\1/oi",1,trim(name)) ;
run;
Thank you this worked! However I am curious. What if I want to change CENTURY to BROTHERS what would I need to change in the prxchange function? THE code below gives me "Toll BROTHERS" instead of BROTHERS for name2
data want;
set homes;
if prxmatch("m/D.R. HORTON|MERITAGE HOMES|LENNAR|BROTHERS|M.D.C. HOLDINGS|PULTE/oi",name) then
name2= prxchange("s/(D.R. HORTON|MERITAGE HOMES|LENNAR|BROTHERS|M.D.C. HOLDINGS|PULTE)(.*)/\U\1/oi",1,trim(name)) ;
run;
>THE code below gives me "Toll BROTHERS" i
It should not. Please provide the data sample and code (one 5-line data step).
data homes;
length name $30;
infile datalines dlm=",";
input name $ avg_amt;
datalines;
D.R. Horton INC, 250
Lennar Corp. LLC , 280
PulteGroup INC, 300
NVR INC , 258
KB Home LLC ,987
Taylor Morrison INC,350
Meritage Homes LLC,320
Toll Brothers LLC , 200
Century Communities LLC, 500
LGI Homes INC ,253
M.D.C. Holdings LLC,369
;
run;
data want;
set homes;
if prxmatch("m/D.R. HORTON|MERITAGE HOMES|LENNAR|CENTURY|BROTHERS|M.D.C. HOLDINGS|PULTE/oi",name) then
name2= prxchange("s/(D.R. HORTON|MERITAGE HOMES|LENNAR|CENTURY|BROTHERS|M.D.C. HOLDINGS|PULTE)(.*)/\U\1/oi",1,trim(name)) ;
run;
name avg_amt name2
D.R. Horton INC 250 D.R. HORTON
Lennar Corp. LLC 280 LENNAR
PulteGroup INC 300 PULTE
NVR INC 258
KB Home LLC 987
Taylor Morrison INC 350
Meritage Homes LLC 320 MERITAGE HOMES
Toll Brothers LLC 200 Toll BROTHERS
Century Communities LLC 500 CENTURY
LGI Homes INC 253
M.D.C. Holdings LLC 369 M.D.C. HOLDINGS
@NewSASPerson wrote:
Thank you this worked! However I am curious. What if I want to change CENTURY to BROTHERS what would I need to change in the prxchange function? THE code below gives me "Toll BROTHERS" instead of BROTHERS for name2
May be code maintenance becomes easier for you if you use an Informat with the regexpe option and define each search pattern separately. This should allow you to add, amend, remove and test individual rules separately without having to change already working patterns.
proc format;
invalue $name2prx(default=30 notsorted upcase)
's/(D.R. HORTON).*/\U\1/o' (regexpe) = _same_
's/(MERITAGE HOMES).*/\U\1/o' (regexpe) = _same_
's/(LENNAR).*/\U\1/o' (regexpe) = _same_
's/(CENTURY).*/\U\1/o' (regexpe) = _same_
's/(M\.D\.C\. HOLDINGS).*/\U\1/o' (regexpe) = _same_
's/(PULTE).*/\U\1/o' (regexpe) = _same_
's/.*(BROTHERS).*/\U\1/o' (regexpe) = _same_
's/(.*)(\bLLC\b|\bINC\b)$/\U\1/o' (regexpe) = _same_
;
run;
data homes;
length name $30;
infile datalines dlm=",";
input name $ avg_amt;
name2=input(trim(name),$name2prx.);
datalines;
D.R. Horton INC, 250
Lennar Corp. LLC , 280
PulteGroup INC, 300
NVR INC , 258
KB Home LLC ,987
Taylor Morrison INC,350
Meritage Homes LLC,320
Toll Brothers LLC , 200
Century Communities LLC, 500
LGI Homes INC ,253
M.D.C. Holdings LLC,369
Toll BROTHERS,111
;
proc print;
run;
N.B: Normally with informats you can't have overlapping ranges. With RegEx search pattern things are slightly different. You can't use the exact same search pattern twice in the informat but it's possible to formulate more than one search pattern which would match a specific source string.
Using the informat notsorted option SAS will use the first pattern in the informat that matches. If you want to this allows you to formulate very specific search pattern in the beginning of the informat definition and more generic ones for the "left overs". I've added such a "generic" pattern at the end of the informat. It matches source string KB Home LLC
And building on above: It's may be even simpler if you just use normal strings (not patterns) for special cases and then only use RegEx for the remainder. This should also perform better because RegEx operations are costly.
proc format;
invalue $name2prx(default=30 upcase notsorted)
's/(.*)((\bCORP\.) \bLLC\b)$/\U\1/o' (regexpe) = _same_
's/(.*)(LLC\b|\bINC\b)$/\U\1/o' (regexpe) = _same_
;
invalue $name2align(default=30 upcase)
'PULTEGROUP INC' = 'PULTE'
'TOLL BROTHERS' ='BROTHERS'
'CENTURY COMMUNITIES LLC'='CENTURY'
other=[$name2prx.]
;
run;
data homes;
length name $30;
infile datalines dlm=",";
input name $ avg_amt;
name2=input(strip(name),$name2align.);
datalines;
D.R. Horton INC, 250
Lennar Corp. LLC , 280
PulteGroup INC, 300
NVR INC , 258
KB Home LLC ,987
Taylor Morrison INC,350
Meritage Homes LLC,320
Toll Brothers LLC , 200
Century Communities LLC, 500
LGI Homes Inc ,253
M.D.C. Holdings LLC,369
Toll BROTHERS,111
Case not covered yet Ltd.,111
;
proc print;
run;
Just out of curiosity:
How many distinct name strings do you have to clean-up? And is this a one-off process or something you need to repeat when new data arrives?
I like this as well, However, I have over 100 distinct strings. using an informat might be time consuming. using prxchange, I am writing the string values into a macro. thanks.
@NewSASPerson wrote:
I like this as well, However, I have over 100 distinct strings. using an informat might be time consuming. using prxchange, I am writing the string values into a macro. thanks.
You do realize that one of the informats uses Regular Expressions and does the same than what you can do with prxchange().
The macro nesting then allows for a combination of RegEx and normal string replacement thus allowing you to not need a regular expression for special cases.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.