BookmarkSubscribeRSS Feed
NewSASPerson
Quartz | Level 8

Hello,

 

I would like some help to add a new variable based on what I search for. the code I have does not do that. I used the scan function but it will only return based on the position of the word.

 

This is what my output gives me

 

name avg_amt home_name
D.R. Horton INC 250 D
Lennar Corp. LLC 280 Lennar
PulteGroup INC 300 PulteGroup
Meritage Homes LLC 320 Meritage
Century Communities LLC 500 Century

M.D.C. Holdings  LLC 369 M

 

 

But this is what I want

 

name avg_amt home_name
D.R. Horton INC 250 D.R. HORTON
Lennar Corp. LLC 280 LENNAR
PulteGroup INC 300 PULTE
Meritage Homes LLC 320 MERITAGE HOMES
Century Communities LLC 500 CENTURY

M.D.C. Holdings  LLC 369 M.D.C HOLDINGS

 

Here is my code:

 

 


data homes;
length name $30;
infile datalines dlm=",";
input name $ avg_amt;
datalines;
D.R. Horton INC,  250
Lennar Corp. LLC , 280
PulteGroup INC,  300
NVR INC , 258
KB Home LLC ,987
Taylor Morrison  INC,350
Meritage Homes LLC,320
Toll Brothers LLC , 200
Century Communities LLC, 500
LGI Homes INC ,253
M.D.C. Holdings  LLC,369
;
run;
data want;
set homes;
home_name = scan(name,1);
if prxmatch("m/D.R. HORTON|MERITAGE HOMES|LENNAR|CENTURY|M.D.C. HOLDINGS|PULTE/oi",name) ;
run;
 
 
Any help I can get will be appreciated.

 

 

 

13 REPLIES 13
mkeintz
PROC Star

Instead of the PRXMATCH function, you will probably be better off with the =:   and/or the in: operator, which will allow you to compare two character values in a special way.  Namely it determines whether all the characters of the shorter value match the leading characters of the longer value, as in ...

 

data homes;
  length name $30;
  infile datalines dlm=",";
  input name $ avg_amt;
datalines;
D.R. Horton INC,  250
Lennar Corp. LLC , 280
PulteGroup INC,  300
NVR INC , 258
KB Home LLC ,987
Taylor Morrison  INC,350
Meritage Homes LLC,320
Toll Brothers LLC , 200
Century Communities LLC, 500
LGI Homes INC ,253
M.D.C. Holdings  LLC,369
;


data want;
  set homes;
  home_name=upcase(name);
  if home_name in: ('D.R. HORTON','MERITAGE HOMES','LENNAR','CENTURY','M.D.C. HOLDINGS','PULTE');
  home_name=tranwrd(home_name,scan(home_name,-1),' ');
run;

The subsetting IF above uses the IN: operator.

 

You also seem to want to remove the last word of the resulting name.  You can do this by translating the value of the last word (extracted by the SCAN(home_name,-1) function to a blank.   The works fine as long as you don't have an earlier word in home_name that is identical to the last word.

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
NewSASPerson
Quartz | Level 8

this does not give me just the text searched for. The values for Lennar, Pulte & Century are not correct. I need the field to be the exact string searched for.

 

name avg_amt home_name
D.R. Horton INC 250 D.R. HORTON
Lennar Corp. LLC 280 LENNAR CORP.
PulteGroup INC 300 PULTEGROUP
Meritage Homes LLC 320 MERITAGE HOMES
Century Communities LLC 500 CENTURY COMMUNITIES
M.D.C. Holdings  LLC 369 M.D.C. HOLDINGS

mkeintz
PROC Star

@NewSASPerson wrote:

this does not give me just the text searched for. The values for Lennar, Pulte & Century are not correct. I need the field to be the exact string searched for.

 

name avg_amt home_name
D.R. Horton INC 250 D.R. HORTON
Lennar Corp. LLC 280 LENNAR CORP.
PulteGroup INC 300 PULTEGROUP
Meritage Homes LLC 320 MERITAGE HOMES
Century Communities LLC 500 CENTURY COMMUNITIES
M.D.C. Holdings  LLC 369 M.D.C. HOLDINGS


 

 

The program I supplied generates a data set the looks, to me, exactly like what you show above.  Here is the proc print result:

 

      Obs    name                       avg_amt    home_name
         1     D.R. Horton INC              250      D.R. HORTON
         2     Lennar Corp. LLC             280      LENNAR CORP.
         3     PulteGroup INC               300      PULTEGROUP
         4     Meritage Homes LLC           320      MERITAGE HOMES
         5     Century Communities LLC      500      CENTURY COMMUNITIES
         6     M.D.C. Holdings  LLC         369      M.D.C. HOLDINGS

 

If you don't get the above, please provide your SAS log, and the unexpected data that you got. 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
NewSASPerson
Quartz | Level 8

Yes that's what I get with your code. However for Home_Name it needs to say LENNAR but yours says LENNAR CORP. & PULTE but yours says PULTEGROUP

 

I hope that helps clarify the need.

mkeintz
PROC Star

Aha.  Didn't notice that.  Then your best bet is probably the solution offered by @s_lassen 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
s_lassen
Meteorite | Level 14

It is somewhat variable what parts of a name you need. I would suggest using a lookup table and SQL:

data lookup;
  infile datalines dlm=',';
  length lookup_str home_name $30;
  input lookup_str home_name;
cards;
D.R. HORTON %,D.R. HORTON
LENNAR %,LENNAR
PULTE%,PULTE
MERITAGE HOMES %,MERITAGE HOMES
CENTURY %,CENTURY
M.D.C HOLDINGS %,M.D.C HOLDINGS
;run;

proc sql;
  create table want as select homes.*,lookup.home_name
  from homes join lookup on upcase(homes.name) like trim(lookup.lookup_str);
quit;

You may have to check against your actual data and design your lookup strings carefully, but there is a chance that this will work. 

ChrisNZ
Tourmaline | Level 20

Like this?

 

data want;
set homes;
if prxmatch("m/D.R. HORTON|MERITAGE HOMES|LENNAR|CENTURY|M.D.C. HOLDINGS|PULTE/oi",name) then
 name2= prxchange("s/(D.R. HORTON|MERITAGE HOMES|LENNAR|CENTURY|M.D.C. HOLDINGS|PULTE)(.*)/\U\1/oi",1,trim(name)) ;
run;

 

 

 

NewSASPerson
Quartz | Level 8

Thank you this worked! However I am curious. What if I want to change  CENTURY to BROTHERS what would I need to change in the prxchange function? THE code below gives me "Toll BROTHERS" instead of BROTHERS for name2

 

data want;
set homes;
if prxmatch("m/D.R. HORTON|MERITAGE HOMES|LENNAR|BROTHERS|M.D.C. HOLDINGS|PULTE/oi",name) then
name2
= prxchange("s/(D.R. HORTON|MERITAGE HOMES|LENNAR|BROTHERS|M.D.C. HOLDINGS|PULTE)(.*)/\U\1/oi",1,trim(name)) ;
run;

ChrisNZ
Tourmaline | Level 20

>THE code below gives me "Toll BROTHERS" i

It should not. Please provide the data sample and code (one 5-line data step).

NewSASPerson
Quartz | Level 8
data homes;
length name $30;
infile datalines dlm=",";
input name $ avg_amt;
datalines;
D.R. Horton INC,  250
Lennar Corp. LLC , 280
PulteGroup INC,  300
NVR INC , 258
KB Home LLC ,987
Taylor Morrison  INC,350
Meritage Homes LLC,320
Toll Brothers LLC , 200
Century Communities LLC, 500
LGI Homes INC ,253
M.D.C. Holdings  LLC,369
;
run;

data want;
set homes;
if prxmatch("m/D.R. HORTON|MERITAGE HOMES|LENNAR|CENTURY|BROTHERS|M.D.C. HOLDINGS|PULTE/oi",name) then
 name2= prxchange("s/(D.R. HORTON|MERITAGE HOMES|LENNAR|CENTURY|BROTHERS|M.D.C. HOLDINGS|PULTE)(.*)/\U\1/oi",1,trim(name)) ;
run;
These are the results below. "Toll" shows up instead of just BROTHERS
name			avg_amt	name2
D.R. Horton INC		250	D.R. HORTON
Lennar Corp. LLC	280	LENNAR
PulteGroup INC		300	PULTE
NVR INC			258	
KB Home LLC		987	
Taylor Morrison  INC	350	
Meritage Homes LLC	320	MERITAGE HOMES
Toll Brothers LLC	200	Toll BROTHERS
Century Communities LLC	500	CENTURY
LGI Homes INC		253	
M.D.C. Holdings  LLC	369	M.D.C. HOLDINGS
 
Patrick
Opal | Level 21

@NewSASPerson wrote:

Thank you this worked! However I am curious. What if I want to change  CENTURY to BROTHERS what would I need to change in the prxchange function? THE code below gives me "Toll BROTHERS" instead of BROTHERS for name2

 


May be code maintenance becomes easier for you if you use an Informat with the regexpe option and define each search pattern separately. This should allow you to add, amend, remove and test individual rules separately without having to change already working patterns.

proc format;
  invalue $name2prx(default=30 notsorted upcase)
    's/(D.R. HORTON).*/\U\1/o' (regexpe) = _same_ 
    's/(MERITAGE HOMES).*/\U\1/o' (regexpe) = _same_ 
    's/(LENNAR).*/\U\1/o' (regexpe) = _same_ 
    's/(CENTURY).*/\U\1/o' (regexpe) = _same_ 
    's/(M\.D\.C\. HOLDINGS).*/\U\1/o' (regexpe) = _same_ 
    's/(PULTE).*/\U\1/o' (regexpe) = _same_ 
    's/.*(BROTHERS).*/\U\1/o' (regexpe) = _same_ 
    's/(.*)(\bLLC\b|\bINC\b)$/\U\1/o' (regexpe) = _same_ 
    ; 
run;

data homes;
  length name $30;
  infile datalines dlm=",";
  input name $ avg_amt;
  name2=input(trim(name),$name2prx.);
datalines;
D.R. Horton INC,  250
Lennar Corp. LLC , 280
PulteGroup INC,  300
NVR INC , 258
KB Home LLC ,987
Taylor Morrison  INC,350
Meritage Homes LLC,320
Toll Brothers LLC , 200
Century Communities LLC, 500
LGI Homes INC ,253
M.D.C. Holdings  LLC,369
Toll BROTHERS,111
;

proc print;
run;

N.B: Normally with informats you can't have overlapping ranges. With RegEx search pattern things are slightly different. You can't use the exact same search pattern twice in the informat but it's possible to formulate more than one search pattern which would match a specific source string.

Using the informat notsorted option SAS will use the first pattern in the informat that matches. If you want to this allows you to formulate very specific search pattern in the beginning of the informat definition and more generic ones for the "left overs". I've added such a "generic" pattern at the end of the informat. It matches source string KB Home LLC

 

And building on above: It's may be even simpler if you just use normal strings (not patterns) for special cases and then only use RegEx for the remainder. This should also perform better because RegEx operations are costly.

proc format;
  invalue $name2prx(default=30 upcase notsorted)
    's/(.*)((\bCORP\.) \bLLC\b)$/\U\1/o' (regexpe) = _same_ 
    's/(.*)(LLC\b|\bINC\b)$/\U\1/o' (regexpe) = _same_ 
    ; 
  invalue $name2align(default=30 upcase)
    'PULTEGROUP INC'  = 'PULTE'
    'TOLL BROTHERS' ='BROTHERS'
    'CENTURY COMMUNITIES LLC'='CENTURY'
    other=[$name2prx.]
    ;
run;

data homes;
  length name $30;
  infile datalines dlm=",";
  input name $ avg_amt;
  name2=input(strip(name),$name2align.);
datalines;
D.R. Horton INC,  250
Lennar Corp. LLC , 280
PulteGroup INC,  300
NVR INC , 258
KB Home LLC ,987
Taylor Morrison  INC,350
Meritage Homes LLC,320
Toll Brothers LLC , 200
Century Communities LLC, 500
LGI Homes Inc ,253
M.D.C. Holdings  LLC,369
Toll BROTHERS,111
Case not covered yet Ltd.,111
;

proc print;
run;

Capture.JPG 

 

Just out of curiosity:

How many distinct name strings do you have to clean-up? And is this a one-off process or something you need to repeat when new data arrives?

NewSASPerson
Quartz | Level 8

I like this as well, However, I have over 100 distinct strings. using an informat might be time consuming. using prxchange, I am writing the string values into a macro. thanks.

Patrick
Opal | Level 21

@NewSASPerson wrote:

I like this as well, However, I have over 100 distinct strings. using an informat might be time consuming. using prxchange, I am writing the string values into a macro. thanks.


You do realize that one of the informats uses Regular Expressions and does the same than what you can do with prxchange(). 

The macro nesting then allows for a combination of RegEx and normal string replacement thus allowing you to not need a regular expression for special cases.

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 13 replies
  • 1440 views
  • 5 likes
  • 5 in conversation