Create Variable Based on String searched for

NewSASPerson · Posted 10-30-2019 06:48 PM

Hello,

I would like some help to add a new variable based on what I search for. the code I have does not do that. I used the scan function but it will only return based on the position of the word.

This is what my output gives me

name avg_amt home_name
D.R. Horton INC 250 D
Lennar Corp. LLC 280 Lennar
PulteGroup INC 300 PulteGroup
Meritage Homes LLC 320 Meritage
Century Communities LLC 500 Century

M.D.C. Holdings LLC 369 M

But this is what I want

name avg_amt home_name
D.R. Horton INC 250 D.R. HORTON
Lennar Corp. LLC 280 LENNAR
PulteGroup INC 300 PULTE
Meritage Homes LLC 320 MERITAGE HOMES
Century Communities LLC 500 CENTURY

M.D.C. Holdings LLC 369 M.D.C HOLDINGS

Here is my code:

data homes;
length name $30;
infile datalines dlm=",";
input name $ avg_amt;
datalines;
D.R. Horton INC, 250
Lennar Corp. LLC , 280
PulteGroup INC, 300
NVR INC , 258
KB Home LLC ,987
Taylor Morrison INC,350
Meritage Homes LLC,320
Toll Brothers LLC , 200
Century Communities LLC, 500
LGI Homes INC ,253
M.D.C. Holdings LLC,369
;
run;

Any help I can get will be appreciated.

mkeintz · Posted 10-30-2019 09:02 PM

Instead of the PRXMATCH function, you will probably be better off with the =: and/or the in: operator, which will allow you to compare two character values in a special way. Namely it determines whether all the characters of the shorter value match the leading characters of the longer value, as in ...

data homes;
  length name $30;
  infile datalines dlm=",";
  input name $ avg_amt;
datalines;
D.R. Horton INC,  250
Lennar Corp. LLC , 280
PulteGroup INC,  300
NVR INC , 258
KB Home LLC ,987
Taylor Morrison  INC,350
Meritage Homes LLC,320
Toll Brothers LLC , 200
Century Communities LLC, 500
LGI Homes INC ,253
M.D.C. Holdings  LLC,369
;


data want;
  set homes;
  home_name=upcase(name);
  if home_name in: ('D.R. HORTON','MERITAGE HOMES','LENNAR','CENTURY','M.D.C. HOLDINGS','PULTE');
  home_name=tranwrd(home_name,scan(home_name,-1),' ');
run;

The subsetting IF above uses the IN: operator.

You also seem to want to remove the last word of the resulting name. You can do this by translating the value of the last word (extracted by the SCAN(home_name,-1) function to a blank. The works fine as long as you don't have an earlier word in home_name that is identical to the last word.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

NewSASPerson · Posted 10-30-2019 11:03 PM

this does not give me just the text searched for. The values for Lennar, Pulte & Century are not correct. I need the field to be the exact string searched for.

name avg_amt home_name
D.R. Horton INC 250 D.R. HORTON
Lennar Corp. LLC 280 LENNAR CORP.
PulteGroup INC 300 PULTEGROUP
Meritage Homes LLC 320 MERITAGE HOMES
Century Communities LLC 500 CENTURY COMMUNITIES
M.D.C. Holdings LLC 369 M.D.C. HOLDINGS

mkeintz · Posted 10-30-2019 11:23 PM

@NewSASPerson wrote:

this does not give me just the text searched for. The values for Lennar, Pulte & Century are not correct. I need the field to be the exact string searched for.

name avg_amt home_name
D.R. Horton INC 250 D.R. HORTON
Lennar Corp. LLC 280 LENNAR CORP.
PulteGroup INC 300 PULTEGROUP
Meritage Homes LLC 320 MERITAGE HOMES
Century Communities LLC 500 CENTURY COMMUNITIES
M.D.C. Holdings LLC 369 M.D.C. HOLDINGS

The program I supplied generates a data set the looks, to me, exactly like what you show above. Here is the proc print result:

      Obs    name                       avg_amt    home_name
         1     D.R. Horton INC              250      D.R. HORTON
         2     Lennar Corp. LLC             280      LENNAR CORP.
         3     PulteGroup INC               300      PULTEGROUP
         4     Meritage Homes LLC           320      MERITAGE HOMES
         5     Century Communities LLC      500      CENTURY COMMUNITIES
         6     M.D.C. Holdings  LLC         369      M.D.C. HOLDINGS

If you don't get the above, please provide your SAS log, and the unexpected data that you got.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

NewSASPerson · Posted 10-31-2019 05:58 AM

Yes that's what I get with your code. However for Home_Name it needs to say LENNAR but yours says LENNAR CORP. & PULTE but yours says PULTEGROUP

I hope that helps clarify the need.

mkeintz · Posted 10-31-2019 01:39 PM

Aha. Didn't notice that. Then your best bet is probably the solution offered by @s_lassen

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

s_lassen · Posted 10-31-2019 08:08 AM

It is somewhat variable what parts of a name you need. I would suggest using a lookup table and SQL:

data lookup;
  infile datalines dlm=',';
  length lookup_str home_name $30;
  input lookup_str home_name;
cards;
D.R. HORTON %,D.R. HORTON
LENNAR %,LENNAR
PULTE%,PULTE
MERITAGE HOMES %,MERITAGE HOMES
CENTURY %,CENTURY
M.D.C HOLDINGS %,M.D.C HOLDINGS
;run;

proc sql;
  create table want as select homes.*,lookup.home_name
  from homes join lookup on upcase(homes.name) like trim(lookup.lookup_str);
quit;

You may have to check against your actual data and design your lookup strings carefully, but there is a chance that this will work.

ChrisNZ · Posted 10-31-2019 08:53 PM

Like this?

data want;
set homes;
if prxmatch("m/D.R. HORTON|MERITAGE HOMES|LENNAR|CENTURY|M.D.C. HOLDINGS|PULTE/oi",name) then
 name2= prxchange("s/(D.R. HORTON|MERITAGE HOMES|LENNAR|CENTURY|M.D.C. HOLDINGS|PULTE)(.*)/\U\1/oi",1,trim(name)) ;
run;

High-Performance SAS Coding - Third Edition

NewSASPerson · Posted 11-01-2019 07:43 AM

Thank you this worked! However I am curious. What if I want to change CENTURY to BROTHERS what would I need to change in the prxchange function? THE code below gives me "Toll BROTHERS" instead of BROTHERS for name2

ChrisNZ · Posted 11-01-2019 06:55 PM

>THE code below gives me "Toll BROTHERS" i

It should not. Please provide the data sample and code (one 5-line data step).

High-Performance SAS Coding - Third Edition

NewSASPerson · Posted 11-18-2019 10:44 AM

data homes;
length name $30;
infile datalines dlm=",";
input name $ avg_amt;
datalines;
D.R. Horton INC,  250
Lennar Corp. LLC , 280
PulteGroup INC,  300
NVR INC , 258
KB Home LLC ,987
Taylor Morrison  INC,350
Meritage Homes LLC,320
Toll Brothers LLC , 200
Century Communities LLC, 500
LGI Homes INC ,253
M.D.C. Holdings  LLC,369
;
run;

data want;
set homes;
if prxmatch("m/D.R. HORTON|MERITAGE HOMES|LENNAR|CENTURY|BROTHERS|M.D.C. HOLDINGS|PULTE/oi",name) then
 name2= prxchange("s/(D.R. HORTON|MERITAGE HOMES|LENNAR|CENTURY|BROTHERS|M.D.C. HOLDINGS|PULTE)(.*)/\U\1/oi",1,trim(name)) ;
run;

These are the results below. "Toll" shows up instead of just BROTHERS

name			avg_amt	name2
D.R. Horton INC		250	D.R. HORTON
Lennar Corp. LLC	280	LENNAR
PulteGroup INC		300	PULTE
NVR INC			258	
KB Home LLC		987	
Taylor Morrison  INC	350	
Meritage Homes LLC	320	MERITAGE HOMES
Toll Brothers LLC	200	Toll BROTHERS
Century Communities LLC	500	CENTURY
LGI Homes INC		253	
M.D.C. Holdings  LLC	369	M.D.C. HOLDINGS

Patrick · Posted 11-01-2019 11:37 PM

@NewSASPerson wrote:

Thank you this worked! However I am curious. What if I want to change CENTURY to BROTHERS what would I need to change in the prxchange function? THE code below gives me "Toll BROTHERS" instead of BROTHERS for name2

May be code maintenance becomes easier for you if you use an Informat with the regexpe option and define each search pattern separately. This should allow you to add, amend, remove and test individual rules separately without having to change already working patterns.

proc format;
  invalue $name2prx(default=30 notsorted upcase)
    's/(D.R. HORTON).*/\U\1/o' (regexpe) = _same_ 
    's/(MERITAGE HOMES).*/\U\1/o' (regexpe) = _same_ 
    's/(LENNAR).*/\U\1/o' (regexpe) = _same_ 
    's/(CENTURY).*/\U\1/o' (regexpe) = _same_ 
    's/(M\.D\.C\. HOLDINGS).*/\U\1/o' (regexpe) = _same_ 
    's/(PULTE).*/\U\1/o' (regexpe) = _same_ 
    's/.*(BROTHERS).*/\U\1/o' (regexpe) = _same_ 
    's/(.*)(\bLLC\b|\bINC\b)$/\U\1/o' (regexpe) = _same_ 
    ; 
run;

data homes;
  length name $30;
  infile datalines dlm=",";
  input name $ avg_amt;
  name2=input(trim(name),$name2prx.);
datalines;
D.R. Horton INC,  250
Lennar Corp. LLC , 280
PulteGroup INC,  300
NVR INC , 258
KB Home LLC ,987
Taylor Morrison  INC,350
Meritage Homes LLC,320
Toll Brothers LLC , 200
Century Communities LLC, 500
LGI Homes INC ,253
M.D.C. Holdings  LLC,369
Toll BROTHERS,111
;

proc print;
run;

N.B: Normally with informats you can't have overlapping ranges. With RegEx search pattern things are slightly different. You can't use the exact same search pattern twice in the informat but it's possible to formulate more than one search pattern which would match a specific source string.

Using the informat notsorted option SAS will use the first pattern in the informat that matches. If you want to this allows you to formulate very specific search pattern in the beginning of the informat definition and more generic ones for the "left overs". I've added such a "generic" pattern at the end of the informat. It matches source string KB Home LLC

And building on above: It's may be even simpler if you just use normal strings (not patterns) for special cases and then only use RegEx for the remainder. This should also perform better because RegEx operations are costly.

proc format;
  invalue $name2prx(default=30 upcase notsorted)
    's/(.*)((\bCORP\.) \bLLC\b)$/\U\1/o' (regexpe) = _same_ 
    's/(.*)(LLC\b|\bINC\b)$/\U\1/o' (regexpe) = _same_ 
    ; 
  invalue $name2align(default=30 upcase)
    'PULTEGROUP INC'  = 'PULTE'
    'TOLL BROTHERS' ='BROTHERS'
    'CENTURY COMMUNITIES LLC'='CENTURY'
    other=[$name2prx.]
    ;
run;

data homes;
  length name $30;
  infile datalines dlm=",";
  input name $ avg_amt;
  name2=input(strip(name),$name2align.);
datalines;
D.R. Horton INC,  250
Lennar Corp. LLC , 280
PulteGroup INC,  300
NVR INC , 258
KB Home LLC ,987
Taylor Morrison  INC,350
Meritage Homes LLC,320
Toll Brothers LLC , 200
Century Communities LLC, 500
LGI Homes Inc ,253
M.D.C. Holdings  LLC,369
Toll BROTHERS,111
Case not covered yet Ltd.,111
;

proc print;
run;

Just out of curiosity:

How many distinct name strings do you have to clean-up? And is this a one-off process or something you need to repeat when new data arrives?

NewSASPerson · Posted 11-18-2019 10:47 AM

I like this as well, However, I have over 100 distinct strings. using an informat might be time consuming. using prxchange, I am writing the string values into a macro. thanks.

Patrick · Posted 11-18-2019 01:38 PM

@NewSASPerson wrote:

I like this as well, However, I have over 100 distinct strings. using an informat might be time consuming. using prxchange, I am writing the string values into a macro. thanks.

You do realize that one of the informats uses Regular Expressions and does the same than what you can do with prxchange().

The macro nesting then allows for a combination of RegEx and normal string replacement thus allowing you to not need a regular expression for special cases.

Create Variable Based on String searched for

Re: Create Variable Based on String searched for

Re: Create Variable Based on String searched for

Re: Create Variable Based on String searched for

Re: Create Variable Based on String searched for

Re: Create Variable Based on String searched for

Re: Create Variable Based on String searched for

Re: Create Variable Based on String searched for

Re: Create Variable Based on String searched for

Re: Create Variable Based on String searched for

Re: Create Variable Based on String searched for

Re: Create Variable Based on String searched for

Re: Create Variable Based on String searched for

Re: Create Variable Based on String searched for

SAS Innovate 2025: Save the Date

SAS Training: Just a Click Away