Solved: Re: How to create columns with values based on occurence of words (fro...

LuciaCekanakova · Posted 10-10-2019 04:42 PM

Hi all,

I'd like your help with this problem.

I have 2 tables:

I want to take each string in the STRINGS table, go over the list of values of each variable in VARIABLES table and if the value from this list occurs in the string, the value should be written into a new column that has the same name as the variable.

See the output table for better understanding:

The number of columns in VARIABLES table may vary, and the column names won't always be as regular as var1, var2...

Also, if the string contains more than 1 value fromt he list (e.g. the last observation in STRINGS table), it has to have assigned only 1 value in the new column, no matter which one (so in case of string "a b", it could be assigned either "a", "b" or "a b"). Important thing is for at least one value to be there without duplication of the original string in another row.

I am a beginner in coding in SAS, so the only way I could do this is by typing out a zillion of IF conditions for all the values for creating each new column. However, I'd like the solution to be automatic so that it doesn't matter how many variables there are, what are their names and how many values each one has.

Please, can anyone help me with this?

Here's the code for creating my data:

data strings; 
   input string $6.;
   datalines;                      
a x
aa l z
b k
b y m
c z
c x k
a b
;

data variables;
   infile datalines delimiter=',' dsd; 
   input var1 $ var2 $ var3 $;
   datalines;                      
a,k,x
b,l,y
a b,m,z
,n,
;

Thanks!

hashman · Posted 10-11-2019 03:55 PM

@LuciaCekanakova:

Well, in this case my program in its core remains the same, except:

In addition to STRING, each variable from VARIABLES has to be scanned
A provisions must be made for storing the (sub) strings from VARIABLES in their original (prop)case for further retrieval
Logic for selecting the longest matching substring has to be added

The rest is just diligent, even if somewhat tedious, coding:

data strings ;                                                                                                                          
  input string $30.;                                                                                                                    
  cards ;                                                                                                                               
Ford carrot                                                                                                                             
Honda beans                                                                                                                             
south america black beans                                                                                                               
green beans Audi                                                                                                                        
x5 chili pepper lettuce                                                                                                                 
garlic north America                                                                                                                    
central europe bmw                                                                                                                      
lettuce onion X5                                                                                                                        
America pepper                                                                                                                          
X5 BMW Europe                                                                                                                           
Central East Europe                                                                                                                     
bmw X5 america                                                                                                                          
;                                                                                                                                       
run ;                                                                                                                                   
                                                                                                                                        
data variables;                                                                                                                         
   infile datalines delimiter=',' dsd;                                                                                                  
   length cars $6 vegetables $12 regions $14;                                                                                           
   input cars $ vegetables $ regions $;                                                                                                 
   datalines;                                                                                                                           
Audi,beans,America                                                                                                                      
BMW,black beans,Central Europe                                                                                                          
BMW X5,carrot,East Europe                                                                                                               
X5,chili pepper,Europe                                                                                                                  
Honda,green beans,North America                                                                                                         
,lettuce,South America                                                                                                                  
,onion,                                                                                                                                 
,pepper,                                                                                                                                
;                                                                                                                                       
run;                                                                                                                                    
                                                                                                                                        
proc sql noprint ;                                                                                                                      
  select max (length) into :maxvlen from dictionary.columns                                                                             
  where libname="WORK" and memname="VARIABLES"                                                                                          
  ;                                                                                                                                     
quit ;                                                                                                                                  
                                                                                                                                        
data want (drop = _:) ;                                                                                                                 
  set strings ;                                                                                                                         
  if _n_ = 1 then do ;                                                                                                                  
    length _s _z $ &maxvlen ;                                                                                                           
    dcl hash h () ;                                                                                                                     
    h.definekey ("_s") ;                                                                                                                
    h.definedata ("_z", "_i_") ;                                                                                                        
    h.definedone () ;                                                                                                                   
    do until (z) ;                                                                                                                      
      set variables end = z ;                                                                                                           
      array v cars vegetables regions ;                                                                                                 
      do over v ;                                                                                                                       
        if cmiss (v) then continue ;                                                                                                    
        do _i = 1 to countw (v) ;                                                                                                       
          _z = scan (v, _i) ;                                                                                                           
          h.ref (key: upcase (_z), data: _z, data: _i_) ;                                                                               
          do _j = _i + 1 to countw (v) ;                                                                                                
            _z = catx (" ", _z, scan (v, _j)) ;                                                                                         
            h.ref (key: upcase (_z), data: _z, data: _i_) ;                                                                             
          end ;                                                                                                                         
        end ;                                                                                                                           
      end ;                                                                                                                             
    end ;                                                                                                                               
  end ;                                                                                                                                 
  call missing (of v[*]) ;                                                                                                              
  do _i = 1 to countw (string) ;                                                                                                        
    _s = scan (string, _i) ;                                                                                                            
    if h.find (key: upcase (_s)) = 0 then if length (_z) > length (v) then v = _z ;                                                     
    do _j = _i + 1 to countw (string) ;                                                                                                 
      _s = catx (" ", _s, scan (string, _j)) ;                                                                                          
      if h.find (key: upcase (_s)) = 0 then if length (_z) > length (v) then v = _z ;                                                   
    end ;                                                                                                                               
  end ;                                                                                                                                 
run ;

Kind regards

Paul D.

View solution in original post

ballardw · Posted 10-10-2019 05:20 PM

Up until your value of 'a b' this looked moderately routine. However since you are looking for a single value of 'a' from the first line of your Strings data it is going to be a bit difficult to search for just the 'a' except when 'a b' is present in a generic sense.

I ask for clarity sake, does your actual working data have exactly 1 or many Variables values that include other variable values as substrings? If so then you may need to provide more information or a more complex example closer to your actual data.

Best might be to find some way to split up that String variable into the component parts that you need if possible so that you have values of 'a' alone instead of 'a x'.

LuciaCekanakova · Posted 10-11-2019 08:34 AM

Hi ballardw,

My working data can contain many variable values that include other variable values as substrings.

Here's a more complex example:

You made me realize more things that are necessary here, so all in all I would like:

to output the longest match in case of multiple occurences of values from list in a string (e.g. "green beans Audi" outputs vegetables="green beans", not just "beans")
to make the scan case-insensitive (for string "bmw x5" the output is cars="BMW X5" in the same case as was in the variables table)
to make the code independent of the names of variables and their number - these can be changed and added, so it would be great to have a universal code that can deal with any names and any numebr of variables (such as cars, vegetables etc.)

@ballardw, @hashman how would you go about this?

In case you want to give it a try, here's the code for data:

data strings ;                                                                                                            
  input string $30.;                                                                                                      
  cards ;                                                                                                                 
Ford carrot
Honda beans
south america black beans
green beans Audi
x5 chili pepper lettuce
garlic north America
central europe bmw
lettuce onion X5
America pepper
X5 BMW Europe
Central East Europe
bmw X5 america
;                                                                                                                         
run ;  

data variables;
   infile datalines delimiter=',' dsd; 
   length cars $6 vegetables $12 regions $14;
   input cars $ vegetables $ regions $;
   datalines;                      
Audi,beans,America
BMW,black beans,Central Europe
BMW X5,carrot,East Europe
X5,chili pepper,Europe
Honda,green beans,North America
,lettuce,South America
,onion,
,pepper,
;
run;

Tom · Posted 10-11-2019 08:44 AM

I still don't get what you want to do with the multi-word strings in the VARIABLES table.

Do you only want to match 'BMW X5' to STRINGS that contain 'BMW X5' in that order and not to strings that have 'X5 BMW' or 'BMW model X5'?

Do you want to match also to strings that contain only 'BMW' or only 'X5'?

LuciaCekanakova · Posted 10-11-2019 09:10 AM

Do you only want to match 'BMW X5' to STRINGS that contain 'BMW X5' in that order and not to strings that have 'X5 BMW' or 'BMW model X5'? YES, exactly.

Do you want to match also to strings that contain only 'BMW' or only 'X5'? These are matched to values 'BMW' and 'X5' that are also in the variable list, but not because they are part of 'BMW X5' that is in this list. You basically always scan the string for the exact value in the variable list. And if the value list contains values 'BMW', 'X5' and 'BMW X5', a string like the one in observation #12 would match all of these, but I'd like the output to be only the longest exact match - which is 'BMW X5' in this case. In obs #10 it matches 'X5' and 'BMW', so 'BMW' goes into output.

Tom · Posted 10-11-2019 09:12 AM

In that case use the FINDW() function.

proc sql ;
  create table want as
    select string
         , case when (findw(string,cars,' ','sit')) then cars else ' ' end as cars
         , case when (findw(string,vegetables,' ','sit')) then vegetables else ' ' end as vegetables
         , case when (findw(string,regions,' ','sit')) then regions else ' ' end as regions
    from strings,variables
    having not missing(cats(calculated cars,calculated vegetables,calculated regions))
    order by 1,2,3,4
  ;
quit;

LuciaCekanakova · Posted 10-11-2019 09:17 AM

This program finds all the matches and creates an observation for each match. How do I make it to select only the longest match for each variable?

Tom · Posted 10-11-2019 09:43 AM

Probably easier to do each "variable" separately then. I added row numbers to your sample dataset to make it easier to see what is happening. And to give a tiebreaker when there are multiple

data strings ;
  input string $30.;
  snum+1;
cards;
Ford carrot
Honda beans
south america black beans
green beans Audi
x5 chili pepper lettuce
garlic north America
central europe bmw
lettuce onion X5
America pepper
X5 BMW Europe
Central East Europe
bmw X5 america
;

data variables;
   infile cards delimiter=',' dsd;
   length cars $6 vegetables $12 regions $14;
   input cars vegetables regions;
   vnum+1;
cards;
Audi,beans,America
BMW,black beans,Central Europe
BMW X5,carrot,East Europe
X5,chili pepper,Europe
Honda,green beans,North America
,lettuce,South America
,onion,
,pepper,
;

proc sql ;
  create table vegetables as
    select snum,string
         , lengthn(vegetables) as length
         , vnum,vegetables
    from strings left join variables
      on findw(string,vegetables,' ','sit')
    order by snum,string,length,snum
  ;
quit;

data vegetables;
  set vegetables;
  by snum;
  if last.snum;
run;

proc print data=vegetables width=min;
run;

Obs    snum    string                       length    vnum    vegetables

  1      1     Ford carrot                     6        3     carrot
  2      2     Honda beans                     5        1     beans
  3      3     south america black beans      11        2     black beans
  4      4     green beans Audi               11        5     green beans
  5      5     x5 chili pepper lettuce        12        4     chili pepper
  6      6     garlic north America            0        .
  7      7     central europe bmw              0        .
  8      8     lettuce onion X5                7        6     lettuce
  9      9     America pepper                  6        8     pepper
 10     10     X5 BMW Europe                   0        .
 11     11     Central East Europe             0        .
 12     12     bmw X5 america                  0        .

LuciaCekanakova · Posted 10-11-2019 09:53 AM

For some reason my output from your code looks different:

Any idea why? There are no errors or warnings in the log.

Tom · Posted 10-11-2019 10:08 AM

Copy the code again as I did change the ordering and the IF statement to make sure it is taking the longest match.

hashman · Posted 10-11-2019 03:55 PM

@LuciaCekanakova:

Well, in this case my program in its core remains the same, except:

In addition to STRING, each variable from VARIABLES has to be scanned
A provisions must be made for storing the (sub) strings from VARIABLES in their original (prop)case for further retrieval
Logic for selecting the longest matching substring has to be added

The rest is just diligent, even if somewhat tedious, coding:

data strings ;                                                                                                                          
  input string $30.;                                                                                                                    
  cards ;                                                                                                                               
Ford carrot                                                                                                                             
Honda beans                                                                                                                             
south america black beans                                                                                                               
green beans Audi                                                                                                                        
x5 chili pepper lettuce                                                                                                                 
garlic north America                                                                                                                    
central europe bmw                                                                                                                      
lettuce onion X5                                                                                                                        
America pepper                                                                                                                          
X5 BMW Europe                                                                                                                           
Central East Europe                                                                                                                     
bmw X5 america                                                                                                                          
;                                                                                                                                       
run ;                                                                                                                                   
                                                                                                                                        
data variables;                                                                                                                         
   infile datalines delimiter=',' dsd;                                                                                                  
   length cars $6 vegetables $12 regions $14;                                                                                           
   input cars $ vegetables $ regions $;                                                                                                 
   datalines;                                                                                                                           
Audi,beans,America                                                                                                                      
BMW,black beans,Central Europe                                                                                                          
BMW X5,carrot,East Europe                                                                                                               
X5,chili pepper,Europe                                                                                                                  
Honda,green beans,North America                                                                                                         
,lettuce,South America                                                                                                                  
,onion,                                                                                                                                 
,pepper,                                                                                                                                
;                                                                                                                                       
run;                                                                                                                                    
                                                                                                                                        
proc sql noprint ;                                                                                                                      
  select max (length) into :maxvlen from dictionary.columns                                                                             
  where libname="WORK" and memname="VARIABLES"                                                                                          
  ;                                                                                                                                     
quit ;                                                                                                                                  
                                                                                                                                        
data want (drop = _:) ;                                                                                                                 
  set strings ;                                                                                                                         
  if _n_ = 1 then do ;                                                                                                                  
    length _s _z $ &maxvlen ;                                                                                                           
    dcl hash h () ;                                                                                                                     
    h.definekey ("_s") ;                                                                                                                
    h.definedata ("_z", "_i_") ;                                                                                                        
    h.definedone () ;                                                                                                                   
    do until (z) ;                                                                                                                      
      set variables end = z ;                                                                                                           
      array v cars vegetables regions ;                                                                                                 
      do over v ;                                                                                                                       
        if cmiss (v) then continue ;                                                                                                    
        do _i = 1 to countw (v) ;                                                                                                       
          _z = scan (v, _i) ;                                                                                                           
          h.ref (key: upcase (_z), data: _z, data: _i_) ;                                                                               
          do _j = _i + 1 to countw (v) ;                                                                                                
            _z = catx (" ", _z, scan (v, _j)) ;                                                                                         
            h.ref (key: upcase (_z), data: _z, data: _i_) ;                                                                             
          end ;                                                                                                                         
        end ;                                                                                                                           
      end ;                                                                                                                             
    end ;                                                                                                                               
  end ;                                                                                                                                 
  call missing (of v[*]) ;                                                                                                              
  do _i = 1 to countw (string) ;                                                                                                        
    _s = scan (string, _i) ;                                                                                                            
    if h.find (key: upcase (_s)) = 0 then if length (_z) > length (v) then v = _z ;                                                     
    do _j = _i + 1 to countw (string) ;                                                                                                 
      _s = catx (" ", _s, scan (string, _j)) ;                                                                                          
      if h.find (key: upcase (_s)) = 0 then if length (_z) > length (v) then v = _z ;                                                   
    end ;                                                                                                                               
  end ;                                                                                                                                 
run ;

Kind regards

Paul D.

LuciaCekanakova · Posted 10-14-2019 08:54 AM

Thank you! This is exactly what I need 🙂

The only thing I changed is that I created a macro variable with the list of variable names and then used it in array "v".

hashman · Posted 10-14-2019 12:23 PM

@LuciaCekanakova:

>I created a macro variable with the list of variable names and then used it in array "v"<

Given that you'd been striving to make the program as little hard coded as possible, this totally makes sense.

LuciaCekanakova · Posted 10-25-2019 10:26 AM

@hashman

Hi, I discovered 2 more things that aren't working as I expected:

When the VARIABLES table contains mulitple-word values, but the partial values are not in the value list (e.g. "South America" is there, but "South" by itself is not) they are also searched for in the STRINGS and if they are found, they are output in the new column (see the column regions in obs #2 and #8). I'd like to stop this from happening adn only search for the whole values if they have multiple-words.
In VARIABLES table it happens that the same value is in multiple columns (e.g. beans, carrot... - I added a new column with the general type of vegetable). When this occurs, the found value is output only in one variable column it was found in, it doesn't repeat in others (see columns vegetable and vegetablegeneral in obs #1,2,8,9). I'd like for these values to repeat in all the output column were they should be base don the original VARIABLES columns.

My output:

Desired output:

Please, do you have any tip on how to modify the code to achieve this?

The code so far with new data:

data strings ;                                                                                                                          
  input string $30.;                                                                                                                    
  cards ;                                                                                                                               
Ford carrot                                                                                                                             
Honda beans South Africa                                                                                                                             
south america black beans                                                                                                               
green beans Audi                                                                                                                        
x5 chili pepper                                                                                                   
garlic north America                                                                                                                    
central europe bmw                                                                                                                      
lettuce north onion X5                                                                                                                        
America pepper                                                                                                                     
X5 BMW Europe                                                                                                                           
Central East Europe                                                                                                                     
bmw X5 america                                                                                                                          
;                                                                                                                                       
run ;                                                                                                                                   
                                                                                                                                        
data variables;                                                                                                                         
   infile datalines delimiter=',' dsd;                                                                                                  
   length cars $6 vegetablegeneral $7 vegetable $12 regions $14;                                                                                           
   input cars $ vegetablegeneral $ vegetable $ regions $;                                                                                                 
   datalines;                                                                                                                           
Audi,beans,beans,America                                                                                                                      
BMW,carrot,black beans,Central Europe                                                                                                          
BMW X5,lettuce,carrot,East Europe                                                                                                               
X5,onion,chili pepper,Europe                                                                                                                  
Honda,pepper,green beans,North America                                                                                                         
,,lettuce,South America                                                                                                                  
,,onion,                                                                                                                                 
,,pepper,                                                                                                                                
;                                                                                                                                       
run;                                                                                                                                    
                                                                                                                                        
proc sql noprint ;                                                                                                                      
  select max (length) into :maxvlen from dictionary.columns                                                                             
  where libname="WORK" and memname="VARIABLES"                                                                                          
  ;                                                                                                                                     
quit ;      

proc sql noprint;                                                                                                                      
  select name into :varlist separated by ' ' from dictionary.columns
  where libname="WORK" and memname="VARIABLES";
quit;   
                                                                                                                                        
data want (drop = _:) ;                                                                                                                 
  set strings ;                                                                                                                         
  if _n_ = 1 then do ;                                                                                                                  
    length _s _z $ &maxvlen ;                                                                                                           
    dcl hash h () ;                                                                                                                     
    h.definekey ("_s") ;                                                                                                                
    h.definedata ("_z", "_i_") ;                                                                                                        
    h.definedone () ;                                                                                                                   
    do until (z) ;                                                                                                                      
      set variables end = z ;                                                                                                           
      array v &varlist ;                                                                                                 
      do over v ;                                                                                                                       
        if cmiss (v) then continue ;                                                                                                    
        do _i = 1 to countw (v) ;                                                                                                       
          _z = scan (v, _i) ;                                                                                                           
          h.ref (key: upcase (_z), data: _z, data: _i_) ;                                                                               
          do _j = _i + 1 to countw (v) ;                                                                                                
            _z = catx (" ", _z, scan (v, _j)) ;                                                                                         
            h.ref (key: upcase (_z), data: _z, data: _i_) ;                                                                             
          end ;                                                                                                                         
        end ;                                                                                                                           
      end ;                                                                                                                             
    end ;                                                                                                                               
  end ;                                                                                                                                 
  call missing (of v[*]) ;                                                                                                              
  do _i = 1 to countw (string) ;                                                                                                        
    _s = scan (string, _i) ;                                                                                                            
    if h.find (key: upcase (_s)) = 0 then if length (_z) > length (v) then v = _z ;                                                     
    do _j = _i + 1 to countw (string) ;                                                                                                 
      _s = catx (" ", _s, scan (string, _j)) ;                                                                                          
      if h.find (key: upcase (_s)) = 0 then if length (_z) > length (v) then v = _z ;                                                   
    end ;                                                                                                                               
  end ;                                                                                                                                 
run ;

hashman · Posted 10-26-2019 03:00 PM

@LuciaCekanakova: Will take a look when I get a piece of free quality time.

How to create columns with values based on occurence of words (from predefined lists) in a string?

Re: How to create columns with values based on occurence of words (from predefined lists) in a strin

Re: How to create columns with values based on occurence of words (from predefined lists) in a strin

Re: How to create columns with values based on occurence of words (from predefined lists) in a strin

Re: How to create columns with values based on occurence of words (from predefined lists) in a strin

Re: How to create columns with values based on occurence of words (from predefined lists) in a strin

Re: How to create columns with values based on occurence of words (from predefined lists) in a strin

Re: How to create columns with values based on occurence of words (from predefined lists) in a strin

Re: How to create columns with values based on occurence of words (from predefined lists) in a strin

Re: How to create columns with values based on occurence of words (from predefined lists) in a strin

Re: How to create columns with values based on occurence of words (from predefined lists) in a strin

Re: How to create columns with values based on occurence of words (from predefined lists) in a strin

Re: How to create columns with values based on occurence of words (from predefined lists) in a strin

Re: How to create columns with values based on occurence of words (from predefined lists) in a strin

Re: How to create columns with values based on occurence of words (from predefined lists) in a strin

Re: How to create columns with values based on occurence of words (from predefined lists) in a strin

Register Today!

SAS Training: Just a Click Away