BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
LuciaCekanakova
Obsidian | Level 7

Hi all,

I'd like your help with this problem.

I have 2 tables:

image.pngimage.png

 

I want to take each string in the STRINGS table, go over the list of values of each variable in VARIABLES table and if the value from this list occurs in the string, the value should be written into a new column that has the same name as the variable.

 

See the output table for better understanding:

image.png

 

The number of columns in VARIABLES table may vary, and the column names won't always be as regular as var1, var2...

Also, if the string contains more than 1 value fromt he list (e.g. the last observation in STRINGS table), it has to have assigned only 1 value in the new column, no matter which one (so in case of string "a b", it could be assigned either "a", "b" or "a b"). Important thing is for at least one value to be there without duplication of the original string in another row. 

 

I am a beginner in coding in SAS, so the only way I could do this is by typing out a zillion of IF conditions for all the values for creating each new column. However, I'd like the solution to be automatic so that it doesn't matter how many variables there are, what are their names and how many values each one has. 

 

Please, can anyone help me with this?

 

Here's the code for creating my data:

data strings; 
   input string $6.;
   datalines;                      
a x
aa l z
b k
b y m
c z
c x k
a b
;

data variables;
   infile datalines delimiter=',' dsd; 
   input var1 $ var2 $ var3 $;
   datalines;                      
a,k,x
b,l,y
a b,m,z
,n,
;

 

Thanks!

1 ACCEPTED SOLUTION

Accepted Solutions
hashman
Ammonite | Level 13

@LuciaCekanakova:

Well, in this case my program in its core remains the same, except:

  • In addition to STRING, each variable from VARIABLES has to be scanned
  • A provisions must be made for storing the (sub) strings from VARIABLES in their original (prop)case for further retrieval
  • Logic for selecting the longest matching substring has to be added

The rest is just diligent, even if somewhat tedious, coding:

data strings ;                                                                                                                          
  input string $30.;                                                                                                                    
  cards ;                                                                                                                               
Ford carrot                                                                                                                             
Honda beans                                                                                                                             
south america black beans                                                                                                               
green beans Audi                                                                                                                        
x5 chili pepper lettuce                                                                                                                 
garlic north America                                                                                                                    
central europe bmw                                                                                                                      
lettuce onion X5                                                                                                                        
America pepper                                                                                                                          
X5 BMW Europe                                                                                                                           
Central East Europe                                                                                                                     
bmw X5 america                                                                                                                          
;                                                                                                                                       
run ;                                                                                                                                   
                                                                                                                                        
data variables;                                                                                                                         
   infile datalines delimiter=',' dsd;                                                                                                  
   length cars $6 vegetables $12 regions $14;                                                                                           
   input cars $ vegetables $ regions $;                                                                                                 
   datalines;                                                                                                                           
Audi,beans,America                                                                                                                      
BMW,black beans,Central Europe                                                                                                          
BMW X5,carrot,East Europe                                                                                                               
X5,chili pepper,Europe                                                                                                                  
Honda,green beans,North America                                                                                                         
,lettuce,South America                                                                                                                  
,onion,                                                                                                                                 
,pepper,                                                                                                                                
;                                                                                                                                       
run;                                                                                                                                    
                                                                                                                                        
proc sql noprint ;                                                                                                                      
  select max (length) into :maxvlen from dictionary.columns                                                                             
  where libname="WORK" and memname="VARIABLES"                                                                                          
  ;                                                                                                                                     
quit ;                                                                                                                                  
                                                                                                                                        
data want (drop = _:) ;                                                                                                                 
  set strings ;                                                                                                                         
  if _n_ = 1 then do ;                                                                                                                  
    length _s _z $ &maxvlen ;                                                                                                           
    dcl hash h () ;                                                                                                                     
    h.definekey ("_s") ;                                                                                                                
    h.definedata ("_z", "_i_") ;                                                                                                        
    h.definedone () ;                                                                                                                   
    do until (z) ;                                                                                                                      
      set variables end = z ;                                                                                                           
      array v cars vegetables regions ;                                                                                                 
      do over v ;                                                                                                                       
        if cmiss (v) then continue ;                                                                                                    
        do _i = 1 to countw (v) ;                                                                                                       
          _z = scan (v, _i) ;                                                                                                           
          h.ref (key: upcase (_z), data: _z, data: _i_) ;                                                                               
          do _j = _i + 1 to countw (v) ;                                                                                                
            _z = catx (" ", _z, scan (v, _j)) ;                                                                                         
            h.ref (key: upcase (_z), data: _z, data: _i_) ;                                                                             
          end ;                                                                                                                         
        end ;                                                                                                                           
      end ;                                                                                                                             
    end ;                                                                                                                               
  end ;                                                                                                                                 
  call missing (of v[*]) ;                                                                                                              
  do _i = 1 to countw (string) ;                                                                                                        
    _s = scan (string, _i) ;                                                                                                            
    if h.find (key: upcase (_s)) = 0 then if length (_z) > length (v) then v = _z ;                                                     
    do _j = _i + 1 to countw (string) ;                                                                                                 
      _s = catx (" ", _s, scan (string, _j)) ;                                                                                          
      if h.find (key: upcase (_s)) = 0 then if length (_z) > length (v) then v = _z ;                                                   
    end ;                                                                                                                               
  end ;                                                                                                                                 
run ;                                     

Kind regards

Paul D. 

View solution in original post

19 REPLIES 19
ballardw
Super User

Up until your value of 'a b' this looked moderately routine. However since you are looking for a single value of 'a' from the first line of your Strings data it is going to be a bit difficult to search for just the 'a' except when 'a b' is present in a generic sense.

I ask for clarity sake, does your actual working data have exactly 1 or many Variables values that include other variable values as substrings? If so then you may need to provide more information or a more complex example closer to your actual data.

 

Best might be to find some way to split up that String variable into the component parts that you need if possible so that you have values of 'a' alone instead of 'a x'.

LuciaCekanakova
Obsidian | Level 7

Hi ballardw,

My working data can contain many variable values that include other variable values as substrings.

 

Here's a more complex example:

image.pngimage.pngimage.png

 

You made me realize more things that are necessary here, so all in all I would like:

  • to output the longest match in case of multiple occurences of values from list in a string (e.g. "green beans Audi" outputs vegetables="green beans", not just "beans")
  • to make the scan case-insensitive (for string "bmw x5" the output is cars="BMW X5" in the same case as was in the variables table)
  • to make the code independent of the names of variables and their number - these can be changed and added, so it would be great to have a universal code that can deal with any names and any numebr of variables (such as cars, vegetables etc.)

@ballardw@hashman how would you go about this?

 

In case you want to give it a try, here's the code for data:

data strings ;                                                                                                            
  input string $30.;                                                                                                      
  cards ;                                                                                                                 
Ford carrot
Honda beans
south america black beans
green beans Audi
x5 chili pepper lettuce
garlic north America
central europe bmw
lettuce onion X5
America pepper
X5 BMW Europe
Central East Europe
bmw X5 america
;                                                                                                                         
run ;  

data variables;
   infile datalines delimiter=',' dsd; 
   length cars $6 vegetables $12 regions $14;
   input cars $ vegetables $ regions $;
   datalines;                      
Audi,beans,America
BMW,black beans,Central Europe
BMW X5,carrot,East Europe
X5,chili pepper,Europe
Honda,green beans,North America
,lettuce,South America
,onion,
,pepper,
;
run;

 

Tom
Super User Tom
Super User

I still don't get what you want to do with the multi-word strings in the VARIABLES table.

Do you only want to match 'BMW X5' to STRINGS that contain 'BMW X5' in that order and not to strings that have 'X5 BMW' or 'BMW model X5'?

Do you want to match also to strings that contain only 'BMW' or only 'X5'?

LuciaCekanakova
Obsidian | Level 7

Do you only want to match 'BMW X5' to STRINGS that contain 'BMW X5' in that order and not to strings that have 'X5 BMW' or 'BMW model X5'? YES, exactly.

 

Do you want to match also to strings that contain only 'BMW' or only 'X5'? These are matched to values 'BMW' and 'X5' that are also in the variable list, but not because they are part of 'BMW X5' that is in this list. You basically always scan the string for the exact value in the variable list. And if the value list contains values 'BMW', 'X5' and 'BMW X5', a string like the one in observation #12 would match all of these, but I'd like the output to be only the longest exact match - which is 'BMW X5' in this case. In obs #10 it matches 'X5' and 'BMW', so 'BMW' goes into output.

Tom
Super User Tom
Super User

In that case use the FINDW() function.

proc sql ;
  create table want as
    select string
         , case when (findw(string,cars,' ','sit')) then cars else ' ' end as cars
         , case when (findw(string,vegetables,' ','sit')) then vegetables else ' ' end as vegetables
         , case when (findw(string,regions,' ','sit')) then regions else ' ' end as regions
    from strings,variables
    having not missing(cats(calculated cars,calculated vegetables,calculated regions))
    order by 1,2,3,4
  ;
quit;
LuciaCekanakova
Obsidian | Level 7

This program finds all the matches and creates an observation for each match. How do I make it to select only the longest match for each variable?

Tom
Super User Tom
Super User

Probably easier to do each "variable" separately then.  I added row numbers to your sample dataset to make it easier to see what is happening. And to give a tiebreaker when there are multiple

data strings ;
  input string $30.;
  snum+1;
cards;
Ford carrot
Honda beans
south america black beans
green beans Audi
x5 chili pepper lettuce
garlic north America
central europe bmw
lettuce onion X5
America pepper
X5 BMW Europe
Central East Europe
bmw X5 america
;

data variables;
   infile cards delimiter=',' dsd;
   length cars $6 vegetables $12 regions $14;
   input cars vegetables regions;
   vnum+1;
cards;
Audi,beans,America
BMW,black beans,Central Europe
BMW X5,carrot,East Europe
X5,chili pepper,Europe
Honda,green beans,North America
,lettuce,South America
,onion,
,pepper,
;

proc sql ;
  create table vegetables as
    select snum,string
         , lengthn(vegetables) as length
         , vnum,vegetables
    from strings left join variables
      on findw(string,vegetables,' ','sit')
    order by snum,string,length,snum
  ;
quit;

data vegetables;
  set vegetables;
  by snum;
  if last.snum;
run;

proc print data=vegetables width=min;
run;
Obs    snum    string                       length    vnum    vegetables

  1      1     Ford carrot                     6        3     carrot
  2      2     Honda beans                     5        1     beans
  3      3     south america black beans      11        2     black beans
  4      4     green beans Audi               11        5     green beans
  5      5     x5 chili pepper lettuce        12        4     chili pepper
  6      6     garlic north America            0        .
  7      7     central europe bmw              0        .
  8      8     lettuce onion X5                7        6     lettuce
  9      9     America pepper                  6        8     pepper
 10     10     X5 BMW Europe                   0        .
 11     11     Central East Europe             0        .
 12     12     bmw X5 america                  0        .
LuciaCekanakova
Obsidian | Level 7

For some reason my output from your code looks different:image.png

 

Any idea why? There are no errors or warnings in the log.

Tom
Super User Tom
Super User
Copy the code again as I did change the ordering and the IF statement to make sure it is taking the longest match.
hashman
Ammonite | Level 13

@LuciaCekanakova:

Well, in this case my program in its core remains the same, except:

  • In addition to STRING, each variable from VARIABLES has to be scanned
  • A provisions must be made for storing the (sub) strings from VARIABLES in their original (prop)case for further retrieval
  • Logic for selecting the longest matching substring has to be added

The rest is just diligent, even if somewhat tedious, coding:

data strings ;                                                                                                                          
  input string $30.;                                                                                                                    
  cards ;                                                                                                                               
Ford carrot                                                                                                                             
Honda beans                                                                                                                             
south america black beans                                                                                                               
green beans Audi                                                                                                                        
x5 chili pepper lettuce                                                                                                                 
garlic north America                                                                                                                    
central europe bmw                                                                                                                      
lettuce onion X5                                                                                                                        
America pepper                                                                                                                          
X5 BMW Europe                                                                                                                           
Central East Europe                                                                                                                     
bmw X5 america                                                                                                                          
;                                                                                                                                       
run ;                                                                                                                                   
                                                                                                                                        
data variables;                                                                                                                         
   infile datalines delimiter=',' dsd;                                                                                                  
   length cars $6 vegetables $12 regions $14;                                                                                           
   input cars $ vegetables $ regions $;                                                                                                 
   datalines;                                                                                                                           
Audi,beans,America                                                                                                                      
BMW,black beans,Central Europe                                                                                                          
BMW X5,carrot,East Europe                                                                                                               
X5,chili pepper,Europe                                                                                                                  
Honda,green beans,North America                                                                                                         
,lettuce,South America                                                                                                                  
,onion,                                                                                                                                 
,pepper,                                                                                                                                
;                                                                                                                                       
run;                                                                                                                                    
                                                                                                                                        
proc sql noprint ;                                                                                                                      
  select max (length) into :maxvlen from dictionary.columns                                                                             
  where libname="WORK" and memname="VARIABLES"                                                                                          
  ;                                                                                                                                     
quit ;                                                                                                                                  
                                                                                                                                        
data want (drop = _:) ;                                                                                                                 
  set strings ;                                                                                                                         
  if _n_ = 1 then do ;                                                                                                                  
    length _s _z $ &maxvlen ;                                                                                                           
    dcl hash h () ;                                                                                                                     
    h.definekey ("_s") ;                                                                                                                
    h.definedata ("_z", "_i_") ;                                                                                                        
    h.definedone () ;                                                                                                                   
    do until (z) ;                                                                                                                      
      set variables end = z ;                                                                                                           
      array v cars vegetables regions ;                                                                                                 
      do over v ;                                                                                                                       
        if cmiss (v) then continue ;                                                                                                    
        do _i = 1 to countw (v) ;                                                                                                       
          _z = scan (v, _i) ;                                                                                                           
          h.ref (key: upcase (_z), data: _z, data: _i_) ;                                                                               
          do _j = _i + 1 to countw (v) ;                                                                                                
            _z = catx (" ", _z, scan (v, _j)) ;                                                                                         
            h.ref (key: upcase (_z), data: _z, data: _i_) ;                                                                             
          end ;                                                                                                                         
        end ;                                                                                                                           
      end ;                                                                                                                             
    end ;                                                                                                                               
  end ;                                                                                                                                 
  call missing (of v[*]) ;                                                                                                              
  do _i = 1 to countw (string) ;                                                                                                        
    _s = scan (string, _i) ;                                                                                                            
    if h.find (key: upcase (_s)) = 0 then if length (_z) > length (v) then v = _z ;                                                     
    do _j = _i + 1 to countw (string) ;                                                                                                 
      _s = catx (" ", _s, scan (string, _j)) ;                                                                                          
      if h.find (key: upcase (_s)) = 0 then if length (_z) > length (v) then v = _z ;                                                   
    end ;                                                                                                                               
  end ;                                                                                                                                 
run ;                                     

Kind regards

Paul D. 

LuciaCekanakova
Obsidian | Level 7

Thank you! This is exactly what I need 🙂

The only thing I changed is that I created a macro variable with the list of variable names and then used it in array "v".

hashman
Ammonite | Level 13

@LuciaCekanakova:

>I created a macro variable with the list of variable names and then used it in array "v"<

 

Given that you'd been striving to make the program as little hard coded as possible, this totally makes sense. 

LuciaCekanakova
Obsidian | Level 7

@hashman 

Hi, I discovered 2 more things that aren't working as I expected:

 

  1. When the VARIABLES table contains mulitple-word values, but the partial values are not in the value list (e.g. "South America" is there, but "South" by itself is not) they are also searched for in the STRINGS and if they are found, they are output in the new column (see the column regions in obs #2 and #8). I'd like to stop this from happening adn only search for the whole values if they have multiple-words.
  2. In VARIABLES table it happens that the same value is in multiple columns (e.g. beans, carrot... - I added a new column with the general type of vegetable). When this occurs, the found value is output only in one variable column it was found in, it doesn't repeat in others (see columns vegetable and vegetablegeneral in obs #1,2,8,9). I'd like for these values to repeat in all the output column were they should be base don the original VARIABLES columns.

My output:

image.png

 

Desired output:

image.png

 

Please, do you have any tip on how to modify the code to achieve this?

 

The code so far with new data:

data strings ;                                                                                                                          
  input string $30.;                                                                                                                    
  cards ;                                                                                                                               
Ford carrot                                                                                                                             
Honda beans South Africa                                                                                                                             
south america black beans                                                                                                               
green beans Audi                                                                                                                        
x5 chili pepper                                                                                                   
garlic north America                                                                                                                    
central europe bmw                                                                                                                      
lettuce north onion X5                                                                                                                        
America pepper                                                                                                                     
X5 BMW Europe                                                                                                                           
Central East Europe                                                                                                                     
bmw X5 america                                                                                                                          
;                                                                                                                                       
run ;                                                                                                                                   
                                                                                                                                        
data variables;                                                                                                                         
   infile datalines delimiter=',' dsd;                                                                                                  
   length cars $6 vegetablegeneral $7 vegetable $12 regions $14;                                                                                           
   input cars $ vegetablegeneral $ vegetable $ regions $;                                                                                                 
   datalines;                                                                                                                           
Audi,beans,beans,America                                                                                                                      
BMW,carrot,black beans,Central Europe                                                                                                          
BMW X5,lettuce,carrot,East Europe                                                                                                               
X5,onion,chili pepper,Europe                                                                                                                  
Honda,pepper,green beans,North America                                                                                                         
,,lettuce,South America                                                                                                                  
,,onion,                                                                                                                                 
,,pepper,                                                                                                                                
;                                                                                                                                       
run;                                                                                                                                    
                                                                                                                                        
proc sql noprint ;                                                                                                                      
  select max (length) into :maxvlen from dictionary.columns                                                                             
  where libname="WORK" and memname="VARIABLES"                                                                                          
  ;                                                                                                                                     
quit ;      

proc sql noprint;                                                                                                                      
  select name into :varlist separated by ' ' from dictionary.columns
  where libname="WORK" and memname="VARIABLES";
quit;   
                                                                                                                                        
data want (drop = _:) ;                                                                                                                 
  set strings ;                                                                                                                         
  if _n_ = 1 then do ;                                                                                                                  
    length _s _z $ &maxvlen ;                                                                                                           
    dcl hash h () ;                                                                                                                     
    h.definekey ("_s") ;                                                                                                                
    h.definedata ("_z", "_i_") ;                                                                                                        
    h.definedone () ;                                                                                                                   
    do until (z) ;                                                                                                                      
      set variables end = z ;                                                                                                           
      array v &varlist ;                                                                                                 
      do over v ;                                                                                                                       
        if cmiss (v) then continue ;                                                                                                    
        do _i = 1 to countw (v) ;                                                                                                       
          _z = scan (v, _i) ;                                                                                                           
          h.ref (key: upcase (_z), data: _z, data: _i_) ;                                                                               
          do _j = _i + 1 to countw (v) ;                                                                                                
            _z = catx (" ", _z, scan (v, _j)) ;                                                                                         
            h.ref (key: upcase (_z), data: _z, data: _i_) ;                                                                             
          end ;                                                                                                                         
        end ;                                                                                                                           
      end ;                                                                                                                             
    end ;                                                                                                                               
  end ;                                                                                                                                 
  call missing (of v[*]) ;                                                                                                              
  do _i = 1 to countw (string) ;                                                                                                        
    _s = scan (string, _i) ;                                                                                                            
    if h.find (key: upcase (_s)) = 0 then if length (_z) > length (v) then v = _z ;                                                     
    do _j = _i + 1 to countw (string) ;                                                                                                 
      _s = catx (" ", _s, scan (string, _j)) ;                                                                                          
      if h.find (key: upcase (_s)) = 0 then if length (_z) > length (v) then v = _z ;                                                   
    end ;                                                                                                                               
  end ;                                                                                                                                 
run ;    

 

hashman
Ammonite | Level 13

@LuciaCekanakova: Will take a look when I get a piece of free quality time. 

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 19 replies
  • 5836 views
  • 1 like
  • 6 in conversation