BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
JonkeyJonkerson
Calcite | Level 5

Hi team,

I'm looking to combine rows based on whether or not a text variable (always just a single word) appears within another text variable. E.G. I want to treat "apple" and "apples" as the same word and take the sum of the quant data for each. The resulting word that is displayed will always be the largest of the words that were combined.

Wordquantity
apples2
apple3
orange5
banana3

to this:

Wordquantity
apples5
orange5
banana3

 

Note that I'm not looking to combine misspelled words, only words that contain other words.

Thanks!

1 ACCEPTED SOLUTION

Accepted Solutions
hashman
Ammonite | Level 13

@JonkeyJonkerson :

As other contributors have noted, this task falls into the category of fuzzy match problems. Though their complexity varies and can require quite sophisticated methods to deal with, in your present case the fuzzy matching rule is pretty simple and straightforward. So, it can be approached, for example, this way:

  1. Remap WORD in every observation by comparing it with WORDs in the rest of the file. If the latter includes the former, reassign the latter to the former. This way, at the end of the remapping the longest "inclusive" WORD will take over. 
  2. Summarize using the remapped WORD as a key.

#1 can be done using an array or a hash table used as an array .The latter has the advantage of not having to size it up ahead of time or set its dimension as "big enough", so this is opted for below.

#2 can be done in a separate step using SQL, MEANS, or the DATA step. Alternatively, it can be done in the same step where remapping is done by using another hash object instance.

 

If you prefer the 2-step approach, it can be rendered, for example, thus:

data have ;               
  input word $10. amount ;
  cards ;                 
fruits     2              
orange     1              
apple      1              
fruit      1              
applesplus 3              
fruitsies  3              
apples     2              
;                         
run ;                     

data remap (drop = word rename = (_w = word)) ;                                
  if _n_ = 1 then do ;                                                         
    dcl hash h (dataset:"have") ;                                              
    h.definekey ("word") ;                                                     
    h.definedone () ;                                                          
    dcl hiter hi ("h") ;                                                       
  end ;                                                                        
  set have ;                                                                   
  _w = word ;                                                                  
  do while (hi.next() = 0) ;                                                   
    if find (word, strip (_w)) then _w = word ;                                
  end ;                                                                        
run ;                                                                          
                                                                               
proc sql ;                                                                     
  create table want as select word, sum (amount) as sum from remap group word ;
quit ;                                                                         

If you prefer to do everything in a single step, you could use:

data _null_ ;                                    
  dcl hash h (dataset:"have") ;                  
  h.definekey  ("word") ;                        
  h.definedone () ;                              
  dcl hiter hi ("h") ;                           
                                                 
  dcl hash s (ordered:"A") ;                     
  s.definekey  ("word") ;                        
  s.definedata ("word", "sum") ;                 
  s.definedone () ;                              
                                                 
  do until (z) ;                                 
    set have end = z ;                           
    _w = word ;                                  
    do while (hi.next() = 0) ;                   
      if find (word, strip (_w)) then _w = word ;
    end ;                                        
    word = _w ;                                  
    if s.find() ne 0 then sum = amount ;         
    else                  sum + amount ;         
    s.replace() ;                                
  end ;                                          
                                                 
  s.output (dataset:"want") ;                    
run ;                                            

Kind regards

Paul D.    

View solution in original post

12 REPLIES 12
PeterClemmensen
Tourmaline | Level 20

What if the Word 'Apples' was yet again contained in another word. eg 'Appless'?

JonkeyJonkerson
Calcite | Level 5
Hi! In this case 'Appless' would become the dominant word (even though it's a misspelling). And all other text that could be contained within 'appless' (apple/apples) will be collapsed into 'appless'
andreas_lds
Jade | Level 19

And what about "pineapple" and "apple"?

Combining those could be wrong .... please explain the rules to apply, add some more examples.

novinosrin
Tourmaline | Level 20

@andreas_lds   =  Brilliance at Best!

 

I wish I could think like that. Well , only some are born with it. I suppose! Kudos!

Reeza
Super User
No one is born with it, it's something you learn 🙂 Edge cases and what will break your code...
hashman
Ammonite | Level 13

@andreas_lds : 

 

It's a keen and cute observation. Yet I suspect the OP is using the sample "words" in a merely syntactic sense, i.e. without the idea that they represent botanical terms and thus should belong to the same botanical class in case they "match" by containing one another and treated this way for the purposes of aggregation.

 

Kind regards

Paul D.

Ksharp
Super User

That is fuzz matched problem . It is very difficult , could try SPEDIS() , COMPLV() ...... some functions which check spell or edit distance between two words.

Or if you have SAS DataFlux product, that would deduct to a piece of cake .

novinosrin
Tourmaline | Level 20

Xia, Interesting. -"Or if you have SAS DataFlux product, that would deduct to a piece of cake"

 

What algorithm is embedded in dataflux for fuzzy matches. Does it make the closest or best possible approximation or even absolute? 

 

Any thoughts on the likelihood of mismatch or something slipping through the cracks? I vaguely remember @ballardw  responding something similar stressing the importance of text mining software along with fuzzy merge. Well well, I regret not taking notes. 

Ksharp
Super User

Nov,

Sorry.I don't know. But I think that must be the high cutting-edge technology in DataFlux .

That is why SAS pay so much money to buy DataFlux .. 

Reeza
Super User
dataflux uses a primarily rules based algorithm to do the matching. There's an 'open source' version for names called LinkKing. Someone once posted a link if you feel like going through my twitter feed.
hashman
Ammonite | Level 13

@JonkeyJonkerson :

As other contributors have noted, this task falls into the category of fuzzy match problems. Though their complexity varies and can require quite sophisticated methods to deal with, in your present case the fuzzy matching rule is pretty simple and straightforward. So, it can be approached, for example, this way:

  1. Remap WORD in every observation by comparing it with WORDs in the rest of the file. If the latter includes the former, reassign the latter to the former. This way, at the end of the remapping the longest "inclusive" WORD will take over. 
  2. Summarize using the remapped WORD as a key.

#1 can be done using an array or a hash table used as an array .The latter has the advantage of not having to size it up ahead of time or set its dimension as "big enough", so this is opted for below.

#2 can be done in a separate step using SQL, MEANS, or the DATA step. Alternatively, it can be done in the same step where remapping is done by using another hash object instance.

 

If you prefer the 2-step approach, it can be rendered, for example, thus:

data have ;               
  input word $10. amount ;
  cards ;                 
fruits     2              
orange     1              
apple      1              
fruit      1              
applesplus 3              
fruitsies  3              
apples     2              
;                         
run ;                     

data remap (drop = word rename = (_w = word)) ;                                
  if _n_ = 1 then do ;                                                         
    dcl hash h (dataset:"have") ;                                              
    h.definekey ("word") ;                                                     
    h.definedone () ;                                                          
    dcl hiter hi ("h") ;                                                       
  end ;                                                                        
  set have ;                                                                   
  _w = word ;                                                                  
  do while (hi.next() = 0) ;                                                   
    if find (word, strip (_w)) then _w = word ;                                
  end ;                                                                        
run ;                                                                          
                                                                               
proc sql ;                                                                     
  create table want as select word, sum (amount) as sum from remap group word ;
quit ;                                                                         

If you prefer to do everything in a single step, you could use:

data _null_ ;                                    
  dcl hash h (dataset:"have") ;                  
  h.definekey  ("word") ;                        
  h.definedone () ;                              
  dcl hiter hi ("h") ;                           
                                                 
  dcl hash s (ordered:"A") ;                     
  s.definekey  ("word") ;                        
  s.definedata ("word", "sum") ;                 
  s.definedone () ;                              
                                                 
  do until (z) ;                                 
    set have end = z ;                           
    _w = word ;                                  
    do while (hi.next() = 0) ;                   
      if find (word, strip (_w)) then _w = word ;
    end ;                                        
    word = _w ;                                  
    if s.find() ne 0 then sum = amount ;         
    else                  sum + amount ;         
    s.replace() ;                                
  end ;                                          
                                                 
  s.output (dataset:"want") ;                    
run ;                                            

Kind regards

Paul D.    

JonkeyJonkerson
Calcite | Level 5

Brilliant! Well thought out and articulated. Thank you!

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 12 replies
  • 1210 views
  • 5 likes
  • 7 in conversation