Solved: Re: Collapsing rows containing similar text variables

JonkeyJonkerson · Posted 07-09-2019 09:18 AM

Hi team,

I'm looking to combine rows based on whether or not a text variable (always just a single word) appears within another text variable. E.G. I want to treat "apple" and "apples" as the same word and take the sum of the quant data for each. The resulting word that is displayed will always be the largest of the words that were combined.

Word	quantity
apples	2
apple	3
orange	5
banana	3

to this:

Word	quantity
apples	5
orange	5
banana	3

Note that I'm not looking to combine misspelled words, only words that contain other words.

Thanks!

hashman · Posted 07-10-2019 03:38 PM

@JonkeyJonkerson :

As other contributors have noted, this task falls into the category of fuzzy match problems. Though their complexity varies and can require quite sophisticated methods to deal with, in your present case the fuzzy matching rule is pretty simple and straightforward. So, it can be approached, for example, this way:

Remap WORD in every observation by comparing it with WORDs in the rest of the file. If the latter includes the former, reassign the latter to the former. This way, at the end of the remapping the longest "inclusive" WORD will take over.
Summarize using the remapped WORD as a key.

#1 can be done using an array or a hash table used as an array .The latter has the advantage of not having to size it up ahead of time or set its dimension as "big enough", so this is opted for below.

#2 can be done in a separate step using SQL, MEANS, or the DATA step. Alternatively, it can be done in the same step where remapping is done by using another hash object instance.

If you prefer the 2-step approach, it can be rendered, for example, thus:

data have ;               
  input word $10. amount ;
  cards ;                 
fruits     2              
orange     1              
apple      1              
fruit      1              
applesplus 3              
fruitsies  3              
apples     2              
;                         
run ;                     

data remap (drop = word rename = (_w = word)) ;                                
  if _n_ = 1 then do ;                                                         
    dcl hash h (dataset:"have") ;                                              
    h.definekey ("word") ;                                                     
    h.definedone () ;                                                          
    dcl hiter hi ("h") ;                                                       
  end ;                                                                        
  set have ;                                                                   
  _w = word ;                                                                  
  do while (hi.next() = 0) ;                                                   
    if find (word, strip (_w)) then _w = word ;                                
  end ;                                                                        
run ;                                                                          
                                                                               
proc sql ;                                                                     
  create table want as select word, sum (amount) as sum from remap group word ;
quit ;

If you prefer to do everything in a single step, you could use:

data _null_ ;                                    
  dcl hash h (dataset:"have") ;                  
  h.definekey  ("word") ;                        
  h.definedone () ;                              
  dcl hiter hi ("h") ;                           
                                                 
  dcl hash s (ordered:"A") ;                     
  s.definekey  ("word") ;                        
  s.definedata ("word", "sum") ;                 
  s.definedone () ;                              
                                                 
  do until (z) ;                                 
    set have end = z ;                           
    _w = word ;                                  
    do while (hi.next() = 0) ;                   
      if find (word, strip (_w)) then _w = word ;
    end ;                                        
    word = _w ;                                  
    if s.find() ne 0 then sum = amount ;         
    else                  sum + amount ;         
    s.replace() ;                                
  end ;                                          
                                                 
  s.output (dataset:"want") ;                    
run ;

Kind regards

Paul D.

View solution in original post

PeterClemmensen · Posted 07-09-2019 09:20 AM

What if the Word 'Apples' was yet again contained in another word. eg 'Appless'?

The DATA to DATA Step Macro
Blog: SASnrd

JonkeyJonkerson · Posted 07-09-2019 09:22 AM

Hi! In this case 'Appless' would become the dominant word (even though it's a misspelling). And all other text that could be contained within 'appless' (apple/apples) will be collapsed into 'appless'

andreas_lds · Posted 07-09-2019 09:31 AM

And what about "pineapple" and "apple"?

Combining those could be wrong .... please explain the rules to apply, add some more examples.

novinosrin · Posted 07-09-2019 09:36 AM

@andreas_lds = Brilliance at Best!

I wish I could think like that. Well , only some are born with it. I suppose! Kudos!

Reeza · Posted 07-09-2019 10:24 AM

No one is born with it, it's something you learn 🙂 Edge cases and what will break your code...

hashman · Posted 07-09-2019 02:42 PM

@andreas_lds :

It's a keen and cute observation. Yet I suspect the OP is using the sample "words" in a merely syntactic sense, i.e. without the idea that they represent botanical terms and thus should belong to the same botanical class in case they "match" by containing one another and treated this way for the purposes of aggregation.

Kind regards

Paul D.

Ksharp · Posted 07-10-2019 08:09 AM

That is fuzz matched problem . It is very difficult , could try SPEDIS() , COMPLV() ...... some functions which check spell or edit distance between two words.

Or if you have SAS DataFlux product, that would deduct to a piece of cake .

novinosrin · Posted 07-10-2019 08:23 AM

Xia, Interesting. -"Or if you have SAS DataFlux product, that would deduct to a piece of cake"

What algorithm is embedded in dataflux for fuzzy matches. Does it make the closest or best possible approximation or even absolute?

Any thoughts on the likelihood of mismatch or something slipping through the cracks? I vaguely remember @ballardw responding something similar stressing the importance of text mining software along with fuzzy merge. Well well, I regret not taking notes.

Ksharp · Posted 07-10-2019 08:37 AM

Nov,

Sorry.I don't know. But I think that must be the high cutting-edge technology in DataFlux .

That is why SAS pay so much money to buy DataFlux ..

Reeza · Posted 07-10-2019 11:03 AM

dataflux uses a primarily rules based algorithm to do the matching. There's an 'open source' version for names called LinkKing. Someone once posted a link if you feel like going through my twitter feed.

hashman · Posted 07-10-2019 03:38 PM

@JonkeyJonkerson :

As other contributors have noted, this task falls into the category of fuzzy match problems. Though their complexity varies and can require quite sophisticated methods to deal with, in your present case the fuzzy matching rule is pretty simple and straightforward. So, it can be approached, for example, this way:

Remap WORD in every observation by comparing it with WORDs in the rest of the file. If the latter includes the former, reassign the latter to the former. This way, at the end of the remapping the longest "inclusive" WORD will take over.
Summarize using the remapped WORD as a key.

#1 can be done using an array or a hash table used as an array .The latter has the advantage of not having to size it up ahead of time or set its dimension as "big enough", so this is opted for below.

#2 can be done in a separate step using SQL, MEANS, or the DATA step. Alternatively, it can be done in the same step where remapping is done by using another hash object instance.

If you prefer the 2-step approach, it can be rendered, for example, thus:

data have ;               
  input word $10. amount ;
  cards ;                 
fruits     2              
orange     1              
apple      1              
fruit      1              
applesplus 3              
fruitsies  3              
apples     2              
;                         
run ;                     

data remap (drop = word rename = (_w = word)) ;                                
  if _n_ = 1 then do ;                                                         
    dcl hash h (dataset:"have") ;                                              
    h.definekey ("word") ;                                                     
    h.definedone () ;                                                          
    dcl hiter hi ("h") ;                                                       
  end ;                                                                        
  set have ;                                                                   
  _w = word ;                                                                  
  do while (hi.next() = 0) ;                                                   
    if find (word, strip (_w)) then _w = word ;                                
  end ;                                                                        
run ;                                                                          
                                                                               
proc sql ;                                                                     
  create table want as select word, sum (amount) as sum from remap group word ;
quit ;

If you prefer to do everything in a single step, you could use:

data _null_ ;                                    
  dcl hash h (dataset:"have") ;                  
  h.definekey  ("word") ;                        
  h.definedone () ;                              
  dcl hiter hi ("h") ;                           
                                                 
  dcl hash s (ordered:"A") ;                     
  s.definekey  ("word") ;                        
  s.definedata ("word", "sum") ;                 
  s.definedone () ;                              
                                                 
  do until (z) ;                                 
    set have end = z ;                           
    _w = word ;                                  
    do while (hi.next() = 0) ;                   
      if find (word, strip (_w)) then _w = word ;
    end ;                                        
    word = _w ;                                  
    if s.find() ne 0 then sum = amount ;         
    else                  sum + amount ;         
    s.replace() ;                                
  end ;                                          
                                                 
  s.output (dataset:"want") ;                    
run ;

Kind regards

Paul D.

JonkeyJonkerson · Posted 07-10-2019 04:06 PM

Brilliant! Well thought out and articulated. Thank you!

Ready to join fellow brilliant minds for the SAS Hackathon?

Classroom Training Available!