Hi team,
I'm looking to combine rows based on whether or not a text variable (always just a single word) appears within another text variable. E.G. I want to treat "apple" and "apples" as the same word and take the sum of the quant data for each. The resulting word that is displayed will always be the largest of the words that were combined.
Word | quantity |
apples | 2 |
apple | 3 |
orange | 5 |
banana | 3 |
to this:
Word | quantity |
apples | 5 |
orange | 5 |
banana | 3 |
Note that I'm not looking to combine misspelled words, only words that contain other words.
Thanks!
As other contributors have noted, this task falls into the category of fuzzy match problems. Though their complexity varies and can require quite sophisticated methods to deal with, in your present case the fuzzy matching rule is pretty simple and straightforward. So, it can be approached, for example, this way:
#1 can be done using an array or a hash table used as an array .The latter has the advantage of not having to size it up ahead of time or set its dimension as "big enough", so this is opted for below.
#2 can be done in a separate step using SQL, MEANS, or the DATA step. Alternatively, it can be done in the same step where remapping is done by using another hash object instance.
If you prefer the 2-step approach, it can be rendered, for example, thus:
data have ;
input word $10. amount ;
cards ;
fruits 2
orange 1
apple 1
fruit 1
applesplus 3
fruitsies 3
apples 2
;
run ;
data remap (drop = word rename = (_w = word)) ;
if _n_ = 1 then do ;
dcl hash h (dataset:"have") ;
h.definekey ("word") ;
h.definedone () ;
dcl hiter hi ("h") ;
end ;
set have ;
_w = word ;
do while (hi.next() = 0) ;
if find (word, strip (_w)) then _w = word ;
end ;
run ;
proc sql ;
create table want as select word, sum (amount) as sum from remap group word ;
quit ;
If you prefer to do everything in a single step, you could use:
data _null_ ;
dcl hash h (dataset:"have") ;
h.definekey ("word") ;
h.definedone () ;
dcl hiter hi ("h") ;
dcl hash s (ordered:"A") ;
s.definekey ("word") ;
s.definedata ("word", "sum") ;
s.definedone () ;
do until (z) ;
set have end = z ;
_w = word ;
do while (hi.next() = 0) ;
if find (word, strip (_w)) then _w = word ;
end ;
word = _w ;
if s.find() ne 0 then sum = amount ;
else sum + amount ;
s.replace() ;
end ;
s.output (dataset:"want") ;
run ;
Kind regards
Paul D.
What if the Word 'Apples' was yet again contained in another word. eg 'Appless'?
And what about "pineapple" and "apple"?
Combining those could be wrong .... please explain the rules to apply, add some more examples.
@andreas_lds = Brilliance at Best!
I wish I could think like that. Well , only some are born with it. I suppose! Kudos!
It's a keen and cute observation. Yet I suspect the OP is using the sample "words" in a merely syntactic sense, i.e. without the idea that they represent botanical terms and thus should belong to the same botanical class in case they "match" by containing one another and treated this way for the purposes of aggregation.
Kind regards
Paul D.
That is fuzz matched problem . It is very difficult , could try SPEDIS() , COMPLV() ...... some functions which check spell or edit distance between two words.
Or if you have SAS DataFlux product, that would deduct to a piece of cake .
Xia, Interesting. -"Or if you have SAS DataFlux product, that would deduct to a piece of cake"
What algorithm is embedded in dataflux for fuzzy matches. Does it make the closest or best possible approximation or even absolute?
Any thoughts on the likelihood of mismatch or something slipping through the cracks? I vaguely remember @ballardw responding something similar stressing the importance of text mining software along with fuzzy merge. Well well, I regret not taking notes.
Nov,
Sorry.I don't know. But I think that must be the high cutting-edge technology in DataFlux .
That is why SAS pay so much money to buy DataFlux ..
As other contributors have noted, this task falls into the category of fuzzy match problems. Though their complexity varies and can require quite sophisticated methods to deal with, in your present case the fuzzy matching rule is pretty simple and straightforward. So, it can be approached, for example, this way:
#1 can be done using an array or a hash table used as an array .The latter has the advantage of not having to size it up ahead of time or set its dimension as "big enough", so this is opted for below.
#2 can be done in a separate step using SQL, MEANS, or the DATA step. Alternatively, it can be done in the same step where remapping is done by using another hash object instance.
If you prefer the 2-step approach, it can be rendered, for example, thus:
data have ;
input word $10. amount ;
cards ;
fruits 2
orange 1
apple 1
fruit 1
applesplus 3
fruitsies 3
apples 2
;
run ;
data remap (drop = word rename = (_w = word)) ;
if _n_ = 1 then do ;
dcl hash h (dataset:"have") ;
h.definekey ("word") ;
h.definedone () ;
dcl hiter hi ("h") ;
end ;
set have ;
_w = word ;
do while (hi.next() = 0) ;
if find (word, strip (_w)) then _w = word ;
end ;
run ;
proc sql ;
create table want as select word, sum (amount) as sum from remap group word ;
quit ;
If you prefer to do everything in a single step, you could use:
data _null_ ;
dcl hash h (dataset:"have") ;
h.definekey ("word") ;
h.definedone () ;
dcl hiter hi ("h") ;
dcl hash s (ordered:"A") ;
s.definekey ("word") ;
s.definedata ("word", "sum") ;
s.definedone () ;
do until (z) ;
set have end = z ;
_w = word ;
do while (hi.next() = 0) ;
if find (word, strip (_w)) then _w = word ;
end ;
word = _w ;
if s.find() ne 0 then sum = amount ;
else sum + amount ;
s.replace() ;
end ;
s.output (dataset:"want") ;
run ;
Kind regards
Paul D.
Brilliant! Well thought out and articulated. Thank you!
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.