Hi,
I have a dataset like
words | d1 | d2 | d3 | d4 |
beast | . | 5 | 4 | 8 |
beasts | 1 | 7 | 3 | 6 |
apopt | 2 | . | . | 3 |
apopto | 4 | 8 | 7 | 5 |
succinate | 5 | 9 | . | 3 |
succinates | 6 | 6 | . | 2 |
I want my result looks like
words | d1 | d2 | d3 | d4 |
beasts | 1 | 12 | 7 | 14 |
apopto | 6 | 8 | 7 | 8 |
succinates | 11 | 15 | . | 5 |
The original dataset is much larger, there are thousands of values of words. And many of them have typos but I need to sum the values of similar words. I am thinking using something like 'sound like' operator to calculate the sum, but it seems only work in where statement. I really have no idea what to do, can some one help me out?
You may check next links documentation on SOUNDEX function:
http://support.sas.com/resources/papers/proceedings12/122-2012.pdf
I suggest you to run next code to check how soundex realy can help you:
proc sql;
create table test as select
words, soundex(words) as sounds
from have
sort by sounds;
quit;
I guess the question is "how do you define similar?". It could be like soundex() returns the same value.
You could also have a look into function complev() and compged()
In the end of the day: You need to cluster word into groups and then you sum by the groups.
I'm not sure if SAS offers tools for such word clustering as part of SAS Text Analytics but if you've got the license then sure worth checking.
If you don't find anything in the SAS toolbox available to you then eventually consider using some other tool for clustering/standardizing your words. I'm sure there is a lot out there.
https://github.com/mattalcock/blog/blob/master/2012/12/5/python-spell-checker.rst
As @Patrick said:
In the end of the day: You need to cluster word into groups and then you sum by the groups.
In order to cluster words, I would check results of several methods by running:
proc sql;
create table temp as select
words,
function(words) as clusterx
from have
sort by sounds;
quit;
where function is replaced by any suggested function or even a combination of them:
1) compress(words,'aeiou') /* suppressing voels */
2) substr(words,1,4) /* check first 4 chatacters only */
in order to use complev(), compged() functions you need to gather at least two strings in order to compare.
that can be done by:
proc sort data=have; by words; run;
data temp;
set have;
length prev_word $20;
retain prev_word;
if _n_ = 1 then prev_word = words;
else do;
cluster1 = complev(words, prev_word);
cluster2 = compged(words, prev_word);
output;
prev_word = words;
end;
run;
then check which one of the two - cluster1 or cluster2 - fits more to desired results.
Finally you need to decide which clustering method gives the best results, then you can sum by
proc means data=temp;
class clusterx;
var d1 - d4;
output out=want sum=;
run;
Firstly you need find out which name belong to the same cluster. Once you have done that (check WANT dataset), you can do the math to calculate SUM. I think it is easy for you. data x; infile cards truncover expandtabs; input words : $40. d1 d2 d3 d4; id+1; cards; beast . 5 4 8 beasts 1 7 3 6 apopt 2 . . 3 apopto 4 8 7 5 succinate 5 9 . 3 succinates 6 6 . 2 ; run; proc sql; create table have as select a.words as from,b.words as to,spedis(a.words,b.words) as distance from x as a,x as b where a.id ne b.id group by a.words having calculated distance=min(calculated distance); quit; data full; set have end=last; if _n_ eq 1 then do; declare hash h(); h.definekey('node'); h.definedata('node'); h.definedone(); end; output; node=from; h.replace(); from=to; to=node; output; node=from; h.replace(); if last then h.output(dataset:'node'); drop node; run; data want(keep=node household); declare hash ha(ordered:'a'); declare hiter hi('ha'); ha.definekey('count'); ha.definedata('last'); ha.definedone(); declare hash _ha(hashexp: 20); _ha.definekey('key'); _ha.definedone(); if 0 then set full; declare hash from_to(dataset:'full(where=(from is not missing and to is not missing))',hashexp:20,multidata:'y'); from_to.definekey('from'); from_to.definedata('to'); from_to.definedone(); if 0 then set node; declare hash no(dataset:'node'); declare hiter hi_no('no'); no.definekey('node'); no.definedata('node'); no.definedone(); do while(hi_no.next()=0); household+1; output; count=1; key=node;_ha.add(); last=node;ha.add(); rc=hi.first(); do while(rc=0); from=last;rx=from_to.find(); do while(rx=0); key=to;ry=_ha.check(); if ry ne 0 then do; node=to;output;rr=no.remove(key:node); key=to;_ha.add(); count+1; last=to;ha.add(); end; rx=from_to.find_next(); end; rc=hi.next(); end; ha.clear();_ha.clear(); end; stop; run;
Your example shows that the first 5 characters of WORDS looks to be good. You may choose 4 or 3 or 2 characters from any position using susbstr() function.
In the following code, I have declared W as 5-character string to which when I assign WORDS takes the first leftmost
5 characters. Then DOW loop is used to get the sums. You may pre-sort the HAVE data set by W. In this example sorting is not a must as words are already GROUPED(so NOTSORTED option is used).
data have;
length w $5;
input words :$10. d1 d2 d3 d4;
w = words;
datalines;
beast . 5 4 8
beasts 1 7 3 6
apopt 2 . . 3
apopto 4 8 7 5
succinate 5 9 . 3
succinates 6 6 . 2
;
run;
data need;
do until(last.w);
set have ;
by w notsorted;
array sum[4];
array k[4] d1 - d4;
if first.w then call missing(of sum[*]);
do i = 1 to dim(k); sum[i] + k[i]; end;
end;
keep words sum:;
run;
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.