Re: how to sum the values of similar words?

hx · Posted 11-28-2016 03:26 PM

Hi,

I have a dataset like

words	d1	d2	d3	d4
beast	.	5	4	8
beasts	1	7	3	6
apopt	2	.	.	3
apopto	4	8	7	5
succinate	5	9	.	3
succinates	6	6	.	2

I want my result looks like

words	d1	d2	d3	d4
beasts	1	12	7	14
apopto	6	8	7	8
succinates	11	15	.	5

The original dataset is much larger, there are thousands of values of words. And many of them have typos but I need to sum the values of similar words. I am thinking using something like 'sound like' operator to calculate the sum, but it seems only work in where statement. I really have no idea what to do, can some one help me out?

Shmuel · Posted 11-28-2016 04:04 PM

You may check next links documentation on SOUNDEX function:

http://support.sas.com/resources/papers/proceedings12/122-2012.pdf

http://support.sas.com/documentation/cdl/en/sqlproc/69822/HTML/default/viewer.htm#p0a62rd151ctown1x3...

I suggest you to run next code to check how soundex realy can help you:

proc sql;

create table test as select

words, soundex(words) as sounds

from have

sort by sounds;

quit;

hx · Posted 11-28-2016 05:19 PM

Thank you Shmuel! the thing is I need to calculate the sum of d1 d2s according to the words. And I found that if I use soundex, it will have questions such that word like Acid,acute and aged would be treated as the same, but that's not what I want. So i'm thinking using like creating some kind of loop to get this done, yet still no progress..

Patrick · Posted 11-28-2016 06:37 PM

I guess the question is "how do you define similar?". It could be like soundex() returns the same value.

You could also have a look into function complev() and compged()

In the end of the day: You need to cluster word into groups and then you sum by the groups.

I'm not sure if SAS offers tools for such word clustering as part of SAS Text Analytics but if you've got the license then sure worth checking.

If you don't find anything in the SAS toolbox available to you then eventually consider using some other tool for clustering/standardizing your words. I'm sure there is a lot out there.

https://github.com/mattalcock/blog/blob/master/2012/12/5/python-spell-checker.rst

https://lucene.apache.org/core/3_5_0/api/contrib-spellchecker/org/apache/lucene/search/spell/SpellCh...

Shmuel · Posted 11-28-2016 08:57 PM

As @Patrick said:

In the end of the day: You need to cluster word into groups and then you sum by the groups.

In order to cluster words, I would check results of several methods by running:

 proc sql;
          create table temp as select
             words, 
             function(words) as clusterx
             from have
           sort by sounds;
quit;

where function is replaced by any suggested function or even a combination of them:

1) compress(words,'aeiou') /* suppressing voels */

2) substr(words,1,4) /* check first 4 chatacters only */

in order to use complev(), compged() functions you need to gather at least two strings in order to compare.

that can be done by:

     proc sort data=have; by words; run;
    data temp;
       set have;
            length prev_word $20;
             retain prev_word;
             if _n_  = 1 then prev_word = words;
            else do;
                 cluster1 = complev(words, prev_word);
                 cluster2 = compged(words, prev_word);
                 output;
                 prev_word = words;
            end;
   run;

then check which one of the two - cluster1 or cluster2 - fits more to desired results.

Finally you need to decide which clustering method gives the best results, then you can sum by

proc means data=temp;
     class clusterx;
     var d1 - d4;
     output out=want sum=;
run;

Ksharp · Posted 11-29-2016 12:44 AM


Firstly you need find out which name belong to the same cluster.
Once you have done that (check WANT dataset), 
you can do the math to calculate SUM. I think it is easy for you.



data x;
infile cards truncover expandtabs;
input words	: $40. d1	d2	d3	d4;
id+1;
cards;
beast	.	5	4	8
beasts	1	7	3	6
apopt	2	.	.	3
apopto	4	8	7	5
succinate	5	9	.	3
succinates	6	6	.	2
;
run;

proc sql;
create table have as
select a.words as from,b.words as to,spedis(a.words,b.words) as distance
 from x as a,x as b
  where a.id ne b.id
   group by a.words
    having calculated distance=min(calculated distance);
quit;


data full;
  set have end=last;
  if _n_ eq 1 then do;
   declare hash h();
    h.definekey('node');
     h.definedata('node');
     h.definedone();
  end;
  output;
  node=from; h.replace();
  from=to; to=node;
  output;
  node=from; h.replace();
  if last then h.output(dataset:'node');
  drop node;
run;


data want(keep=node household);
declare hash ha(ordered:'a');
declare hiter hi('ha');
ha.definekey('count');
ha.definedata('last');
ha.definedone();
declare hash _ha(hashexp: 20);
_ha.definekey('key');
_ha.definedone();

if 0 then set full;
declare hash from_to(dataset:'full(where=(from is not missing and to is not missing))',hashexp:20,multidata:'y');
 from_to.definekey('from');
 from_to.definedata('to');
 from_to.definedone();

if 0 then set node;
declare hash no(dataset:'node');
declare hiter hi_no('no');
 no.definekey('node');
 no.definedata('node');
 no.definedone();
 

do while(hi_no.next()=0);
 household+1; output;
 count=1;
 key=node;_ha.add();
 last=node;ha.add();
 rc=hi.first();
 do while(rc=0);
   from=last;rx=from_to.find();
   do while(rx=0);
     key=to;ry=_ha.check();
      if ry ne 0 then do;
       node=to;output;rr=no.remove(key:node);
       key=to;_ha.add();
       count+1;
       last=to;ha.add();
      end;
      rx=from_to.find_next();
   end;
   rc=hi.next();
end;
ha.clear();_ha.clear();
end;
stop;
run;

KachiM · Posted 11-29-2016 07:51 AM

Your example shows that the first 5 characters of WORDS looks to be good. You may choose 4 or 3 or 2 characters from any position using susbstr() function.

In the following code, I have declared W as 5-character string to which when I assign WORDS takes the first leftmost

5 characters. Then DOW loop is used to get the sums. You may pre-sort the HAVE data set by W. In this example sorting is not a must as words are already GROUPED(so NOTSORTED option is used).

data have;
length w $5;
input words :$10. d1 d2 d3 d4;
w = words;
datalines;
beast .  5  4  8
beasts   1  7  3  6
apopt 2  .  .  3
apopto   4  8  7  5
succinate   5  9  .  3
succinates  6  6  .  2
;
run;

data need;
   do until(last.w);
      set have ;
      by w notsorted;

      array sum[4];
      array k[4] d1 - d4;
      if first.w then call missing(of sum[*]);
      do i = 1 to dim(k); sum[i] + k[i]; end;
   end;
keep words sum:;
run;

how to sum the values of similar words?