DATA Step, Macro, Functions and more

how to sum the values of similar words?

Reply
Occasional Contributor hx
Occasional Contributor
Posts: 17

how to sum the values of similar words?

Hi,

I have a dataset like

wordsd1d2d3d4
beast.548
beasts1736
apopt2..3
apopto4875
succinate59.3
succinates66.2

I want my result looks like

wordsd1d2d3d4
beasts112714
apopto6878
succinates1115.5

The original dataset is much larger, there are thousands of values of words. And many of them have typos but I need to sum the values of similar words. I am thinking using something like 'sound like' operator to calculate the sum, but it seems only work in where statement. I really have no idea what to do, can some one help me out?

 

Trusted Advisor
Posts: 1,571

Re: how to sum the values of similar words?

You may check next links documentation on SOUNDEX function:

 

http://support.sas.com/resources/papers/proceedings12/122-2012.pdf

 

http://support.sas.com/documentation/cdl/en/sqlproc/69822/HTML/default/viewer.htm#p0a62rd151ctown1x3...

 

I suggest you to run next code to check how soundex realy can help you:

   proc sql;

          create table test as select

             words, soundex(words) as sounds

             from have

           sort by sounds;

quit;

Occasional Contributor hx
Occasional Contributor
Posts: 17

Re: how to sum the values of similar words?

Thank you Shmuel! the thing is I need to calculate the sum of d1 d2s according to the words. And I found that if I use soundex, it will have questions such that word like Acid,acute and aged would be treated as the same, but that's not what I want. So i'm thinking using like creating some kind of loop to get this done, yet still no progress..
Respected Advisor
Posts: 4,173

Re: how to sum the values of similar words?

I guess the question is "how do you define similar?". It could be like soundex() returns the same value.

 

You could also have a look into function complev() and compged()

 

In the end of the day: You need to cluster word into groups and then you sum by the groups.

 

I'm not sure if SAS offers tools for such word clustering as part of SAS Text Analytics but if you've got the license then sure worth checking.

 

If you don't find anything in the SAS toolbox available to you then eventually consider using some other tool for clustering/standardizing your words. I'm sure there is a lot out there. 

https://github.com/mattalcock/blog/blob/master/2012/12/5/python-spell-checker.rst

https://lucene.apache.org/core/3_5_0/api/contrib-spellchecker/org/apache/lucene/search/spell/SpellCh...

 

 

 

Trusted Advisor
Posts: 1,571

Re: how to sum the values of similar words?

As @Patrick said:

In the end of the day: You need to cluster word into groups and then you sum by the groups.

In order to cluster words, I would check results of several methods by running:

 proc sql;
          create table temp as select
             words, 
             function(words) as clusterx
             from have
           sort by sounds;
quit;

where function is replaced by any suggested function or even a combination of them:

 

1)   compress(words,'aeiou')   /* suppressing voels */

2)   substr(words,1,4)               /* check first 4 chatacters only */

 

in order to use complev(), compged() functions you need to gather at least two strings in order to compare.

that can be done by:

 

 

     proc sort data=have; by words; run;
    data temp;
       set have;
            length prev_word $20;
             retain prev_word;
             if _n_  = 1 then prev_word = words;
            else do;
cluster1 = complev(words, prev_word);
cluster2 = compged(words, prev_word);                  output;                  prev_word = words;             end;    run;

   then check which one of the two - cluster1 or cluster2 - fits more to desired results.

 

   Finally you need to decide which clustering method gives the best results, then you can sum by

proc means data=temp;
     class clusterx;
     var d1 - d4;
     output out=want sum=;
run;
Super User
Posts: 10,035

Re: how to sum the values of similar words?


Firstly you need find out which name belong to the same cluster.
Once you have done that (check WANT dataset), 
you can do the math to calculate SUM. I think it is easy for you.



data x;
infile cards truncover expandtabs;
input words	: $40. d1	d2	d3	d4;
id+1;
cards;
beast	.	5	4	8
beasts	1	7	3	6
apopt	2	.	.	3
apopto	4	8	7	5
succinate	5	9	.	3
succinates	6	6	.	2
;
run;

proc sql;
create table have as
select a.words as from,b.words as to,spedis(a.words,b.words) as distance
 from x as a,x as b
  where a.id ne b.id
   group by a.words
    having calculated distance=min(calculated distance);
quit;


data full;
  set have end=last;
  if _n_ eq 1 then do;
   declare hash h();
    h.definekey('node');
     h.definedata('node');
     h.definedone();
  end;
  output;
  node=from; h.replace();
  from=to; to=node;
  output;
  node=from; h.replace();
  if last then h.output(dataset:'node');
  drop node;
run;


data want(keep=node household);
declare hash ha(ordered:'a');
declare hiter hi('ha');
ha.definekey('count');
ha.definedata('last');
ha.definedone();
declare hash _ha(hashexp: 20);
_ha.definekey('key');
_ha.definedone();

if 0 then set full;
declare hash from_to(dataset:'full(where=(from is not missing and to is not missing))',hashexp:20,multidata:'y');
 from_to.definekey('from');
 from_to.definedata('to');
 from_to.definedone();

if 0 then set node;
declare hash no(dataset:'node');
declare hiter hi_no('no');
 no.definekey('node');
 no.definedata('node');
 no.definedone();
 

do while(hi_no.next()=0);
 household+1; output;
 count=1;
 key=node;_ha.add();
 last=node;ha.add();
 rc=hi.first();
 do while(rc=0);
   from=last;rx=from_to.find();
   do while(rx=0);
     key=to;ry=_ha.check();
      if ry ne 0 then do;
       node=to;output;rr=no.remove(key:node);
       key=to;_ha.add();
       count+1;
       last=to;ha.add();
      end;
      rx=from_to.find_next();
   end;
   rc=hi.next();
end;
ha.clear();_ha.clear();
end;
stop;
run;

Super Contributor
Posts: 298

Re: how to sum the values of similar words?

Your example shows that the first 5 characters of WORDS looks to be good. You may choose 4 or 3 or 2 characters from any position using susbstr() function.

 

In the following code, I have declared W as 5-character string to which  when I assign WORDS takes the first leftmost

5 characters. Then DOW loop is used to get the sums. You may pre-sort the HAVE data set by W. In this example sorting is not a must as words are already GROUPED(so NOTSORTED option is used).

 

 

 

 

data have;
length w $5;
input words :$10. d1 d2 d3 d4;
w = words;
datalines;
beast .  5  4  8
beasts   1  7  3  6
apopt 2  .  .  3
apopto   4  8  7  5
succinate   5  9  .  3
succinates  6  6  .  2
;
run;

data need;
   do until(last.w);
      set have ;
      by w notsorted;

      array sum[4];
      array k[4] d1 - d4;
      if first.w then call missing(of sum[*]);
      do i = 1 to dim(k); sum[i] + k[i]; end;
   end;
keep words sum:;
run;

 

Ask a Question
Discussion stats
  • 6 replies
  • 256 views
  • 0 likes
  • 5 in conversation