Ngram modal

New Contributor
Posts: 2

Hello Experts,

 

I want to create an n-gram model. So far I have created a 3-gram model; the code is below.

 

data test;
sen = "The cow jumps over the moon";
run;

 


data test3;
set test;
nitems=countw(sen);
length combo $ 100;
if nitems >1;
do i=1 to nitems;
    combo = scan(sen,i);
    output;
        do j=i+1 to nitems;
            combo = catx('', scan(sen,i), scan(sen,j));
            output;
                do k=j+1 to nitems;
                      combo = catx('', scan(sen,i), scan(sen,j), scan(sen,k));
                      output;
                  end;
        end;
end;
run;
 

Thanks in Advance

Super User
Posts: 5,083

Re: Ngram modal

I don't see a question here. But if you are asking how to correct your program so that it outputs only the three-word combinations, here are a few small changes:

 

data test3;
set test;
nitems=countw(sen);
length combo $ 100;
if nitems >2;
do i=1 to nitems-2;
    combo = scan(sen,i);
    *output;
        do j=i+1 to nitems-1;
            combo = catx('', scan(sen,i), scan(sen,j));
           *output;
                do k=j+1 to nitems;
                      combo = catx('', scan(sen,i), scan(sen,j), scan(sen,k));
                      output;
                  end;
        end;
end;
run;

 

Valued Guide
Posts: 797

Re: Ngram modal

I thought n-grams were CONTIGUOUS words, but you are apparently trying to get all word COMBINATIONS (i.e. even if the "n-gram" elements are not contiguous). Is that really your intention? (And you say you want "3-grams", but you're also outputting single words and pairs of words.) If combinations are what you really want, I'd suggest using the ALLCOMBI call routine (code untested):

 

data want (drop=ix_:);
  set have;
  item_count=countw(sen);
  if item_count<3 then delete;

  length combo $80;

  array items{20} $12 _temporary_;
  do I=1 to item_count;
    items{I}=scan(sen,I,' ');
  end;

  array ix3 {*} ix_1-ix_3;
  array ix2 {*} ix_1-ix_2;
  array ix1 {*} ix_1-ix_1;


  do combosize=1 to 3;
    ncomb=comb(item_count,combosize);
    ix_1=.;
    do c=1 to ncomb;
      select (combosize);
        when (1) call allcombi(item_count,combosize, of ix1{*});
        when (2) call allcombi(item_count,combosize, of ix2{*});
        when (3) call allcombi(item_count,combosize, of ix3{*});
      end;
      combo=' ';
      do I=1 to combosize;
        combo=catx(' ',combo,items{ix3{I}});
      end;
      output;
    end;
  end;
run;

 

 

But if you only want what is typically defined as n-grams, it's a lot simpler:

 

data want;
  set have;

  item_count=countw(sen);
  length gram $36;

  if item_count>=3 then do g=1 to item_count;
    gram=scan(sen,g,' ');    /*1-gram*/
    output;
    if g=item_count then leave;

    gram=catx(' ',gram,scan(sen,g+1,' '));  /*bi-gram*/
    output;
    if g=item_count-1 then continue;  /*no room for a tri-gram; go on to the last word*/

    gram=catx(' ',gram,scan(sen,g+2,' '));  /*tri-gram*/
    output;
  end;
run;
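
The same idea extends to contiguous n-grams of any maximum length with a nested loop instead of one hand-written step per size. The following is only a sketch under assumptions: the input data set `have` and variable `sen` are taken from the code above, while the output name `ngrams`, the variables `nwords`, `maxn`, `start` and `n`, and the length `$200` are made up for illustration.

```sas
data ngrams;
  set have;
  length gram $200;           /* assumed to be long enough for maxn words */
  nwords = countw(sen, ' ');
  maxn   = 3;                 /* assumed maximum n-gram size; change as needed */

  do start = 1 to nwords;     /* position of the first word of the n-gram */
    gram = ' ';
    /* grow the n-gram one word at a time, stopping at the end of the sentence */
    do n = 1 to min(maxn, nwords - start + 1);
      gram = catx(' ', gram, scan(sen, start + n - 1, ' '));
      output;                 /* one row per (start, n) pair */
    end;
  end;
run;
```

Each row holds one n-gram together with its starting position (`start`) and size (`n`), so a later step can filter by size, for example `where n = 3` for tri-grams only.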

 

 

New Contributor
Posts: 2

Re: Ngram modal

The code above gives me the following result for the string "The cow jumps over the moon", as combinations of two words:

The cow
The jumps
The over
The the
The moon
cow jumps
cow over
cow the
cow moon
jumps over
jumps the
jumps moon
over the
over moon
the moon

However, the requirement is to generate the combinations dynamically for n words: there should be combinations of two, three, four, ... up to n words (all permutations and combinations). Please help me write a macro for this.

I hope I am making sense this time.
Valued Guide
Posts: 797

Re: Ngram modal

The program I suggested, after correcting a typographical error, generates singles, doubles, and triples with no change to the coding logic. So I do not understand why you got pairs and not triples (the error I corrected is not related to the size of the combinations).

 

You have now changed your requirement from generating triples (a fixed size) to a variable combination size. You can modify the program to accommodate a larger fixed size. For example, to handle combos of up to 10 words:

   (1) add an array statement for each size up to 10,
   (2) change "do combosize=1 to 3" to "do combosize=1 to min(10,item_count)",
   (3) add a "when" statement for each additional size,
   (4) change combo=catx(' ',combo,items{ix3{I}});

        to combo=catx(' ',combo,items{ix10{I}});

 

But in the end, this program is not meant to accommodate ANY size.
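
To get a feel for why a cap on the combination size matters, the number of output rows can be counted before generating them. This is a small sketch that reuses the COMB function from the program above; the variable names `n`, `k` and `total` are made up for illustration.

```sas
data _null_;
  n = 6;                      /* number of words in "The cow jumps over the moon" */
  total = 0;
  do k = 1 to n;              /* k = combination size */
    total = total + comb(n, k);
  end;
  put total=;                 /* writes total=63 to the log, i.e. 2**6 - 1 */
run;
```

The total is 2**n - 1, so it roughly doubles with every additional word in the sentence.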

Valued Guide
Posts: 797

Re: Ngram modal

A macro-ized version:

 

data have; 
  sen='the cow jumps over the moon'; 
run; 

%macro want(max=10); 
  %local max /*maximum combination size*/ 
         S   /*combo size index        */ ; 

  data want (drop=ix_:); 
    set have; 
    item_count=min(countw(sen),&max);  /*cap at &max words so the arrays and WHEN clauses are never overrun*/
    if item_count<3 then delete; 

    length combo $%eval(100+&max*13); 

    array items{&max} $12 _temporary_; 
    do I=1 to min(item_count,&max); 
      items{I}=scan(sen,I,' '); 
    end; 

    %do S=1 %to &max; 
      array ix&S {*} ix_1-ix_&S ; 
    %end; 

    do combosize=1 to item_count; 
      ncomb=comb(item_count,combosize); 
      ix_1=.; 
      do c=1 to ncomb;
        select (combosize);
        %do s=1 %to &max ;
          when(&S) call allcombi(item_count,combosize,of ix&S{*});
        %end;
        end;
        combo=' ';
        do I=1 to combosize;
          combo=catx(' ',combo,items{ix&max{I}});
        end;
        output;
      end;
    end;

  run;
  %mend;

%want(max=15);

 

 

As I said earlier, the program is not meant for a variable combination size, which is why it now has to be macro-ized.
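
For comparison, here is a sketch of a different technique that needs neither ALLCOMBI nor a macro: every non-empty subset of the words is enumerated with a counter whose binary digits say which words to include. It is an alternative approach, not a revision of the program above, and it rests on assumptions: `have` and `sen` come from the code above, while the names `all_combos`, `nwords`, `mask`, `size` and `i`, the length `$200`, and the 15-word limit are chosen for illustration only.

```sas
data all_combos;
  set have;
  length combo $200;                    /* assumed to be long enough */
  nwords = countw(sen, ' ');
  if nwords > 15 then delete;           /* assumed guard: output has 2**nwords - 1 rows */

  do mask = 1 to 2**nwords - 1;         /* one value of mask per non-empty subset */
    combo = ' ';
    size  = 0;
    do i = 1 to nwords;
      /* binary digit i of mask decides whether word i is part of the combo */
      if mod(int(mask / 2**(i-1)), 2) = 1 then do;
        combo = catx(' ', combo, scan(sen, i, ' '));
        size  = size + 1;
      end;
    end;
    output;
  end;
  drop mask i;
run;
```

For the six-word sentence this produces 2**6 - 1 = 63 rows, with the words of each combo kept in their original order. The `size` variable allows selecting a single combination size afterwards, for example `where size = 2` for the pairs.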
