ashish2016sahu
Calcite | Level 5

Hello Experts,

 

I want to create an n-gram model. I have created a 3-gram model, and the code is below.

 

data test;
sen = "The cow jumps over the moon";
run;

 


data test3;
set test;
nitems=countw(sen);
length combo $ 100;
if nitems >1;
do i=1 to nitems;
    combo = scan(sen,i);
    output;
        do j=i+1 to nitems;
            combo = catx(' ', scan(sen,i), scan(sen,j));
            output;
                do k=j+1 to nitems;
                      combo = catx(' ', scan(sen,i), scan(sen,j), scan(sen,k));
                      output;
                  end;
        end;
end;
run;
 

Thanks in Advance

5 REPLIES
Astounding
PROC Star

I don't see a question here. But if you are asking how to correct your program to get all three-word combinations, here are a few small changes:

 

data test3;
set test;
nitems=countw(sen);
length combo $ 100;
if nitems >2;
do i=1 to nitems-2;
    combo = scan(sen,i);
    *output;
        do j=i+1 to nitems-1;
            combo = catx(' ', scan(sen,i), scan(sen,j));
           *output;
                do k=j+1 to nitems;
                      combo = catx(' ', scan(sen,i), scan(sen,j), scan(sen,k));
                      output;
                  end;
        end;
end;
run;

 

mkeintz
PROC Star

I thought n-grams were CONTIGUOUS words, but you are apparently trying to get all word COMBINATIONS (i.e., even if the "ngram" elements are not contiguous).  Is that really your intention?  (And you say you want "3grams", but you're also outputting single words and pairs of words.)  If combinations are what you really want, I'd suggest using the CALL ALLCOMBI routine (code untested):

 

data want (drop=ix_:);
  set have;
  item_count=countw(sen);
  if item_count<3 then delete;

  length combo $80;

  array items{20} $12 _temporary_;
  do I=1 to item_count;
    items{I}=scan(sen,I,' ');
  end;

  array ix3 {*} ix_1-ix_3;
  array ix2 {*} ix_1-ix_2;
  array ix1 {*} ix_1-ix_1;


  do combosize=1 to 3;
    ncomb=comb(item_count,combosize);
    ix_1=.;
    do c=1 to ncomb;
      select (combosize);
        when (1) call allcombi(item_count,combosize, of ix1{*});
        when (2) call allcombi(item_count,combosize, of ix2{*});
        when (3) call allcombi(item_count,combosize, of ix3{*});
      end;
      combo=' ';
      do I=1 to combosize;
        combo=catx(' ',combo,items{ix3{I}});
      end;
      output;
    end;
  end;
run;

 

 

But if you only want what is typically defined as n-grams, it's a lot simpler:

 

data want;
  set have;

  item_count=countw(sen);
  length gram $36;

  if item_count>=3 then do g=1 to item_count;
    gram=scan(sen,g,' ');    /*1-gram*/
    output;
    if g=item_count then leave;

    gram=catx(' ',gram,scan(sen,g+1,' '));  /*bi-gram*/
    output;
    if g=item_count-1 then continue;  /* no third word here, but the last 1-gram is still needed */

    gram=catx(' ',gram,scan(sen,g+2,' '));  /*tri-gram*/
    output;
  end;
run;
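
For reference, the same idea can be generalized to any n-gram size with one extra loop. This is only an untested sketch: the data set name ngrams, the variables max_n, n, start, and w, and the $200 length are assumptions made for illustration, and it presumes that words are separated by blanks.

```sas
/* Sample input, same sentence as in the original question */
data have;
  sen = "The cow jumps over the moon";
run;

data ngrams (keep=sen n start gram);
  set have;
  length gram $200;                 /* assumed to be long enough */
  item_count = countw(sen, ' ');
  max_n = 3;                        /* assumed largest n-gram size */

  do n = 1 to min(max_n, item_count);
    /* an n-gram starting at word START covers words START..START+n-1 */
    do start = 1 to item_count - n + 1;
      gram = ' ';
      do w = start to start + n - 1;
        /* CATX skips the blank GRAM on the first pass */
        gram = catx(' ', gram, scan(sen, w, ' '));
      end;
      output;
    end;
  end;
run;
```

With the six-word sample sentence and max_n=3, this writes 6 single words, 5 pairs, and 4 triples. Raising max_n gives longer contiguous runs without adding more code.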

 

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
ashish2016sahu
Calcite | Level 5
The code above gives me this result for the string "The cow jumps over the moon", as combinations of two words:

The cow
The jumps
The over
The the
The Moon
cow jumps
cow over
cow the
cow moon
jumps over
jumps the
jumps moon
over the
over moon
the moon

But the requirement here is to generate the combinations dynamically for n words: there should be combinations of twos, threes, fours, and so on up to n words (all permutations and combinations). Please help me write a macro for this.

I hope I am making sense this time.
mkeintz
PROC Star

The program I suggested, after correcting a typographical error, generates singles, doubles, and triples with no change to the coding logic.  So I do not understand why you got pairs and not triples (the error I corrected is not related to the size of the combinations).

 

You have now changed your requirement from generating triples (a fixed size) to a variable combination size. You can modify the program to accommodate a larger fixed size.  You can do combos of up to 10 words by:

   (1) adding an array statement for each size up to 10,
   (2) changing "do combosize=1 to 3"  to "do combosize=1 to min(10,item_count)",
   (3) adding a "when" statement for each additional size, and
   (4) changing combo=catx(' ',combo,items{ix3{I}});

        to combo=catx(' ',combo,items{ix10{I}});

 

But in the end, this program is not meant to accommodate ANY size.

mkeintz
PROC Star

A macro-ized version:

 

data have; 
  sen='the cow jumps over the moon'; 
run; 

%macro want(max=10); 
  %local max /*maximum combination size*/ 
         S   /*combo size index        */ ; 

  data want (drop=ix_:); 
    set have; 
    item_count=countw(sen); 
    if item_count<3 then delete; 

    length combo $%eval(100+&max*13); 

    array items{&max} $12 _temporary_; 
    do I=1 to min(item_count,&max); 
      items{I}=scan(sen,I,' '); 
    end; 

    %do S=1 %to &max; 
      array ix&S {*} ix_1-ix_&S ; 
    %end; 

    do combosize=1 to item_count; 
      ncomb=comb(item_count,combosize); 
      ix_1=.; 
      do c=1 to ncomb;
        select (combosize);
        %do s=1 %to &max ;
          when(&S) call allcombi(item_count,combosize,of ix&S{*});
        %end;
        end;
        combo=' ';
        do I=1 to combosize;
          combo=catx(' ',combo,items{ix&max{I}});
        end;
        output;
      end;
    end;

  run;
  %mend;

%want(max=15);

 

 

As I said earlier, the program is not meant for a variable combination size, which is why it now has to be macro-ized.
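
One alternative that avoids the macro, offered only as an untested sketch, is to drop ALLCOMBI and treat every combination as a bit pattern. Each integer from 1 to 2**n - 1 selects a different subset of the n words. The data set name all_combos, the variables mask and w, the $200 length, and the cap of 15 words are assumptions made for illustration. BAND is the standard SAS bitwise AND function.

```sas
/* Sample input, same sentence as above */
data have;
  sen = 'the cow jumps over the moon';
run;

data all_combos (keep=sen combosize combo);
  set have;
  length combo $200;                 /* assumed to be long enough */
  item_count = countw(sen, ' ');

  /* assumed safety cap: the row count is 2**item_count - 1 */
  if item_count > 15 then delete;

  do mask = 1 to 2**item_count - 1;
    combo = ' ';
    combosize = 0;
    do w = 1 to item_count;
      /* word W is in this combination when bit W-1 of MASK is set */
      if band(mask, 2**(w-1)) > 0 then do;
        combo = catx(' ', combo, scan(sen, w, ' '));
        combosize = combosize + 1;
      end;
    end;
    output;
  end;
run;
```

For the six-word sentence this writes 63 rows, one for every combination of every size. Rows come out in mask order rather than grouped by size, so sort by combosize if that order matters. Words inside each combination keep their sentence order, so true permutations (reordered words) are not produced.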

