Solved: Re: How to scan string for same repeating pattern

StanleyManning · Posted 06-03-2022 05:59 PM

I'm a bit of a novice, but is there a SAS function that is able to interrogate a long text string and detect for specific patterns. For example, I am trying to find to most common occurrence of a the same pattern of 3 letters. For example, 'XYZ' appears 5 times in the 1st obs and 'ABC' appears 3 times in the 2nd obs.

I was thinking a do loop with an iterative index script might work, but I can't seem to get it right. Appreciate any help for a newbie.

DATA &ODSN1;
INFILE DATALINES TRUNCOVER;
INPUT DNA_STR $85.
;
DATALINES;
CGGAGGACXYZTCTAGGTAXYZACGCTTATCAGXYZGTCCATAGGACATXYZTCG123CTCTAGGXYZGAATCAGGTGCT12TC
CGGA456CABCTCTAGGTAABCACGCTTATCAG123GTCCATAGGACATXYZTCGGAACTCTAGGABCGAATCAG987CTTATC
;
RUN;

mkeintz · Posted 06-06-2022 12:57 AM

If you had a list of the distinct triplet patterns in DNA_STR, then you could use the COUNT function to get the frequency of each pattern. But you need to avoid counting patterns that cross boundaries between consecutive triplets. This could be done be creating an extended dna string (called _test_str below), say by separating each triplet by a dot. Then the COUNT function would never count patterns that cross boundaries.

DATA have;
  INFILE DATALINES TRUNCOVER;
  INPUT DNA_STR $84.;
DATALINES;
CGGAGGACXYZTCTAGGTAXYZACGCTTATCAGXYZGTCCATAGGACATXYZTCG123CTCTAGGXYZGAATCAGGTGCT12TC
CGGA456CABCTCTAGGTAABCACGCTTATCAG123GTCCATAGGACATXYZTCGGAACTCTAGGABCGAATCAG987CTTATC
RUN;

data want (drop= _:);
  string_id=_n_;
  set have;

  length _test_str $112 ; /*Original DNA_STR length plus 1 extra per triplet*/

  array _most_freq_patterns {28} $3;

  do P=1 to 82 by 3;
    _test_str=catx('.',_test_str,substr(dna_str,p,3));
  end;
  _test_str=cats(_test_str,'.');

  length pattern $3;
  do until (_total_freq=28);
    pattern=substr(_test_str,1,3);
    _freq=count(_test_str,pattern);
    _total_freq=sum(_total_freq,_freq);
    if _freq>max_freq then do;
      max_freq=_freq;
      call missing(of _most_freq_patterns{*});
      _n_most_freq=1;
      _most_freq_patterns{1}=pattern;
    end;
    else if _freq=max_freq then do;
      _n_most_freq=_n_most_freq+1;
      _most_freq_patterns{_n_most_freq}=pattern;
    end;
    _test_str=left(transtrn(_test_str,cats(pattern,'.'),''));
  end;
  do P=1 to _n_most_freq;
    pattern=_most_freq_patterns{p};
    output;
  end;
run;

Editted addition - explanatory notes.

The "_test_str=left(.......)" effectively removes all instances of the current pattern (including the trailing '.') and left justfies the result. Because the result is left justified, the next pattern to count is already in positions 1 through 3 of _test_str.

The array size of _most_freq_patterns is 28, just in case all 28 triplets in DNA_STR occur only once. - making for a 28-way tie.

Building _test_str can be expensive, since it concatenates the increasly ong _TEST_STR to the new triplet 28 times. this is a process that will take up more time at more than linear rates as the length of the original string increases (say from 84 to 8400). If such costs are unacceptable, one could replace this:

  do P=1 to 82 by 3;
    _test_str=catx('.',_test_str,substr(dna_str,p,3));
  end;
  _test_str=cats(_test_str,'.');

with this:

  call pokelong(dna_str,addrlong(_most_freq_patterns{1}),84);
  _test_str=trim(catx('.',of _most_freq_patterns)) || '.';1

The POKELONG call routine copies the 84 bytes of memory for DNA_STR to the 84 bytes of memory occupied by the 28 contiguous 3-byte elements of the _most_freq_patterns array - wherever in memory they may be. Then the CATX function creates the _TEST_STR string in a single pass - not 28 passes. Imagine the savings if DNA_STR were 8400 bytes.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

mkeintz · Posted 06-03-2022 09:01 PM

So you want to find the most frequent three-letter sequence in a string of (up to 85) letters. I presume you intend to keep only non-overlapping sequences (so ABABABA would be two ABA's, not three), right?

Also, what if you had 5 non-overlapping sequences of 4 letters (ABCD......ABCD.....ABCD....ABCD....ABCD), and no other sequence of 3 letters or more has more than 4 instances. Does that mean you have a tie for most frequent three-letter sequence - i.e. five ABC's and five BCD's?

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Patrick · Posted 06-06-2022 03:47 AM

Here another coding option that should work. Data WANT will contain multiple rows per single source row if the there is more than one "most frequent" pattern.

data have;
  infile datalines truncover;
  input dna_str $85.;
  datalines;
CGGAGGACXYZTCTAGGTAXYZACGCTTATCAGXYZGTCCATAGGACATXYZTCG123CTCTAGGXYZGAATCAGGTGCT12TC
CGGA456CABCTCTAGGTAABCACGCTTATCAG123GTCCATAGGACATXYZTCGGAACTCTAGGABCGAATCAG987CTTATC
;
run;

data inter;
  set have;
  row=_n_;
  length pattern $3;
  n_pattern=length(dna_str)/3;
  do start=1 to n_pattern;
    pattern=substr(dna_str,start,3);
    output;
  end;
  drop dna_str;
run;

proc freq data=inter nlevels noprint;
  table pattern /out=byvalue nopercent;
  by row;
run;

proc rank data=byvalue ties=dense descending out=want(where=(rank=1));
  by row;
  var count;
  ranks rank;
run;

s_lassen · Posted 06-04-2022 04:26 AM

As your input variable is called DNA_STR, I assume that it is a string of DNA bases - although you put in some numbers and letters that are not in the normal DNA base nomenclature (C, G, A or T), but maybe that was just for the example.

But DNA strings are normally read in sets of 3 bases at a time. If you start at the beginning, your first string would come out as CGG, AGG, ACX,YZT, CTA etc. No XYZ there, because of the alignment.

As already remarked by @mkeintz you probably need to specify your criteria more clearly.

StanleyManning · Posted 06-04-2022 10:04 PM

Thank you for replying. Yes, the numbers were just an example. I am taking a bio-infomattics class, and one the problems is how to analyze a string a find the most common occurrence of a 3-byte sequence. So, the code reads the first position (substring) and then increments 3 bytes to position 4, 7 (possibly index and do loop 'i+2')... to the end of the given text string. DNA strings in actuality are much longer. This problem is a dummy version this dummy can't figure out. I appreciate any help.

mkeintz · Posted 06-06-2022 12:57 AM

If you had a list of the distinct triplet patterns in DNA_STR, then you could use the COUNT function to get the frequency of each pattern. But you need to avoid counting patterns that cross boundaries between consecutive triplets. This could be done be creating an extended dna string (called _test_str below), say by separating each triplet by a dot. Then the COUNT function would never count patterns that cross boundaries.

DATA have;
  INFILE DATALINES TRUNCOVER;
  INPUT DNA_STR $84.;
DATALINES;
CGGAGGACXYZTCTAGGTAXYZACGCTTATCAGXYZGTCCATAGGACATXYZTCG123CTCTAGGXYZGAATCAGGTGCT12TC
CGGA456CABCTCTAGGTAABCACGCTTATCAG123GTCCATAGGACATXYZTCGGAACTCTAGGABCGAATCAG987CTTATC
RUN;

data want (drop= _:);
  string_id=_n_;
  set have;

  length _test_str $112 ; /*Original DNA_STR length plus 1 extra per triplet*/

  array _most_freq_patterns {28} $3;

  do P=1 to 82 by 3;
    _test_str=catx('.',_test_str,substr(dna_str,p,3));
  end;
  _test_str=cats(_test_str,'.');

  length pattern $3;
  do until (_total_freq=28);
    pattern=substr(_test_str,1,3);
    _freq=count(_test_str,pattern);
    _total_freq=sum(_total_freq,_freq);
    if _freq>max_freq then do;
      max_freq=_freq;
      call missing(of _most_freq_patterns{*});
      _n_most_freq=1;
      _most_freq_patterns{1}=pattern;
    end;
    else if _freq=max_freq then do;
      _n_most_freq=_n_most_freq+1;
      _most_freq_patterns{_n_most_freq}=pattern;
    end;
    _test_str=left(transtrn(_test_str,cats(pattern,'.'),''));
  end;
  do P=1 to _n_most_freq;
    pattern=_most_freq_patterns{p};
    output;
  end;
run;

Editted addition - explanatory notes.

The "_test_str=left(.......)" effectively removes all instances of the current pattern (including the trailing '.') and left justfies the result. Because the result is left justified, the next pattern to count is already in positions 1 through 3 of _test_str.

The array size of _most_freq_patterns is 28, just in case all 28 triplets in DNA_STR occur only once. - making for a 28-way tie.

Building _test_str can be expensive, since it concatenates the increasly ong _TEST_STR to the new triplet 28 times. this is a process that will take up more time at more than linear rates as the length of the original string increases (say from 84 to 8400). If such costs are unacceptable, one could replace this:

  do P=1 to 82 by 3;
    _test_str=catx('.',_test_str,substr(dna_str,p,3));
  end;
  _test_str=cats(_test_str,'.');

with this:

  call pokelong(dna_str,addrlong(_most_freq_patterns{1}),84);
  _test_str=trim(catx('.',of _most_freq_patterns)) || '.';1

The POKELONG call routine copies the 84 bytes of memory for DNA_STR to the 84 bytes of memory occupied by the 28 contiguous 3-byte elements of the _most_freq_patterns array - wherever in memory they may be. Then the CATX function creates the _TEST_STR string in a single pass - not 28 passes. Imagine the savings if DNA_STR were 8400 bytes.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

s_lassen · Posted 06-07-2022 05:48 AM

I would convert the data to a long format. As you probably do not want to have a copy of the long DNA_STR in the long data, start by adding a surrogate key (ID) to your input:

DATA have;
  INFILE DATALINES TRUNCOVER;
  INPUT DNA_STR $85.;
  ID=_N_;
DATALINES;
CGGAGGACXYZTCTAGGTAXYZACGCTTATCAGXYZGTCCATAGGACATXYZTCG123CTCTAGGXYZGAATCAGGTGCT12TC
CGGA456CABCTCTAGGTAABCACGCTTATCAG123GTCCATAGGACATXYZTCGGAACTCTAGGABCGAATCAG987CTTATC
;
RUN;

Then, convert to long format and use PROC SUMMARY (or PROC FREQ) to get the frequencies:

data long;
  set have;
  length dna_seq $3;
  do pos=1 by 3 to length(DNA_STR);
    dna_seq=substr(DNA_STR,pos,3);
    output;
    end;
  keep ID pos dna_seq;
run; 

proc summary data=long nway;
  class ID dna_seq;
  output out=counts(rename=(_freq_=seq_count) drop=_type_);
run;

Then you can get the values with maximum counts for each ID using SQL:

proc sql;
  select * from counts
  group by id
  having(seq_count)=max(seq_count);
quit;

PGStats · Posted 06-05-2022 10:48 PM

If the patterns are short (e.g. 3 or 4 letters) and the strings are very long, I guess you might as well count every pattern in the string:

DATA have;
INFILE DATALINES TRUNCOVER;
INPUT DNA_STR $85.;
str_id + 1;
DATALINES;
CGGAGGACXYZTCTAGGTAXYZACGCTTATCAGXYZGTCCATAGGACATXYZTCG123CTCTAGGXYZGAATCAGGTGCT12TC
CGGA456CABCTCTAGGTAABCACGCTTATCAG123GTCCATAGGACATXYZTCGGAACTCTAGGABCGAATCAG987CTTATC
;

data want;
array patern {64} $3 _temporary_;
if _n_ = 1 then do;
    do c1 = "A", "T", "G", "C";
        do c2 = "A", "T", "G", "C";
            do c3 = "A", "T", "G", "C";
                i + 1;
                patern{i} = cats(c1, c2, c3);
                end;
            end;
        end;
    end;
set have;

do p = 1 to dim(patern);
    count = 0; pos = 1; pat = patern{p};
    do i = 1 to 9999 until(pos=0);
        pos = find(DNA_STR, pat, pos);
        if pos > 0 then do;
            count = count + 1;
            pos = pos + length(pat);
            end;
        end;
    if count > 1 then output;
    end;
keep str_id pat count; 
run;

proc sql;
select * from want group by str_id having count = max(count);
quit;

PG

StanleyManning · Posted 06-07-2022 08:48 AM

Thank you very much for taking the time to respond. All of these solutions worked well. I don't quite grasp some of the coding aspects, but I am going to try step thru by section and try to understand it more.

SAS Innovate 2025: Call for Content

Classroom Training Available!