<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: How to scan string for same repeating pattern in SAS Programming</title>
    <link>https://communities.sas.com/t5/SAS-Programming/How-to-scan-string-for-same-repeating-pattern/m-p/816519#M322316</link>
    <description>&lt;P&gt;Thank you for replying.&amp;nbsp; Yes, the numbers were just an example.&amp;nbsp; I am taking a bioinformatics class, and one of the problems is how to analyze a string and find the most common occurrence of a 3-byte sequence.&amp;nbsp; So, the code reads the first position (substring) and then increments 3 bytes to position 4, 7 (possibly index and do loop 'i+2')... to the end of the given text string.&amp;nbsp; DNA strings in actuality are much longer.&amp;nbsp; This problem is a dummy version this dummy can't figure out.&amp;nbsp; I appreciate any help.&lt;/P&gt;</description>
    <pubDate>Sun, 05 Jun 2022 02:04:48 GMT</pubDate>
    <dc:creator>StanleyManning</dc:creator>
    <dc:date>2022-06-05T02:04:48Z</dc:date>
    <item>
      <title>How to scan string for same repeating pattern</title>
      <link>https://communities.sas.com/t5/SAS-Programming/How-to-scan-string-for-same-repeating-pattern/m-p/816458#M322267</link>
      <description>&lt;P&gt;I'm a bit of a novice, but is there a SAS function that is able to interrogate a long text string and detect specific patterns?&amp;nbsp; For example, I am trying to find the most common occurrence of the same pattern of 3 letters.&amp;nbsp; For example, 'XYZ' appears 5 times in the 1st obs and 'ABC' appears 3 times in the 2nd obs.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I was thinking a do loop with an iterative index script might work, but I can't seem to get it right.&amp;nbsp; Appreciate any help for a newbie.&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;DATA &amp;amp;ODSN1;&lt;BR /&gt;INFILE DATALINES TRUNCOVER;&lt;BR /&gt;INPUT DNA_STR $85.&lt;BR /&gt;;&lt;BR /&gt;DATALINES;&lt;BR /&gt;CGGAGGACXYZTCTAGGTAXYZACGCTTATCAGXYZGTCCATAGGACATXYZTCG123CTCTAGGXYZGAATCAGGTGCT12TC&lt;BR /&gt;CGGA456CABCTCTAGGTAABCACGCTTATCAG123GTCCATAGGACATXYZTCGGAACTCTAGGABCGAATCAG987CTTATC&lt;BR /&gt;;&lt;BR /&gt;RUN;&lt;/P&gt;</description>
      <pubDate>Fri, 03 Jun 2022 21:59:55 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/How-to-scan-string-for-same-repeating-pattern/m-p/816458#M322267</guid>
      <dc:creator>StanleyManning</dc:creator>
      <dc:date>2022-06-03T21:59:55Z</dc:date>
    </item>
    <item>
      <title>Re: How to scan string for same repeating pattern</title>
      <link>https://communities.sas.com/t5/SAS-Programming/How-to-scan-string-for-same-repeating-pattern/m-p/816475#M322281</link>
      <description>&lt;P&gt;So you want to find the most frequent three-letter sequence in a string of (up to 85) letters.&amp;nbsp; I presume you intend to keep only non-overlapping sequences (so ABABABA would be two ABA's, not three), right?&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Also, what if you had 5 non-overlapping sequences of 4 letters&amp;nbsp; (ABCD......ABCD.....ABCD....ABCD....ABCD), and no other sequence of 3 letters or more has more than 4 instances.&amp;nbsp; Does that mean you have a tie for most frequent three-letter sequence - i.e.&amp;nbsp; five ABC's&amp;nbsp; and five BCD's?&lt;/P&gt;</description>
      <pubDate>Sat, 04 Jun 2022 04:13:20 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/How-to-scan-string-for-same-repeating-pattern/m-p/816475#M322281</guid>
      <dc:creator>mkeintz</dc:creator>
      <dc:date>2022-06-04T04:13:20Z</dc:date>
    </item>
    <item>
      <title>Re: How to scan string for same repeating pattern</title>
      <link>https://communities.sas.com/t5/SAS-Programming/How-to-scan-string-for-same-repeating-pattern/m-p/816491#M322292</link>
      <description>&lt;P&gt;As your input variable is called DNA_STR, I assume that it is a string of DNA bases - although you put in some numbers and letters that are not in the normal DNA base nomenclature (C, G, A or T), but maybe that was just for the example.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;But DNA strings are normally read in sets of 3 bases at a time. If you start at the beginning, your first string would come out as&amp;nbsp;&lt;SPAN&gt;CGG, AGG, ACX, YZT, CTA etc. No XYZ there, because of the alignment.&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
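&lt;P&gt;A minimal sketch of that aligned reading, assuming a shortened copy of the first example string; the names pos and triplet are illustrative and not from the original post:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data _null_;
  length dna_str $85 triplet $3;
  /* Shortened copy of the first example string (22 characters) */
  dna_str = 'CGGAGGACXYZTCTAGGTAXYZ';
  /* Visit positions 1, 4, 7, ... and stop before an incomplete triplet */
  do pos = 1 to length(dna_str) - 2 by 3;
    triplet = substr(dna_str, pos, 3);
    put pos= triplet=;
  end;
run;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;Because the loop only starts at positions 1, 4, 7 and so on, the first XYZ, which begins at position 9, is split across two triplets and never shows up as a unit.&lt;/P&gt;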
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;As already remarked by&amp;nbsp;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/31461"&gt;@mkeintz&lt;/a&gt;&amp;nbsp;you probably need to specify your criteria more clearly.&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Sat, 04 Jun 2022 08:26:48 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/How-to-scan-string-for-same-repeating-pattern/m-p/816491#M322292</guid>
      <dc:creator>s_lassen</dc:creator>
      <dc:date>2022-06-04T08:26:48Z</dc:date>
    </item>
    <item>
      <title>Re: How to scan string for same repeating pattern</title>
      <link>https://communities.sas.com/t5/SAS-Programming/How-to-scan-string-for-same-repeating-pattern/m-p/816519#M322316</link>
      <description>&lt;P&gt;Thank you for replying.&amp;nbsp; Yes, the numbers were just an example.&amp;nbsp; I am taking a bioinformatics class, and one of the problems is how to analyze a string and find the most common occurrence of a 3-byte sequence.&amp;nbsp; So, the code reads the first position (substring) and then increments 3 bytes to position 4, 7 (possibly index and do loop 'i+2')... to the end of the given text string.&amp;nbsp; DNA strings in actuality are much longer.&amp;nbsp; This problem is a dummy version this dummy can't figure out.&amp;nbsp; I appreciate any help.&lt;/P&gt;</description>
      <pubDate>Sun, 05 Jun 2022 02:04:48 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/How-to-scan-string-for-same-repeating-pattern/m-p/816519#M322316</guid>
      <dc:creator>StanleyManning</dc:creator>
      <dc:date>2022-06-05T02:04:48Z</dc:date>
    </item>
    <item>
      <title>Re: How to scan string for same repeating pattern</title>
      <link>https://communities.sas.com/t5/SAS-Programming/How-to-scan-string-for-same-repeating-pattern/m-p/816586#M322344</link>
      <description>&lt;P&gt;If the patterns are short (e.g. 3 or 4 letters) and the strings are very long, I guess you might as well count every pattern in the string:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;DATA have;
INFILE DATALINES TRUNCOVER;
INPUT DNA_STR $85.;
str_id + 1;
DATALINES;
CGGAGGACXYZTCTAGGTAXYZACGCTTATCAGXYZGTCCATAGGACATXYZTCG123CTCTAGGXYZGAATCAGGTGCT12TC
CGGA456CABCTCTAGGTAABCACGCTTATCAG123GTCCATAGGACATXYZTCGGAACTCTAGGABCGAATCAG987CTTATC
;

data want;
array patern {64} $3 _temporary_;
if _n_ = 1 then do;
    do c1 = "A", "T", "G", "C";
        do c2 = "A", "T", "G", "C";
            do c3 = "A", "T", "G", "C";
                i + 1;
                patern{i} = cats(c1, c2, c3);
                end;
            end;
        end;
    end;
set have;

do p = 1 to dim(patern);
    count = 0; pos = 1; pat = patern{p};
    do i = 1 to 9999 until(pos=0);
        pos = find(DNA_STR, pat, pos);
        if pos &amp;gt; 0 then do;
            count = count + 1;
            pos = pos + length(pat);
            end;
        end;
    if count &amp;gt; 1 then output;
    end;
keep str_id pat count; 
run;

proc sql;
select * from want group by str_id having count = max(count);
quit;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="PGStats_0-1654483585007.png" style="width: 400px;"&gt;&lt;img src="https://communities.sas.com/t5/image/serverpage/image-id/72002iA5A62DD2B280F487/image-size/medium?v=v2&amp;amp;px=400" role="button" title="PGStats_0-1654483585007.png" alt="PGStats_0-1654483585007.png" /&gt;&lt;/span&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Mon, 06 Jun 2022 02:48:15 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/How-to-scan-string-for-same-repeating-pattern/m-p/816586#M322344</guid>
      <dc:creator>PGStats</dc:creator>
      <dc:date>2022-06-06T02:48:15Z</dc:date>
    </item>
    <item>
      <title>Re: How to scan string for same repeating pattern</title>
      <link>https://communities.sas.com/t5/SAS-Programming/How-to-scan-string-for-same-repeating-pattern/m-p/816588#M322345</link>
      <description>&lt;P&gt;If you had a list of the distinct triplet patterns in DNA_STR, then you could use the COUNT function to get the frequency of each pattern.&amp;nbsp; But you need to avoid counting patterns that cross boundaries between consecutive triplets.&amp;nbsp; This could be done by creating an extended DNA string (called _test_str below), say by separating each triplet by a dot.&amp;nbsp; Then the COUNT function would never count patterns that cross boundaries.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;DATA have;
  INFILE DATALINES TRUNCOVER;
  INPUT DNA_STR $84.;
DATALINES;
CGGAGGACXYZTCTAGGTAXYZACGCTTATCAGXYZGTCCATAGGACATXYZTCG123CTCTAGGXYZGAATCAGGTGCT12TC
CGGA456CABCTCTAGGTAABCACGCTTATCAG123GTCCATAGGACATXYZTCGGAACTCTAGGABCGAATCAG987CTTATC
RUN;

data want (drop= _:);
  string_id=_n_;
  set have;

  length _test_str $112 ; /*Original DNA_STR length plus 1 extra per triplet*/

  array _most_freq_patterns {28} $3;

  do P=1 to 82 by 3;
    _test_str=catx('.',_test_str,substr(dna_str,p,3));
  end;
  _test_str=cats(_test_str,'.');

  length pattern $3;
  do until (_total_freq=28);
    pattern=substr(_test_str,1,3);
    _freq=count(_test_str,pattern);
    _total_freq=sum(_total_freq,_freq);
    if _freq&amp;gt;max_freq then do;
      max_freq=_freq;
      call missing(of _most_freq_patterns{*});
      _n_most_freq=1;
      _most_freq_patterns{1}=pattern;
    end;
    else if _freq=max_freq then do;
      _n_most_freq=_n_most_freq+1;
      _most_freq_patterns{_n_most_freq}=pattern;
    end;
    _test_str=left(transtrn(_test_str,cats(pattern,'.'),''));
  end;
  do P=1 to _n_most_freq;
    pattern=_most_freq_patterns{p};
    output;
  end;
run;
&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Edited addition - explanatory notes.&lt;/P&gt;
&lt;P&gt;&amp;nbsp; The "_test_str=left(.......)" effectively removes all instances of the current pattern (including the trailing '.') and left-justifies the result.&amp;nbsp; Because the result is left-justified, the next pattern to count is already in positions 1 through 3 of _test_str.&lt;/P&gt;
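&lt;P&gt;A small sketch of that single step, assuming a made-up four-triplet string that is already in the dotted layout (the values are illustrative only):&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data _null_;
  length _test_str $40 pattern $3;
  _test_str = 'CGG.AGG.CGG.ACX.';
  /* The triplet to count is whatever sits in positions 1-3 */
  pattern = substr(_test_str, 1, 3);
  _freq = count(_test_str, pattern);
  /* Replace every 'CGG.' and left-justify what remains. If SAS treats  */
  /* the '' replacement as a single blank rather than a zero-length     */
  /* string, LEFT still moves the next triplet into positions 1-3.      */
  _test_str = left(transtrn(_test_str, cats(pattern, '.'), ''));
  next_pattern = substr(_test_str, 1, 3);
  put pattern= _freq= next_pattern=;
run;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;Writing pattern, _freq and next_pattern to the log makes it easy to check that each pass consumes one distinct triplet and exposes the next one.&lt;/P&gt;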
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;The array size of _most_freq_patterns is 28, just in case all 28 triplets in DNA_STR occur only once. - making for a 28-way tie.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Building _test_str can be expensive, since it concatenates the increasingly long _TEST_STR to the new triplet 28 times.&amp;nbsp; This is a process whose cost grows faster than linearly as the length of the original string increases (say from 84 to 8400).&amp;nbsp; If such costs are unacceptable, one could replace this:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;  do P=1 to 82 by 3;
    _test_str=catx('.',_test_str,substr(dna_str,p,3));
  end;
  _test_str=cats(_test_str,'.');&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;with this:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;  call pokelong(dna_str,addrlong(_most_freq_patterns{1}),84);
  _test_str=trim(catx('.',of _most_freq_patterns{*})) || '.';&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;The POKELONG call routine copies the 84 bytes of memory for DNA_STR to the 84 bytes of memory occupied by the 28 contiguous 3-byte elements of the _most_freq_patterns array - wherever in memory they may be.&amp;nbsp; Then the CATX function creates the _TEST_STR string in a single pass - not 28 passes.&amp;nbsp; Imagine the savings if DNA_STR were 8400 bytes.&lt;/P&gt;
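&lt;P&gt;A scaled-down sketch of the same idea, assuming a 9-byte string and a 3-element array; the names dna, trip and dotted are illustrative, and the step relies on the same assumption as above that the array elements sit next to each other in memory:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data _null_;
  length dna $9;
  array trip {3} $3;   /* creates trip1-trip3, 3 bytes each */
  dna = 'CGGAGGACX';
  /* Copy the 9 bytes of DNA onto the memory starting at trip{1}. */
  /* Assumes trip1-trip3 are stored contiguously.                 */
  call pokelong(dna, addrlong(trip{1}), 9);
  /* Join the three elements with dots in one CATX call */
  dotted = catx('.', of trip{*});
  put dotted=;
run;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;Running this on a small string first is a cheap way to confirm the memory layout behaves as expected on your installation before applying it to long strings.&lt;/P&gt;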
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Tue, 07 Jun 2022 16:40:32 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/How-to-scan-string-for-same-repeating-pattern/m-p/816588#M322345</guid>
      <dc:creator>mkeintz</dc:creator>
      <dc:date>2022-06-07T16:40:32Z</dc:date>
    </item>
    <item>
      <title>Re: How to scan string for same repeating pattern</title>
      <link>https://communities.sas.com/t5/SAS-Programming/How-to-scan-string-for-same-repeating-pattern/m-p/816600#M322348</link>
      <description>&lt;P&gt;Here is another coding option that should work. Data WANT will contain multiple rows per single source row if there is more than one "most frequent" pattern.&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data have;
  infile datalines truncover;
  input dna_str $85.;
  datalines;
CGGAGGACXYZTCTAGGTAXYZACGCTTATCAGXYZGTCCATAGGACATXYZTCG123CTCTAGGXYZGAATCAGGTGCT12TC
CGGA456CABCTCTAGGTAABCACGCTTATCAG123GTCCATAGGACATXYZTCGGAACTCTAGGABCGAATCAG987CTTATC
;
run;

data inter;
  set have;
  row=_n_;
  length pattern $3;
  n_pattern=floor(length(dna_str)/3);
  do start=1 to n_pattern;
    /* triplet number START begins at position 3*(start-1)+1 */
    pattern=substr(dna_str,3*(start-1)+1,3);
    output;
  end;
  drop dna_str;
run;

proc freq data=inter nlevels noprint;
  table pattern /out=byvalue nopercent;
  by row;
run;

proc rank data=byvalue ties=dense descending out=want(where=(rank=1));
  by row;
  var count;
  ranks rank;
run;
&lt;/CODE&gt;&lt;/PRE&gt;</description>
      <pubDate>Mon, 06 Jun 2022 07:47:30 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/How-to-scan-string-for-same-repeating-pattern/m-p/816600#M322348</guid>
      <dc:creator>Patrick</dc:creator>
      <dc:date>2022-06-06T07:47:30Z</dc:date>
    </item>
    <item>
      <title>Re: How to scan string for same repeating pattern</title>
      <link>https://communities.sas.com/t5/SAS-Programming/How-to-scan-string-for-same-repeating-pattern/m-p/816827#M322454</link>
      <description>&lt;P&gt;I would convert the data to a long format. As you probably do not want to have a copy of the long DNA_STR in the long data, start by adding a surrogate key (ID) to your input:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;DATA have;
  INFILE DATALINES TRUNCOVER;
  INPUT DNA_STR $85.;
  ID=_N_;
DATALINES;
CGGAGGACXYZTCTAGGTAXYZACGCTTATCAGXYZGTCCATAGGACATXYZTCG123CTCTAGGXYZGAATCAGGTGCT12TC
CGGA456CABCTCTAGGTAABCACGCTTATCAG123GTCCATAGGACATXYZTCGGAACTCTAGGABCGAATCAG987CTTATC
;
RUN;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;Then, convert to long format and use PROC SUMMARY (or PROC FREQ) to get the frequencies:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data long;
  set have;
  length dna_seq $3;
  do pos=1 by 3 to length(DNA_STR);
    dna_seq=substr(DNA_STR,pos,3);
    output;
    end;
  keep ID pos dna_seq;
run; 

proc summary data=long nway;
  class ID dna_seq;
  output out=counts(rename=(_freq_=seq_count) drop=_type_);
run;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;Then you can get the values with maximum counts for each ID using SQL:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;proc sql;
  select * from counts
  group by id
  having(seq_count)=max(seq_count);
quit;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Tue, 07 Jun 2022 09:48:42 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/How-to-scan-string-for-same-repeating-pattern/m-p/816827#M322454</guid>
      <dc:creator>s_lassen</dc:creator>
      <dc:date>2022-06-07T09:48:42Z</dc:date>
    </item>
    <item>
      <title>Re: How to scan string for same repeating pattern</title>
      <link>https://communities.sas.com/t5/SAS-Programming/How-to-scan-string-for-same-repeating-pattern/m-p/816846#M322459</link>
      <description>&lt;P&gt;Thank you very much for taking the time to respond.&amp;nbsp; All of these solutions worked well.&amp;nbsp; I don't quite grasp some of the coding aspects, but I am going to step through each solution section by section and try to understand it better.&lt;/P&gt;</description>
      <pubDate>Tue, 07 Jun 2022 12:48:04 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/How-to-scan-string-for-same-repeating-pattern/m-p/816846#M322459</guid>
      <dc:creator>StanleyManning</dc:creator>
      <dc:date>2022-06-07T12:48:04Z</dc:date>
    </item>
  </channel>
</rss>

