Re: Find some sequence in DNA

LauChiFung · Posted 02-11-2017 12:57 AM

Suppose I have DNA sequence data with 10,000 observations, e.g.:

1.G

2.C

3.A

4.C

....

How can I find the number of GCC patterns in this dataset?

PeterClemmensen · Posted 02-11-2017 04:38 AM

Just to clarify, your data looks something like this, right?

data DNA;
   input ID seq $;
   datalines;
   1 G
   2 C
   3 A 
   4 C
   ;

... and so on. Then you want the observations where a seq value of G is followed by a C and then another C, right? 🙂


LauChiFung · Posted 02-12-2017 03:03 AM

YES

Shmuel · Posted 02-11-2017 05:45 AM

The way to deal with your query depends on your data file type and format.

1) Assuming your data is a flat file with one base per line:

filename DNA '...path and filename';
data want;
      infile DNA truncover end=eof;
      length c1-c3 $1 ;
      array cx c1-c3;
      retain c1-c3 ' '  i 0 ;
      input   na $1.;
      link check;
      keep pos c1-c3;
return;
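check:
    /* fill the 3-base window first, then slide it by one base per record */
    if i < 3 then do;
       i+1; cx (i) = na;
    end;
    else do;
          c1=c2;
          c2=c3;
          c3=na;
    end;
    if i = 3 then do;   /* window is full: test it, including the last one */
          pos = _N_-2 ;  /* position of 1st NA = G */
          if compress(c1||c2||c3) = 'GCC' 
            then output;
    end;
return;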
run;

2) Similarly, if the data is a SAS dataset, and assuming that NA is the variable holding the nucleic acid code, the code would be:

data want;
      set have;
      length c1-c3 $1 ;
      array cx c1-c3;
      retain c1-c3 ' '  i 0 ;      
      link check;
      keep pos c1-c3;
return;
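check:
    /* fill the 3-base window first, then slide it by one base per observation */
    if i < 3 then do;
       i+1; cx (i) = na;
    end;
    else do;
          c1=c2; c2=c3; c3=na;
    end;
    if i = 3 then do;   /* window is full: test it, including the last one */
          pos = _N_-2;  /* position of 1st NA = G */
          if compress(c1||c2||c3) = 'GCC' then output;
    end;
return;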
run;

mkeintz · Posted 02-11-2017 08:29 AM

data _null_;
   retain n_gcc 0;
   set dna end=eod;
   if lag2(seq)='G' and lag(seq)='C' and seq='C' then n_gcc+1;
   if eod then put n_gcc=;
run;
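
If the positions are wanted as well as the count, the same LAG idea can write one row per match. This is only a sketch under assumptions: the sample data is made up, the output dataset name gcc_positions and the variables prev1, prev2 and pos are hypothetical, and ID is assumed to be the 1-based position of each base. LAG is called on every row, outside the IF, so its queues stay aligned with the data.

```sas
/* Hypothetical sample data: one base per observation */
data dna;
   input ID seq $;
   datalines;
1 G
2 C
3 C
4 A
5 G
6 C
7 C
;

data gcc_positions;
   set dna;
   /* call LAG unconditionally so each queue advances once per row */
   prev1 = lag(seq);    /* base one row back  */
   prev2 = lag2(seq);   /* base two rows back */
   if prev2 = 'G' and prev1 = 'C' and seq = 'C' then do;
      pos = ID - 2;     /* position of the leading G, assuming ID is the row position */
      output;
   end;
   keep pos;
run;
```

With this sample the pattern starts at positions 1 and 5, so gcc_positions should hold two rows; the number of rows is the GCC count.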


LauChiFung · Posted 02-11-2017 06:35 PM

Thank you all.
