LauChiFung
Calcite | Level 5

If I have DNA sequence data with 10,000 observations, e.g.:

1.G
2.C
3.A
4.C
....

How can I count the occurrences of the GCC pattern in this dataset?

5 REPLIES
PeterClemmensen
Tourmaline | Level 20

Just to clarify, your data looks something like this, right?

data DNA;
   input ID seq $;
   datalines;
   1 G
   2 C
   3 A
   4 C
   ;
run;

... and so on. Then you want to find the observations where a seq value of G is followed by a C and then another C, right? 🙂

LauChiFung
Calcite | Level 5
YES
Shmuel
Garnet | Level 18

The way to deal with your query depends on your data file type and format.

 

1) Assuming your data is a flat file:

filename DNA '...path and filename';
data want;
      infile DNA truncover end=eof;
      length c1-c3 $1 ;
      array cx c1-c3;
      retain c1-c3 ' '  i 0 ;
      input   na $1.;
      link check;
      keep pos c1-c3;
return;
check:
    if i < 3 then do;   /* fill the window with the first three codes */
       i+1; cx(i) = na;
    end;
    else do;            /* slide the window forward by one code */
       c1=c2; c2=c3; c3=na;
    end;
    if i = 3 then do;   /* window is full: test it, including the last three codes */
       pos = _N_-2;     /* position of 1st NA = G */
       if compress(c1||c2||c3) = 'GCC' then output;
    end;
return;
run;

2) Similarly, if the data is a SAS dataset, assuming that NA is the variable with the nucleic acid code:

data want;
      set have;
      length c1-c3 $1 ;
      array cx c1-c3;
      retain c1-c3 ' '  i 0 ;
      link check;
      keep pos c1-c3;
return;
check:
    if i < 3 then do;   /* fill the window with the first three codes */
       i+1; cx(i) = na;
    end;
    else do;            /* slide the window forward by one code */
       c1=c2; c2=c3; c3=na;
    end;
    if i = 3 then do;   /* window is full: test it, including the last three codes */
       pos = _N_-2;     /* position of 1st NA = G */
       if compress(c1||c2||c3) = 'GCC' then output;
    end;
return;
run;
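
As an alternative to the sliding window above, here is a minimal sketch that joins every code into one string and counts the pattern with the COUNT function. It assumes the dataset is named have with the codes in NA, as in option 2, and that the whole sequence fits in one character variable (at most 32,767 characters, so 10,000 codes are fine). The names all and n_gcc are made up for illustration. Overlapping matches are not a concern here because GCC cannot overlap with itself.

```sas
/* Sketch only: assumes dataset HAVE with a one-character variable NA */
data _null_;
   length all $32767;      /* hypothetical buffer for the whole sequence */
   retain all;
   set have end=eod;
   all = cats(all, na);    /* append the current code to the buffer */
   if eod then do;
      n_gcc = count(all, 'GCC');   /* number of GCC occurrences */
      put n_gcc=;                  /* write the count to the log */
   end;
run;
```

This only gives the count. If you also need the position of each match, the windowed approach is the better fit.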

 

 

 

mkeintz
PROC Star

data _null_;
  retain n_gcc 0;
  set dna end=eod;

  if lag2(seq)='G' and lag(seq)='C'  and seq='C'  then n_gcc+1;
  if eod then put n_gcc=;
run;
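
If you also want to know where each match starts, a variation on the step above could look like this. It is only a sketch: it assumes the dna dataset with the character variable seq from the example earlier in the thread, and the names gcc_hits, prev1, prev2 and pos are made up for illustration. The lagged values are assigned to variables first, so both LAG queues are updated on every row no matter how the IF condition is evaluated.

```sas
/* Sketch only: assumes dataset DNA with character variable SEQ */
data gcc_hits;                 /* hypothetical output: one row per match */
   set dna end=eod;
   prev1 = lag(seq);           /* code from the previous row */
   prev2 = lag2(seq);          /* code from two rows back */
   if prev2 = 'G' and prev1 = 'C' and seq = 'C' then do;
      pos = _n_ - 2;           /* row number of the leading G */
      n_gcc + 1;               /* running count of matches */
      output;
   end;
   if eod then put n_gcc=;     /* write the total to the log */
   keep pos;
run;
```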

 

LauChiFung
Calcite | Level 5
Thank you all.
