LauChiFung
Calcite | Level 5

If I have 10,000 DNA sequence records, e.g.:

1. G
2. C
3. A
4. C
...

How can I find the number of GCC patterns in this data set?

5 REPLIES
PeterClemmensen
Tourmaline | Level 20

Just to clarify, your data looks something like this, right?

data DNA;
   input ID seq $;
   datalines;
   1 G
   2 C
   3 A 
   4 C
   ;

...and so on. Then you want the observations where a seq value of G is followed by a C and then another C, right? 🙂

LauChiFung
Calcite | Level 5
YES
Shmuel
Garnet | Level 18

How to handle your query depends on your data file type and format.

1) Assuming your data is a flat file:

filename DNA '...path and filename';
data want;
      infile DNA truncover end=eof;
      length c1-c3 $1 ;
      array cx c1-c3;
      retain c1-c3 ' '  i 0 ;
      input   na $1.;
      link check;
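      keep pos c1-c3;
return;

check:
    /* Fill the 3-base window first, then slide it by one base per record */
    if i < 3 then do;
       i+1; cx(i) = na;
    end;
    else do;
       c1=c2; c2=c3; c3=na;
    end;
    /* Once the window holds three bases, test it */
    if i = 3 then do;
       pos = _N_-2;  /* position of 1st NA = G */
       if compress(c1||c2||c3) = 'GCC' then output;
    end;
return;
run;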

2) Similarly, if the data is a SAS data set, and assuming that NA is the variable holding the nucleic acid code:

data want;
      set have;
      length c1-c3 $1 ;
      array cx c1-c3;
      retain c1-c3 ' '  i 0 ;      
      link check;
      keep pos c1-c3;
return;
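check:
    /* Fill the 3-base window first, then slide it by one base per observation */
    if i < 3 then do;
       i+1; cx(i) = na;
    end;
    else do;
       c1=c2; c2=c3; c3=na;
    end;
    /* Once the window holds three bases, test it */
    if i = 3 then do;
       pos = _N_-2;  /* position of 1st NA = G */
       if compress(c1||c2||c3) = 'GCC' then output;
    end;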
return;
run;
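
To try the data set version, a small input can be built first. This is only a sketch: the names `have` and `na` follow the assumptions stated above, and the eight sample bases are made up for illustration. Run the first step before the `data want` step, and the PROC SQL step after it.

```sas
/* Hypothetical test data: one base per observation, in sequence order */
data have;
   input na $1.;
   datalines;
G
C
C
A
G
C
C
T
;
run;

/* After running the DATA WANT step: each row of WANT is one GCC match, */
/* so the row count is the number of matches                            */
proc sql;
   select count(*) as n_gcc from want;
quit;
```

For this sample the count should be 2, since GCC starts at the 1st and at the 5th base.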

mkeintz
PROC Star

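data _null_;
  retain n_gcc 0;
  set dna end=eod;

  /* Call the LAG functions on every observation, outside the IF condition, */
  /* so that both queues are always updated                                 */
  prev1 = lag(seq);
  prev2 = lag2(seq);

  if prev2='G' and prev1='C' and seq='C' then n_gcc+1;
  if eod then put n_gcc=;
run;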
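
If the starting positions are needed as well as the count, the same LAG idea can write one row per match. This is a sketch that assumes the `DNA` data set with `ID` and `seq` from earlier in the thread; `gcc_hits`, `start_id`, `prev1` and `prev2` are names made up for illustration.

```sas
data gcc_hits (keep=start_id);
   set dna;

   /* Call LAG on every observation so the queues stay in step with the data */
   prev1    = lag(seq);    /* base one observation back  */
   prev2    = lag2(seq);   /* base two observations back */
   start_id = lag2(id);    /* ID of the observation two back, i.e. where the triple starts */

   /* Current base closes a G-C-C triple: write its starting ID */
   if prev2 = 'G' and prev1 = 'C' and seq = 'C' then output;
run;
```

The number of observations in `gcc_hits` is then the number of GCC occurrences, and `start_id` gives the ID of the G in each one.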

 

LauChiFung
Calcite | Level 5
Thank you all.
