If I have 10,000 DNA sequence records, e.g.:
1.G
2.C
3.A
4.C
....
....
How can I count the occurrences of the GCC pattern in this dataset?
Just to clarify, your data looks something like this, right?
data DNA;
input ID seq $;
datalines;
1 G
2 C
3 A
4 C
;
... and so on. Then you want the observations where a seq value of G is followed by a C and then another C, right? 🙂
The way to deal with your query depends on your data file type and format.
1) assuming your data is a flat file then:
filename DNA '...path and filename';
data want;
  infile DNA truncover end=eof;
  length c1-c3 $1;
  array cx c1-c3;
  retain c1-c3 ' ' i 0;
  input na $1.;
  link check;
  keep pos c1-c3;
  return;
check:
  if i < 3 then do;      /* fill the 3-character window */
    i+1; cx(i) = na;
  end;
  else do;               /* slide the window forward by one */
    c1=c2;
    c2=c3;
    c3=na;
  end;
  /* test the window after it is updated, so the last triple is not missed */
  if i = 3 and compress(c1||c2||c3) = 'GCC' then do;
    pos = _N_-2; /* position of 1st NA = G */
    output;
  end;
  return;
run;
2) Similarly, if the data is a SAS dataset, the code would be as follows,
assuming that NA is the variable holding the nucleic acid code:
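data want;
  set have;
  length c1-c3 $1;
  array cx c1-c3;
  retain c1-c3 ' ' i 0;
  link check;
  keep pos c1-c3;
  return;
check:
  if i < 3 then do;      /* fill the 3-character window */
    i+1; cx(i) = na;
  end;
  else do;               /* slide the window forward by one */
    c1=c2; c2=c3; c3=na;
  end;
  /* test the window after it is updated, so the last triple is not missed */
  if i = 3 and compress(c1||c2||c3) = 'GCC' then do;
    pos = _N_-2; /* position of 1st NA = G */
    output;
  end;
  return;
run;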
data _null_;
retain n_gcc 0;
set dna end=eod;
if lag2(seq)='G' and lag(seq)='C' and seq='C' then n_gcc+1;
if eod then put n_gcc=;
run;
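If you want to sanity-check the LAG approach, here is a minimal sketch on a made-up eight-row sample. The dataset name dna_small and the variables prev1 and prev2 are my own and do not come from the posts above. It assigns the lagged values on every row before testing them, which keeps the LAG queues in step regardless of how the compound IF condition is evaluated. Treat that as a defensive habit rather than a correction to the step above.

```sas
/* Hypothetical sample data: GCC starts at rows 1 and 5 */
data dna_small;
  input ID seq $;
  datalines;
1 G
2 C
3 C
4 A
5 G
6 C
7 C
8 C
;
run;

data _null_;
  set dna_small end=eod;
  /* Call the lag functions unconditionally so each queue sees every row */
  prev1 = lag(seq);
  prev2 = lag2(seq);
  /* The sum statement initializes n_gcc to 0 and retains it across rows */
  if prev2 = 'G' and prev1 = 'C' and seq = 'C' then n_gcc + 1;
  if eod then put n_gcc=;
run;
```

For this sample the log should show n_gcc=2. Once that checks out, point the second step at your real dataset.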