If I have 10000 DNA sequence records, e.g.:
1.G
2.C
3.A
4.C
....
....
How can I find the number of GCC patterns in this dataset?
Just to clarify, your data looks something like this, right?
data DNA;
input ID seq $;
datalines;
1 G
2 C
3 A
4 C
;
... and so on. Then you want the observations where a seq value of G is followed by a C and then another C, right? 🙂
The way to deal with your query depends on your data file type and format.
1) Assuming your data is a flat file with one code per record:
filename DNA '...path and filename';
data want;
  infile DNA truncover;
  length c1-c3 $1;
  retain c1-c3 ' ';
  input na $1.;  /* assumes the one-letter code sits in column 1 */
  /* slide the window so c1-c3 hold the last three codes read */
  c1 = c2;
  c2 = c3;
  c3 = na;
  if _N_ >= 3 and compress(c1||c2||c3) = 'GCC' then do;
    pos = _N_ - 2;  /* position of 1st NA = G */
    output;
  end;
  keep pos c1-c3;
run;
2) Similarly, if the data is a SAS dataset, and assuming that NA is the variable holding the nucleic acid code, the code would be:
data want;
  set have;
  length c1-c3 $1;
  retain c1-c3 ' ';
  /* slide the window so c1-c3 hold the last three NA values */
  c1 = c2;
  c2 = c3;
  c3 = na;
  if _N_ >= 3 and compress(c1||c2||c3) = 'GCC' then do;
    pos = _N_ - 2;  /* position of 1st NA = G */
    output;
  end;
  keep pos c1-c3;
run;
data _null_;
retain n_gcc 0;
set dna end=eod;
if lag2(seq)='G' and lag(seq)='C' and seq='C' then n_gcc+1;
if eod then put n_gcc=;
run;
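A quick way to sanity-check the LAG approach is to run it on a tiny data set where the answer is known. The sketch below assumes the ID/seq layout from the earlier reply; the eight sample rows are made up for illustration, and GCC occurs twice in them (starting at rows 1 and 6). The lags are assigned to helper variables (prev1 and prev2, hypothetical names) before the test, so that both LAG queues are updated on every observation regardless of how the condition is evaluated.

```sas
/* made-up sample: GCC starts at row 1 and at row 6 */
data dna;
  input ID seq $;
  datalines;
1 G
2 C
3 C
4 A
5 T
6 G
7 C
8 C
;
run;

data _null_;
  set dna end=eod;
  prev2 = lag2(seq);  /* value two rows back */
  prev1 = lag(seq);   /* value one row back  */
  if prev2 = 'G' and prev1 = 'C' and seq = 'C' then n_gcc + 1;
  if eod then put n_gcc=;  /* the log should show n_gcc=2 */
run;
```

If the count on your real data looks off, checking it against a small sample like this first usually shows whether the problem is in the logic or in how the file was read.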