asiddiqui
Calcite | Level 5

I have an input file containing a sequence in FASTA format, which begins with a single-line description followed by lines of sequence data.

The description line (defline) is distinguished from the sequence data by a greater-than (">") symbol at the beginning and is shorter than 80 characters.

The sequence data is split across multiple lines of up to 50 characters each, extending up to 1400 characters in total.

>gi|5524211 gb AAD44166.1 cytochrome b

LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFW

GATVITNLFSAIPYIGTNLVEWIWGGFSVDKATLNRFFAFHFILPFTMVA

LAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLGLLILILLLLL

LALLSPDMLGDPDNHMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGV

LALFLSIVILGLMPFLHTSKHRSMMLRPLSQALFWTLTMDLLTLTWIGSQ

PVEYPYTIIGQMASILYFSIILAFLPIAGXIENY

My question: When I read the input file into a dataset, I create two columns, "Desc" and "Sequence". I need my dataset to have one row, with one Desc value and the complete Sequence value, but the sequence is getting split across multiple rows as shown below. I am looking for help either stripping the line breaks (CR/LF) as I create the dataset or concatenating the rows after the dataset is created. PLEASE HELP

Obs           Desc                                                                    Sequence

-------------------------------------------------------------------------------------------------------------------------

1          gi|5524211 gb AAD44166.1 cytochrome b          LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFW

2                                                                                GATVITNLFSAIPYIGTNLVEWIWGGFSVDKATLNRFFAFHFILPFTMVA

3                                                                                LAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLGLLILILLLLL

4                                                                                LALLSPDMLGDPDNHMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGV

5                                                                                LALFLSIVILGLMPFLHTSKHRSMMLRPLSQALFWTLTMDLLTLTWIGSQ

6                                                                                PVEYPYTIIGQMASILYFSIILAFLPIAGXIENY

Accepted Solutions
PGStats
Opal | Level 21

Ok. Now tested with Mike's (thanks Mike) fake data (with a single DESC) :

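data want;
length desc $80 sequence $2000;
/* Read every record within one DATA step iteration so a single observation is written at the end */
do until (eof);
     infile "&sasforum.\datasets\fasta PG.txt"  end=eof lrecl=1000;
     input;
     /* The defline starts with ">"; keep everything after that character */
     if char(_infile_,1) = '>' then desc = substr(_infile_,2);
     /* Any other line is sequence data; append it without blanks */
     else sequence = cats(sequence, _infile_);
     end;
run;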

PG


6 REPLIES
PGStats
Opal | Level 21

If, as you imply, there is only one desc per file, and thus, your dataset should contain only one observation, then this should do (untested) :

data want;
length desc $80 sequence $2000;
do until (eof);
     infile "yourFastaFile.xxx"  end=eof;
     input;
     if char(_infile_,1) = '>' then desc = substr(_infile_,2,80);
     else sequence = cats(sequence, _infile_);
     end;
run;

PG
asiddiqui
Calcite | Level 5

Thanks y'all, both responses work, but the sequence is only being read up to 107 characters and not beyond. My input file has 1302 sequence characters.

MikeZdeb
Rhodochrosite | Level 12

hi ... can you post a portion of your data?

asiddiqui
Calcite | Level 5

Thank you PGStats and MikeZdeb. Your code works perfectly as intended with Mike's and my dummy input files, but when I run it on my actual file (image below), PROC PRINT does not show all the sequences.

I was not able to figure out why. Then it struck me that it might be something with my PROC PRINT output settings, so I used ODS to write the output to a PDF file. This time it showed all my sequences, but with spaces between the different sequences... hmm. I then used ODS HTML and boom, all looks good (but I can't explain why).

THANK YOU PGStats and MikeZdeb for your help and valuable time. Love this forum.

Input file

Incorrect output with spaces with ods pdf
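
A possible explanation, offered as an assumption rather than something confirmed in this thread: the listing and PDF destinations have to fit a very long character value into a limited line or page width, so the displayed value may be truncated or wrapped even though the stored value is complete. A minimal sketch for checking the stored value directly, assuming the dataset is named want as in the code above (seq_len is a hypothetical variable name):

```sas
/* Sketch: report the stored length of each sequence in the log, */
/* independent of how any ODS destination renders it.            */
data _null_;
    set want;
    seq_len = length(sequence);   /* seq_len is a hypothetical name */
    put desc= seq_len=;
run;
```

If seq_len matches the expected 1302 characters, the dataset itself is fine and the difference lies only in how each destination displays the value.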

MikeZdeb
Rhodochrosite | Level 12

hi ... if there is more than one DESC per file, I think this will work (at least it works with the attached fake data) ...

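data slowa;
infile 'z:\fasta.txt' end=done;
length desc $100 sequence $1400;
do _n_=1 by 1 until (done);
   /* Load the next record and hold it so _infile_ can be inspected */
   input @;
   if char(_infile_,1) eq '>' then do;
       /* A new defline closes the previous record, except on the very first line */
       if _n_ ne 1 then output;
       desc = substr(_infile_,2);
       call missing(sequence);
   end;
   else sequence = cats(sequence,_infile_);
   /* Release the held record */
   input;
end;
/* Write the last record */
output;
run;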
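
Since the attached fake data is not shown here, the following is a rough sketch of one way to try the multi-record logic. The fileref fa, the dataset names check and the record contents are all made up for illustration; the second step repeats the loop above unchanged apart from reading from the temporary file.

```sas
/* Hypothetical two-record FASTA test file written to a temporary fileref */
filename fa temp;

data _null_;
    file fa;
    put '>seq1 first fake record';
    put 'LCLYTHIGRN';
    put 'IYYGSYLYSE';
    put '>seq2 second fake record';
    put 'GATVITNLFS';
run;

/* Same loop as above, reading from the temporary file instead of z:\fasta.txt */
data check;
    infile fa end=done;
    length desc $100 sequence $1400;
    do _n_ = 1 by 1 until (done);
        input @;                              /* hold the record for inspection */
        if char(_infile_, 1) eq '>' then do;
            if _n_ ne 1 then output;          /* close the previous record */
            desc = substr(_infile_, 2);
            call missing(sequence);
        end;
        else sequence = cats(sequence, _infile_);
        input;                                /* release the held record */
    end;
    output;                                   /* write the last record */
run;

/* Inspect the result in the log: one line per DESC with its joined sequence */
data _null_;
    set check;
    put desc= sequence=;
run;
```

With this test file the expectation is one observation per defline, each carrying its sequence lines joined into a single value.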
