I have an input file containing a sequence in FASTA format. A FASTA record begins with a single-line description, followed by lines of sequence data. The description line (defline) is distinguished from the sequence data by a greater-than (">") symbol at the beginning and is shorter than 80 characters. The sequence data is split into lines of 50 characters each, spanning multiple lines and extending up to 1400 characters.

>gi|5524211 gb AAD44166.1 cytochrome b
LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFW
GATVITNLFSAIPYIGTNLVEWIWGGFSVDKATLNRFFAFHFILPFTMVA
LAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLGLLILILLLLL
LALLSPDMLGDPDNHMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGV
LALFLSIVILGLMPFLHTSKHRSMMLRPLSQALFWTLTMDLLTLTWIGSQ
PVEYPYTIIGQMASILYFSIILAFLPIAGXIENY

My question: When I read the input file into a dataset, I created two columns, "Desc" and "Sequence". I need my dataset to have a single row holding one Desc value and one complete Sequence value, but the sequence is getting split across multiple rows, as shown below. I am looking for help either removing the line breaks (CR/LF) as I create the dataset or concatenating the rows after the dataset is created. Please help.

Obs  Desc                                   Sequence
------------------------------------------------------------------------------------------------
  1  gi|5524211 gb AAD44166.1 cytochrome b  LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFW
  2                                         GATVITNLFSAIPYIGTNLVEWIWGGFSVDKATLNRFFAFHFILPFTMVA
  3                                         LAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLGLLILILLLLL
  4                                         LALLSPDMLGDPDNHMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGV
  5                                         LALFLSIVILGLMPFLHTSKHRSMMLRPLSQALFWTLTMDLLTLTWIGSQ
  6                                         PVEYPYTIIGQMASILYFSIILAFLPIAGXIENY
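To make the desired result concrete, here is a minimal sketch of the concatenation logic, written in Python purely for illustration. It is not the tool used to build the dataset above, and `read_fasta_records` is a hypothetical helper name, not a library function. It assumes a line starting with ">" opens a new record and that every following non-blank line belongs to that record's sequence.

```python
def read_fasta_records(lines):
    """Return one (desc, sequence) pair per FASTA record.

    Hypothetical helper for illustration: wrapped sequence lines are
    joined into a single string so each record becomes one row.
    """
    records = []
    desc = None      # description of the record currently being read
    chunks = []      # sequence lines collected for that record
    for raw in lines:
        line = raw.strip()          # drops trailing CR/LF and stray spaces
        if not line:
            continue                # ignore blank lines
        if line.startswith(">"):
            if desc is not None:    # close the previous record, if any
                records.append((desc, "".join(chunks)))
            desc = line[1:].strip() # keep the defline without the ">"
            chunks = []
        elif desc is not None:
            chunks.append(line)     # another wrapped piece of the sequence
    if desc is not None:            # close the last record
        records.append((desc, "".join(chunks)))
    return records


sample = """>gi|5524211 gb AAD44166.1 cytochrome b
LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFW
GATVITNLFSAIPYIGTNLVEWIWGGFSVDKATLNRFFAFHFILPFTMVA
LAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLGLLILILLLLL
LALLSPDMLGDPDNHMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGV
LALFLSIVILGLMPFLHTSKHRSMMLRPLSQALFWTLTMDLLTLTWIGSQ
PVEYPYTIIGQMASILYFSIILAFLPIAGXIENY
"""

records = read_fasta_records(sample.splitlines())
for desc, sequence in records:
    print(desc)
    print(sequence)
```

The same idea should carry over to whatever builds the dataset: hold on to the description and the growing sequence across input lines, and write a row only when the next ">" line or the end of the file is reached.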