Hello,
I have a variable in a SAS dataset that holds a string of characters such as O-O-O-M-M-O-I-O. I would like to shorten the value by prefixing each character with the number of times it is repeated in a row. For the above example, the result should be 3O-2M-O-I-O. Any suggestions on how to do this?
Here is code you can use to create sample data:
data have;
input ID 1. DNA $30.;
datalines;
1 O-O-O-O-O-O-O-O-O-O-O-O-O-O-O
2 M-M-M-M-M-M-O-I-I-I-I-O-O-M-M
3 M-M-M-M-O
4 O-I-O-I-M-O-O-O-I-I
5 O
;
run;
From the data above, I would like to produce the "WANT DNA" column:
ID | HAVE DNA | WANT DNA |
1 | O-O-O-O-O-O-O-O-O-O-O-O-O-O-O | 15O |
2 | M-M-M-M-M-M-O-I-I-I-I-O-O-M-M | 6M-O-4I-2O-2M |
3 | M-M-M-M-O | 4M-O |
4 | O-I-O-I-M-O-O-O-I-I | O-I-O-I-M-3O-2I |
5 | O | O |
Thank you!
data want;
set have;
length out_strand $30;
length cur_strand cur_base $1;
do _i = 1 to countw(dna);
cur_strand = scan(dna,_i,'-'); *identify the current base we are looking at;
do strand_count = 0 by 1 until (cur_strand ne cur_base); *iterate over the scans to find the next nonmatch;
cur_base = scan(dna,strand_count+_i,'-');
end;
*compose the output string, checking to see if we need to append the number if >1 or not if =1;
out_strand = catx('-',out_strand,cats(ifc(strand_count>1,strand_count,''),cur_strand));
_i = strand_count+_i-1; *have to decrement one, since we go one past the match;
end;
run;
This should work for what you need; basically you scan over the string and keep scanning until you reach a non-match.
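The same scan-until-mismatch idea can also be written as a single pass that compares each token with the previous one and writes out a run whenever the base changes. The following is only a sketch under the assumption that every token is a single character, as in the sample data; the names want_single_pass, n_tokens, run_len, base, and prev are illustrative and not part of the original answer.

```sas
data want_single_pass;
set have;
length out_strand $30 base prev $1;
n_tokens = countw(dna, '-');
run_len = 0;
/* loop one step past the last token so the final run is written out */
do _i = 1 to n_tokens + 1;
if _i <= n_tokens then base = scan(dna, _i, '-');
else base = ' ';
/* the base changed (or the string ended): append the finished run */
if _i > 1 and base ne prev then do;
out_strand = catx('-', out_strand, cats(ifc(run_len > 1, cats(run_len), ''), prev));
run_len = 0;
end;
run_len = run_len + 1;
prev = base;
end;
drop _i n_tokens run_len base prev;
run;
```

Because a run of two or more tokens always encodes to fewer characters than it started with, out_strand should never need to be longer than DNA itself.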
data finale(drop= i j k count);
set have;
length want_dna $200; *This needs to be set to a maximum length;
do i=1 to n;
count=0;
key=substr(DNA,i,1);
do j=i to n;
if i=j and i ne 1 then do;
do k=1 to k=i-1;
if key=substr(DNA,k,1) then goto skip;
end;
end;
if key=substr(DNA,j,1) then count=+1;
end;
if i=1 then do;
if count gt 1 then WANT_DNA=key||'-'||put(count,best12.);
else WANT_DNA=key;
end;
else do;
if count gt 1 then WANT_DNA=WANT_DNA||'-'||key||'-'||put(count,best12.);
else WANT_DNA=WANT_DNA||'-'||key;
end;
skip:
end;
run;
I have not tested this code; I will update it.
Thank you @Satish_Parida! I did get an error:
Variable n is uninitialized.
ERROR: Invalid DO loop control information, either the INITIAL or TO expression is missing or
the BY expression is missing, zero, or invalid.
However, the code by @snoopy369 worked!
Thank you to you both!!
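For context on the error above: the log indicates that n is never assigned a value in the posted step, so it is missing when the DO statement evaluates its TO expression. Here is a minimal sketch of assigning a loop bound first, assuming the intent was to walk the dash-separated tokens. The variable token is illustrative, and this does not repair the rest of that step.

```sas
data _null_;
set have;
length token $1;
n = countw(dna, '-'); /* assign the loop bound before the DO statement uses it */
do i = 1 to n;
token = scan(dna, i, '-');
put id= i= token=; /* write each token to the log for inspection */
end;
run;
```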
I absolutely love this question.
data have;
infile cards expandtabs;
input ID 1. DNA $30.;
cards;
1 O-O-O-O-O-O-O-O-O-O-O-O-O-O-O
2 M-M-M-M-M-M-O-I-I-I-I-O-O-M-M
3 M-M-M-M-O
4 O-I-O-I-M-O-O-O-I-I
5 O
; run;
data temp;
set have;
do i=1 to countw(dna,'-');
value=scan(dna,i,'-');output;
end;
drop i dna;
run;
proc summary data=temp ;
by id value notsorted;
output out=temp1;
run;
data want;
length want $ 200;
do until(last.id);
set temp1;
by id;
if _freq_=1 then want=catx('-',want,value);
else want=catx('-',want,cats(_freq_,value));
end;
drop _type_ _freq_ value;
run;
proc print noobs;run;