Hello,
I have a variable in a SAS dataset that holds a string of characters such as O-O-O-M-M-O-I-O. I would like to shorten the value by prefixing each character with the number of times it is repeated in a row. For the example above, the result should be 3O-2M-O-I-O. Any suggestions on how to do this?
Here is code that creates the sample data:
data have;
input ID 1. DNA $30.;
datalines;
1 O-O-O-O-O-O-O-O-O-O-O-O-O-O-O
2 M-M-M-M-M-M-O-I-I-I-I-O-O-M-M
3 M-M-M-M-O
4 O-I-O-I-M-O-O-O-I-I
5 O
;
run;
From the data above, I would like to produce the "WANT DNA" column:
ID | HAVE DNA | WANT DNA |
1 | O-O-O-O-O-O-O-O-O-O-O-O-O-O-O | 15O |
2 | M-M-M-M-M-M-O-I-I-I-I-O-O-M-M | 6M-O-4I-2O-2M |
3 | M-M-M-M-O | 4M-O |
4 | O-I-O-I-M-O-O-O-I-I | O-I-O-I-M-3O-2I |
5 | O | O |
Thank you!
data want;
  set have;
  length out_strand $30;
  length cur_strand cur_base $1;
  do _i = 1 to countw(dna,'-');
    cur_strand = scan(dna,_i,'-'); *identify the current base we are looking at;
    do strand_count = 0 by 1 until (cur_strand ne cur_base); *iterate over the scans to find the next nonmatch;
      cur_base = scan(dna,strand_count+_i,'-');
    end;
    *compose the output string, prepending the count if >1 and omitting it if =1;
    out_strand = catx('-',out_strand,cats(ifc(strand_count>1,strand_count,''),cur_strand));
    _i = strand_count+_i-1; *have to decrement one, since we go one past the match;
  end;
run;
This should work for what you need; basically you scan over the string and keep scanning until you reach a non-match.
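For comparison, here is a rough single-pass sketch of the same run-length idea. It is not the code posted above: instead of scanning ahead, it compares each token with the previous one and writes a run out when the value changes. The names want_single_pass, want_dna, prev, cur and run_len are made up for this illustration, and the $30 length simply mirrors the sample data.

```sas
data want_single_pass (drop=_i prev cur run_len);
  set have;
  length want_dna $30 prev cur $1;   /* hypothetical names, reset to missing on every row */
  run_len = 0;
  do _i = 1 to countw(dna, '-');
    cur = scan(dna, _i, '-');
    /* the value changed, so the run of PREV has ended: write it out */
    if _i > 1 and cur ne prev then do;
      want_dna = catx('-', want_dna, cats(ifc(run_len > 1, cats(run_len), ''), prev));
      run_len = 0;
    end;
    prev = cur;
    run_len = run_len + 1;
  end;
  /* flush the final run (skipped when DNA is blank) */
  if run_len > 0 then
    want_dna = catx('-', want_dna, cats(ifc(run_len > 1, cats(run_len), ''), prev));
run;
```

On the sample data this should reproduce the WANT DNA column from the question. If the real strings are longer than the sample, want_dna would need a larger length.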
data finale(drop= i j k count);
set have;
length want_dna $200; *This needs to be set to the maximum length;
do i=1 to n;
count=0;
key=substr(DNA,i,1);
do j=i to n;
if i=j and i ne 1 then do;
do k=1 to k=i-1;
if key=substr(DNA,k,1) then goto skip;
end;
end;
if key=substr(DNA,j,1) then count=+1;
end;
if i=1 then do;
if count gt 1 then WANT_DNA=key||'-'||put(count,best12.);
else WANT_DNA=key;
end;
else do;
if count gt 1 then WANT_DNA=WANT_DNA||'-'||key||'-'||put(count,best12.);
else WANT_DNA=WANT_DNA||'-'||key;
end;
skip:
end;
run;
I have not tested this code; I will update it.
Thank you @Satish_Parida! I did get an error:
Variable n is uninitialized.
ERROR: Invalid DO loop control information, either the INITIAL or TO expression is missing or
the BY expression is missing, zero, or invalid.
However, the code by @snoopy369 worked!
Thank you to you both!!
I absolutely love this question.
data have;
infile cards expandtabs;
input ID 1. DNA $30.;
cards;
1 O-O-O-O-O-O-O-O-O-O-O-O-O-O-O
2 M-M-M-M-M-M-O-I-I-I-I-O-O-M-M
3 M-M-M-M-O
4 O-I-O-I-M-O-O-O-I-I
5 O
;
run;
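/* Step 1: split each string into one row per base */
data temp;
  set have;
  do i=1 to countw(dna,'-');
    value=scan(dna,i,'-');
    output;
  end;
  drop i dna;
run;
/* Step 2: count each run of consecutive identical values. */
/* NOTSORTED starts a new BY group whenever ID or VALUE changes. */
proc summary data=temp;
  by id value notsorted;
  output out=temp1;
run;
/* Step 3: rebuild one string per ID, prefixing the count when it exceeds 1 */
data want;
  length want $ 200;
  do until(last.id);
    set temp1;
    by id;
    if _freq_=1 then want=catx('-',want,value);
    else want=catx('-',want,cats(_freq_,value));
  end;
  drop _type_ _freq_ value;
run;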
proc print noobs;run;