# how to count and output specific number of strings in a variable

Hello,

I have a variable in a SAS dataset that holds a string of characters such as O-O-O-M-M-O-I-O. I would like to shorten the value by prefixing each character with the number of times it repeats consecutively. For the example above, the result would be 3O-2M-O-I-O. Any suggestions on how to do this?

Here is code to create sample data:

```
data have;
  input ID 1. DNA $30.;
  datalines;
1 O-O-O-O-O-O-O-O-O-O-O-O-O-O-O
2 M-M-M-M-M-M-O-I-I-I-I-O-O-M-M
3 M-M-M-M-O
4 O-I-O-I-M-O-O-O-I-I
5 O
;
run;
```

What I would like for the above is to get the "want DNA" column.

| ID | HAVE DNA | WANT DNA |
|----|----------|----------|
| 1 | O-O-O-O-O-O-O-O-O-O-O-O-O-O-O | 15O |
| 2 | M-M-M-M-M-M-O-I-I-I-I-O-O-M-M | 6M-O-4I-2O-2M |
| 3 | M-M-M-M-O | 4M-O |
| 4 | O-I-O-I-M-O-O-O-I-I | O-I-O-I-M-3O-2I |
| 5 | O | O |

Thank you!

Accepted Solutions
Solution
01-31-2018 12:14 PM
Super Contributor
Posts: 320

## Re: how to count and output specific number of strings in a variable

```
data want;
  set have;
  length out_strand $30;
  length cur_strand cur_base $1;

  do _i = 1 to countw(dna);
    cur_strand = scan(dna,_i,'-');  *identify the current base we are looking at;

    do strand_count = 0 by 1 until (cur_strand ne cur_base);  *iterate over the scans to find the next nonmatch;
      cur_base = scan(dna,strand_count+_i,'-');
    end;

    *compose the output string, checking to see if we need to append the number if >1 or not if =1;
    out_strand = catx('-',out_strand,cats(ifc(strand_count>1,strand_count,''),cur_strand));

    _i = strand_count+_i-1;  *have to decrement one, since we go one past the match;
  end;
run;
```

This should work for what you need; basically you scan over the string and keep scanning until you reach a non-match.
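
To see the scan-until-mismatch idea at work, here is a minimal sketch that runs the same loop on the single example string from the question and traces each run to the log. It is only an illustration, not a replacement for the code above; the `data _null_` wrapper and the `put` trace are additions assumed here for demonstration.

```sas
/* Sketch: the same scan-until-mismatch loop applied to one literal string. */
data _null_;
  length dna out_strand $30 cur_strand cur_base $1;
  dna = 'O-O-O-M-M-O-I-O';   /* example string from the question */

  do _i = 1 to countw(dna, '-');
    cur_strand = scan(dna, _i, '-');

    /* on exit, strand_count equals the length of the current run */
    do strand_count = 0 by 1 until (cur_strand ne cur_base);
      cur_base = scan(dna, strand_count + _i, '-');
    end;

    out_strand = catx('-', out_strand,
                      cats(ifc(strand_count > 1, strand_count, ''), cur_strand));

    /* trace each run in the log (demonstration only) */
    put _i= cur_strand= strand_count= out_strand=;

    _i = strand_count + _i - 1;   /* skip the rest of the run */
  end;
run;
```

After the last iteration, `out_strand` should read 3O-2M-O-I-O, the value requested in the question.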

All Replies

Frequent Contributor
Posts: 112

## Re: how to count and output specific number of strings in a variable

```
data finale(drop= i j k count);
  set have;
  length want_dna $200;    *This needs to be set to a maximum length;
  do i=1 to n;
    count=0;
    key=substr(DNA,i,1);
    do j=i to n;
      if i=j and i ne 1 then do;
        do k=1 to k=i-1;
          if key=substr(DNA,k,1) then goto skip;
        end;
      end;
      if key=substr(DNA,j,1) then count=+1;
    end;

    if i=1 then do;
      if count gt 1 then WANT_DNA=key||'-'||put(count,best12.);
      else WANT_DNA=key;
    end;
    else do;
      if count gt 1 then WANT_DNA=WANT_DNA||'-'||key||'-'||put(count,best12.);
      else WANT_DNA=WANT_DNA||'-'||key;
    end;
    skip:
  end;
run;
```

I have not tested this code; I will update it.

Contributor
Posts: 43

## Re: how to count and output specific number of strings in a variable

Thank you @Satish_Parida! I did get an error:

Variable n is uninitialized.

ERROR: Invalid DO loop control information, either the INITIAL or TO expression is missing or the BY expression is missing, zero, or invalid.

However, the code by @snoopy369 worked!

Thank you to you both!!
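
For anyone hitting the same log message: `n` is never assigned in that data step, so the `TO` bound of `do i=1 to n` is missing. Below is a minimal sketch of a character-by-character variant that sets `n` from the string length and counts consecutive repeats. It is not the poster's intended fix. It assumes every base is a single character, bases are separated by single dashes, and `DNA` has no leading whitespace. The variable `prev` is a name introduced here for illustration.

```sas
/* Sketch: character-based run counting, under the assumptions stated above. */
data finale(drop=i n key prev count);
  set have;
  length want_dna $200 key prev $1;

  n = length(DNA);            /* the posted code never assigned n */
  prev = substr(DNA, 1, 1);   /* base of the run currently being counted */
  count = 0;

  do i = 1 to n by 2;         /* step 2 skips the dashes (assumes one-letter bases) */
    key = substr(DNA, i, 1);
    if key = prev then count = count + 1;
    else do;
      /* the run of prev has ended: append it, with its count if longer than 1 */
      want_dna = catx('-', want_dna, cats(ifc(count > 1, cats(count), ''), prev));
      prev = key;
      count = 1;
    end;
  end;

  /* append the final run */
  want_dna = catx('-', want_dna, cats(ifc(count > 1, cats(count), ''), prev));
run;
```

Run against the sample `have` data, this should reproduce the WANT DNA values from the question, provided the assumptions hold.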

Super User
Posts: 10,846

## Re: how to count and output specific number of strings in a variable

I absolutely love this question.

```
data have;
  infile cards expandtabs;
  input ID 1. DNA $30.;
  cards;
1 O-O-O-O-O-O-O-O-O-O-O-O-O-O-O
2 M-M-M-M-M-M-O-I-I-I-I-O-O-M-M
3 M-M-M-M-O
4 O-I-O-I-M-O-O-O-I-I
5 O
;
run;

data temp;
  set have;
  do i=1 to countw(dna,'-');
    value=scan(dna,i,'-');
    output;
  end;
  drop i dna;
run;

proc summary data=temp;
  by id value notsorted;
  output out=temp1;
run;

data want;
  length want $ 200;
  do until(last.id);
    set temp1;
    by id;
    if _freq_=1 then want=catx('-',want,value);
    else want=catx('-',want,cats(_freq_,value));
  end;
  drop _type_ _freq_ value;
run;

proc print noobs;
run;
```