sas_student1
Quartz | Level 8

Hello,

 

I have a variable in a SAS dataset that holds a string of characters such as O-O-O-M-M-O-I-O. I would like to shorten the value by prefixing each character with the number of times it repeats in a row. For the example above, I would like it converted to 3O-2M-O-I-O. Any suggestions on how to do this?

Here is code that creates sample data:

 

data have;
infile datalines truncover;
input ID DNA :$30.;
datalines;
1 O-O-O-O-O-O-O-O-O-O-O-O-O-O-O
2 M-M-M-M-M-M-O-I-I-I-I-O-O-M-M
3 M-M-M-M-O
4 O-I-O-I-M-O-O-O-I-I
5 O
;
run;

 

What I would like for the above is to get the "want DNA" column.

 

ID  HAVE DNA                       WANT DNA
1   O-O-O-O-O-O-O-O-O-O-O-O-O-O-O  15O
2   M-M-M-M-M-M-O-I-I-I-I-O-O-M-M  6M-O-4I-2O-2M
3   M-M-M-M-O                      4M-O
4   O-I-O-I-M-O-O-O-I-I            O-I-O-I-M-3O-2I
5   O                              O

 

Thank you!

Accepted Solutions
snoopy369
Barite | Level 11
data want;
  set have;
  length out_strand $30;
  length cur_strand cur_base $1;

  do _i = 1 to countw(dna, '-');
    cur_strand = scan(dna, _i, '-');  *identify the current base we are looking at;

    *iterate over the scans to find the next nonmatch;
    do strand_count = 0 by 1 until (cur_strand ne cur_base);
      cur_base = scan(dna, strand_count + _i, '-');
    end;

    *compose the output string, prefixing the count only when it is greater than 1;
    out_strand = catx('-', out_strand, cats(ifc(strand_count > 1, put(strand_count, best12.), ''), cur_strand));

    _i = strand_count + _i - 1;  *have to decrement one, since we go one past the match;
  end;
run;

This should work for what you need; basically you scan over the string and keep scanning until you reach a non-match.
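
One way to confirm the result is to compare it with the WANT DNA values listed in the question. The step below is only a sketch: it assumes the `want` dataset from the code above already exists with `ID` and `out_strand`, and the names `expected`, `expected_dna`, `check`, and `match` are made up for illustration.

```sas
/* Expected values copied from the WANT DNA column in the question */
data expected;
  infile datalines truncover;
  input ID expected_dna :$30.;
  datalines;
1 15O
2 6M-O-4I-2O-2M
3 4M-O
4 O-I-O-I-M-3O-2I
5 O
;
run;

/* Both datasets are already in ID order, so they can be merged directly */
data check;
  merge want(keep=ID out_strand) expected;
  by ID;
  match = (out_strand = expected_dna);  /* 1 when the values agree, 0 otherwise */
run;

proc print data=check noobs;
run;
```

Any row where `match` is 0 would point to a string the loop compressed differently from what was asked for.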


4 REPLIES
Satish_Parida
Lapis Lazuli | Level 10
data finale(drop= i j k count);
set have;
length want_dna $200;											*This needs to be set to a maximum length;
do i=1 to n;
	count=0;
	key=substr(DNA,i,1);
	do j=i to n;
		if i=j and i ne 1 then do;
			do k=1 to k=i-1;
				if key=substr(DNA,k,1) then goto skip;
			end;
		end;
	if key=substr(DNA,j,1) then count=+1;
	end;

	if i=1 then do;
		if count gt 1 then 	WANT_DNA=key||'-'||put(count,best12.);
		else WANT_DNA=key;
	end;
	else do;
		if count gt 1 then 	WANT_DNA=WANT_DNA||'-'||key||'-'||put(count,best12.);
		else WANT_DNA=WANT_DNA||'-'||key;
	end;
	skip:
end;
run;

I have not tested this code; I will update it.

sas_student1
Quartz | Level 8

Thank you @Satish_Parida! I did get an error:

Variable n is uninitialized.
ERROR: Invalid DO loop control information, either the INITIAL or TO expression is missing or the BY expression is missing, zero, or invalid.

 

However, the code by @snoopy369 worked!

 

Thank you to you both!!
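
The `Variable n is uninitialized` message points at the cause of the error: `n` is used as the upper bound in `do i=1 to n` but is never assigned, so it is missing when the loop starts and SAS rejects the DO loop. Below is a minimal sketch of one way to give the loop a bound, assuming the intent is to walk the dash-separated tokens. It is not a corrected copy of the untested code above, and the names `want_tokens`, `tok`, `next_tok`, and `run_len` are illustrative.

```sas
data want_tokens(drop=i n run_len tok next_tok);
  set have;
  length want_dna $200 tok next_tok $1;

  n = countw(dna, '-');   /* assign the loop bound before it is used */
  run_len = 0;

  do i = 1 to n;
    tok      = scan(dna, i, '-');
    next_tok = scan(dna, i + 1, '-');   /* blank when i is the last token */
    run_len  = run_len + 1;

    /* The run ends when the next token differs, so write it out */
    if tok ne next_tok then do;
      if run_len > 1 then want_dna = catx('-', want_dna, cats(run_len, tok));
      else want_dna = catx('-', want_dna, tok);
      run_len = 0;
    end;
  end;
run;
```

Comparing each token only with its neighbor keeps the step to a single pass and counts consecutive repeats rather than total occurrences.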

Ksharp
Super User

I absolutely love this question.

data have;
infile cards expandtabs;
input ID 1. DNA $30.;
cards;
1 O-O-O-O-O-O-O-O-O-O-O-O-O-O-O	
2 M-M-M-M-M-M-O-I-I-I-I-O-O-M-M	
3 M-M-M-M-O	                
4 O-I-O-I-M-O-O-O-I-I	        
5 O	 
; run;
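/* One row per base, in the original order */
data temp;
  set have;
  do i = 1 to countw(dna, '-');
    value = scan(dna, i, '-');
    output;
  end;
  drop i dna;
run;

/* NOTSORTED makes each run of identical values its own BY group; _FREQ_ is the run length */
proc summary data=temp;
  by id value notsorted;
  output out=temp1;
run;

/* Rebuild one string per ID, prefixing the count when a run is longer than 1 */
data want;
  length want $ 200;
  do until (last.id);
    set temp1;
    by id;
    if _freq_ = 1 then want = catx('-', want, value);
    else want = catx('-', want, cats(_freq_, value));
  end;
  drop _type_ _freq_ value;
run;

proc print noobs; run;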
