sas_student1
Quartz | Level 8

Hello,

 

I have a character variable in a SAS dataset that holds a string such as O-O-O-M-M-O-I-O. I would like to shorten the value by prefixing each character with the number of times it is repeated in a row. For the example above, the result would be 3O-2M-O-I-O. Any suggestions on how to do this?

 

Here is some code you can use to create sample data:

 

data have;
  input ID DNA :$30.;
datalines;
1 O-O-O-O-O-O-O-O-O-O-O-O-O-O-O
2 M-M-M-M-M-M-O-I-I-I-I-O-O-M-M
3 M-M-M-M-O
4 O-I-O-I-M-O-O-O-I-I
5 O
;
run;

 

What I would like for the above is to get the "want DNA" column.

 

ID  HAVE DNA                       WANT DNA
1   O-O-O-O-O-O-O-O-O-O-O-O-O-O-O  15O
2   M-M-M-M-M-M-O-I-I-I-I-O-O-M-M  6M-O-4I-2O-2M
3   M-M-M-M-O                      4M-O
4   O-I-O-I-M-O-O-O-I-I            O-I-O-I-M-3O-2I
5   O                              O

 

Thank you!

1 ACCEPTED SOLUTION

snoopy369
Barite | Level 11
data want;
  set have;
  length out_strand $30;
  length cur_strand cur_base $1;

  do _i = 1 to countw(dna, '-');
    cur_strand = scan(dna, _i, '-');  *identify the current base we are looking at;

    do strand_count = 0 by 1 until (cur_strand ne cur_base);  *iterate over the scans to find the next nonmatch;
      cur_base = scan(dna, strand_count + _i, '-');
    end;

    *compose the output string, prefixing the count only when it is greater than 1;
    out_strand = catx('-', out_strand, cats(ifc(strand_count > 1, cats(strand_count), ''), cur_strand));

    _i = strand_count + _i - 1;  *move to the last base of this run, the DO loop then advances by one;
  end;
run;

This should work for what you need; basically you scan over the string and keep scanning until you reach a non-match.

Satish_Parida
Lapis Lazuli | Level 10
data finale(drop= i j k count);
set have;
length want_dna $200;  *This needs to be set to the maximum expected length;
do i=1 to n;
	count=0;
	key=substr(DNA,i,1);
	do j=i to n;
		if i=j and i ne 1 then do;
			do k=1 to k=i-1;
				if key=substr(DNA,k,1) then goto skip;
			end;
		end;
	if key=substr(DNA,j,1) then count=+1;
	end;

	if i=1 then do;
		if count gt 1 then 	WANT_DNA=key||'-'||put(count,best12.);
		else WANT_DNA=key;
	end;
	else do;
		if count gt 1 then 	WANT_DNA=WANT_DNA||'-'||key||'-'||put(count,best12.);
		else WANT_DNA=WANT_DNA||'-'||key;
	end;
	skip:
end;
run;

I have not tested this code; I will update it.

sas_student1
Quartz | Level 8

Thank you @Satish_Parida! I did get an error:

Variable n is uninitialized.

ERROR: Invalid DO loop control information, either the INITIAL or TO expression is missing or the BY expression is missing, zero, or invalid.

 

However, the code by @snoopy369 worked!

 

Thank you to you both!!
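
For anyone who lands here with the same log message: in the `finale` step, `n` is never assigned, so it is missing and `do i=1 to n` has no valid TO value. Below is a minimal sketch of a character-by-character version that counts runs; it is not the poster's updated code. It assumes every base is a single character, bases are separated by single dashes, and `DNA` has no leading blanks. The names `want_fixed`, `pos`, `nxt`, `base`, and `run_len` are made up for illustration.

```sas
data want_fixed (drop=n pos nxt base run_len);
  set have;
  length want_dna $200 base $1;   /* want_dna must hold the longest possible result */

  n = length(dna);                /* assumption: this is what the loop bound n was meant to be */
  pos = 1;
  do while (pos <= n);
    base = substr(dna, pos, 1);   /* current base */
    run_len = 1;

    /* bases sit at every second position, with dashes in between */
    do nxt = pos + 2 to n by 2;
      if substr(dna, nxt, 1) ne base then leave;
      run_len = run_len + 1;
    end;

    /* prefix the count only when the base repeats */
    if run_len > 1 then want_dna = catx('-', want_dna, cats(run_len, base));
    else want_dna = catx('-', want_dna, base);

    pos = pos + 2 * run_len;      /* jump to the first base after this run */
  end;
run;
```

Bounding the inner loop by `n` keeps every `substr` call inside the string, so no invalid-argument notes should appear for the sample rows.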

Ksharp
Super User

I absolutely love this question.

data have;
infile cards expandtabs;
input ID 1. DNA $30.;
cards;
1 O-O-O-O-O-O-O-O-O-O-O-O-O-O-O	
2 M-M-M-M-M-M-O-I-I-I-I-O-O-M-M	
3 M-M-M-M-O	                
4 O-I-O-I-M-O-O-O-I-I	        
5 O	 
; run;
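/* one row per base, in the original order */
data temp;
  set have;
  do i = 1 to countw(dna, '-');
    value = scan(dna, i, '-');
    output;
  end;
  drop i dna;
run;

/* NOTSORTED treats each run of consecutive equal values as its own BY group, */
/* so _FREQ_ holds the length of each run */
proc summary data=temp;
  by id value notsorted;
  output out=temp1;
run;

/* rebuild one string per ID, prefixing the count when a run is longer than 1 */
data want;
  length want $ 200;
  do until (last.id);
    set temp1;
    by id;
    if _freq_ = 1 then want = catx('-', want, value);
    else want = catx('-', want, cats(_freq_, value));
  end;
  drop _type_ _freq_ value;
run;

proc print noobs;
run;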