sas_student1
Quartz | Level 8

Hello,

 

I have a variable in a SAS dataset that holds a character string such as O-O-O-M-M-O-I-O. I would like to shorten the value by prefixing each character with the number of times it repeats in a row. For the example above, the result would be 3O-2M-O-I-O. Any suggestions on how to do this?

 

Here is code that creates sample data:

 

data have;
input ID 1. DNA $30.;
datalines;
1 O-O-O-O-O-O-O-O-O-O-O-O-O-O-O
2 M-M-M-M-M-M-O-I-I-I-I-O-O-M-M
3 M-M-M-M-O
4 O-I-O-I-M-O-O-O-I-I
5 O
;
run;

 

For the data above, I would like to produce the "WANT DNA" column:

 

ID  HAVE DNA                       WANT DNA
1   O-O-O-O-O-O-O-O-O-O-O-O-O-O-O  15O
2   M-M-M-M-M-M-O-I-I-I-I-O-O-M-M  6M-O-4I-2O-2M
3   M-M-M-M-O                      4M-O
4   O-I-O-I-M-O-O-O-I-I            O-I-O-I-M-3O-2I
5   O                              O

 

Thank you!

1 ACCEPTED SOLUTION

snoopy369
Barite | Level 11
data want;
  set have;
  length out_strand $30;
  length cur_strand cur_base $1;

  do _i = 1 to countw(dna,'-');
    cur_strand = scan(dna,_i,'-');  *identify the current base we are looking at;

    do strand_count = 0 by 1 until (cur_strand ne cur_base);  *iterate over the scans to find the next nonmatch;
      cur_base = scan(dna,strand_count+_i,'-');
    end;

    *compose the output string, prepending the count if >1 and omitting it if =1;
    out_strand = catx('-',out_strand,cats(ifc(strand_count>1,cats(strand_count),''),cur_strand));

    _i = strand_count+_i-1;  *have to decrement one, since we go one past the match;
  end;
run;

This should work for what you need; basically you scan over the string and keep scanning until you reach a non-match.
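
The same run-scanning idea can also be sketched as a single forward pass that compares each base with the previous one. This is only an illustrative variant, not the accepted code: the dataset name want_sketch and the variables prev_base, run_len and n_bases are made up for the example, and it assumes every base is a single character in the `have` dataset from the question.

```sas
data want_sketch;
  set have;
  length out_strand $30 prev_base cur_base $1;

  run_len = 0;
  n_bases = countw(dna, '-');

  /* Go one position past the last base: SCAN returns a blank there, */
  /* which differs from the last base and so closes the final run.   */
  do _i = 1 to n_bases + 1;
    cur_base = scan(dna, _i, '-');

    /* A change of base ends the current run, so write it out */
    if _i > 1 and cur_base ne prev_base then do;
      if run_len > 1 then out_strand = catx('-', out_strand, cats(run_len, prev_base));
      else out_strand = catx('-', out_strand, prev_base);
      run_len = 0;
    end;

    prev_base = cur_base;
    run_len = run_len + 1;
  end;

  drop _i n_bases run_len prev_base cur_base;
run;
```

For ID 2 in the sample data this should build 6M-O-4I-2O-2M, matching the requested column. The accepted code reaches the same result by looking ahead from each run start instead of remembering the previous base.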

4 REPLIES
Satish_Parida
Lapis Lazuli | Level 10
data finale(drop= i j k count);
set have;
length want_dna $200;											*This needs to be set to a maximum length;
do i=1 to n;
	count=0;
	key=substr(DNA,i,1);
	do j=i to n;
		if i=j and i ne 1 then do;
			do k=1 to k=i-1;
				if key=substr(DNA,k,1) then goto skip;
			end;
		end;
	if key=substr(DNA,j,1) then count=+1;
	end;

	if i=1 then do;
		if count gt 1 then 	WANT_DNA=key||'-'||put(count,best12.);
		else WANT_DNA=key;
	end;
	else do;
		if count gt 1 then 	WANT_DNA=WANT_DNA||'-'||key||'-'||put(count,best12.);
		else WANT_DNA=WANT_DNA||'-'||key;
	end;
	skip:
end;
run;

I have not tested this code; I will update it.

sas_student1
Quartz | Level 8

Thank you @Satish_Parida! I did get an error:

Variable n is uninitialized.

ERROR: Invalid DO loop control information, either the INITIAL or TO expression is missing or the BY expression is missing, zero, or invalid.

 

However, the code by @snoopy369 worked!

 

Thank you to you both!!
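
The messages above point at `do i=1 to n;`: `n` is never assigned, so the TO expression is missing. Below is a minimal sketch of one way past that specific error, assuming `n` was meant to be the character length of DNA (the `substr(DNA,i,1)` calls suggest this, but the post does not say). It only lists the bases to the log and does not attempt to repair the counting logic of the untested code.

```sas
data _null_;
  set have;

  /* Assumption: n is the number of characters in DNA (not stated in the post) */
  n = length(DNA);

  do i = 1 to n;
    key = substr(DNA, i, 1);
    /* Skip the '-' separators so only the bases are written to the log */
    if key ne '-' then put ID= i= key=;
  end;
run;
```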

Ksharp
Super User

I absolutely love this question.

data have;
infile cards expandtabs;
input ID 1. DNA $30.;
cards;
1 O-O-O-O-O-O-O-O-O-O-O-O-O-O-O	
2 M-M-M-M-M-M-O-I-I-I-I-O-O-M-M	
3 M-M-M-M-O	                
4 O-I-O-I-M-O-O-O-I-I	        
5 O	 
; run;
data temp;
 set have;
 do i=1 to countw(dna,'-');
  value=scan(dna,i,'-');output;
 end;
 drop i dna;
run;
proc summary data=temp ;
by id value notsorted;
output out=temp1;
run;
data want;
length want $ 200;
 do until(last.id);
  set temp1;
  by id;
  if _freq_=1 then want=catx('-',want,value);
   else want=catx('-',want,cats(_freq_,value));
 end;
 drop _type_ _freq_ value;
run;
proc print noobs;run;
