character question

Frequent Contributor
Posts: 90

I have two gene sequences

(1)GAGCAAGCGCCATACTCCTGTGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGA

(2)GAGCAAGCGCCATAGTCCTGTGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGA

 

I want to mark their differences.

So far I have marked the differences by converting each differing character to lowercase (see my code below). My question is how I can mark the differences in red instead.

I would also appreciate it if someone could optimize my code.

Thanks.

***Code Start*******************************************

data a;
length f1 $ 200;
input f1;
datalines;
GAGCAAGCGCCATACTCCTGTGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGA
GAGCAAGCGCCATAGTCCTGTGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGA
;

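/* Lowercase every character that differs from the first sequence */
data b;
   set a;
   retain base len_1;
   if _n_=1 then do;
      base=f1;              /* the first sequence is the reference */
      len_1=length(base);
   end;
   f2=f1;
   len_2=length(f2);
   x=min(len_1, len_2);     /* compare only the overlapping part */
   do i=1 to x;
      substr_1=substr(base,i,1);
      substr_2=substr(f2,i,1);
      if substr_1 ^=substr_2 then substr(f2,i,1)=lowcase(substr_2);
   end;
run;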

 

proc print data=b ;
var f1 f2;
run;

 

***Code end***********************************

 

 

 

 


All Replies
Super User
Posts: 11,336

Re: character question

Posted in reply to Niugg2010

Color implies an output destination that supports it. So the question for you is which output format you want, since styling individual characters will likely require a different method for each. Do you want HTML, RTF, PDF, or something else?

Frequent Contributor
Posts: 90

Re: character question

Either RTF or PDF is fine for me.

Do you mean using PROC TEMPLATE to control the output? As far as I know, PROC TEMPLATE can only apply styles at the cell-value level, not to individual characters within a cell.

Super User
Posts: 11,336

Re: character question

[ Edited ]
Posted in reply to Niugg2010

That's why the target destination matters. The only approach I see as likely to work is to build a string with embedded markup codes. In pseudocode, the result is a string that looks something like the following, where {font color: color value} is replaced by the raw codes of the output destination:

{font color: default}ABCABCABC{font color:red}abc{font color:default}BDABDABDA

(The letters intentionally do not resemble your data.)

ODS ESCAPECHAR and the RAW inline function will let you insert the control strings once you have determined the values needed.

I would recommend hard-coding a couple of examples to get a feel for it before trying to code conditionally based on the case of the letters. The latter shouldn't be too difficult once the correct codes are determined.

 

Here's a very brief example of inserting codes to print; change the RTF file path to something you can use:

/* Declare the character that introduces inline formatting functions */
ods escapechar="^";
data junk;

x = 'Example of ^{raw \cf12 RAW} function';
y ="Example ^{style [foreground=red] of Super, Alpha ^{super ^{unicode ALPHA}
       ^{style [foreground=green] Nested}} Formatting} and Scoping";
run;

ods rtf file='D:\data\junk.rtf' style=meadow;
proc print data=junk;
run;
ods rtf close;

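For many sequences, the conditional version might look like the sketch below. It assumes that the first sequence is the reference, that RTF is the destination, and that your SAS release accepts the inline form ^{style [foreground=red]...}. The dataset name marked, the variable f2, and the output file name are placeholders chosen for illustration. Each differing character adds roughly 27 characters of markup, so f2 needs a generous length.

```sas
ods escapechar="^";

data a;
   length f1 $ 200;
   input f1;
   datalines;
GAGCAAGCGCCATACTCCTGTGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGA
GAGCAAGCGCCATAGTCCTGTGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGA
;

data marked;
   set a;
   length base $ 200 f2 $ 2000 ch $ 1;
   retain base;
   if _n_ = 1 then base = f1;        /* assumption: first row is the reference */
   f2 = '';
   do i = 1 to length(f1);
      ch = char(f1, i);
      /* wrap a differing character in an inline style, copy the rest as-is */
      if ch ne char(base, i) then
         f2 = cats(f2, '^{style [foreground=red]', ch, '}');
      else f2 = cats(f2, ch);
   end;
   keep f1 f2;
run;

ods rtf file='marked.rtf';           /* hypothetical output path */
proc print data=marked noobs;
   var f2;
run;
ods rtf close;
```

Because the loop runs once per row, the same step should handle 50 sequences as readily as two; only the reference row and the length of f2 would need reviewing.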
Frequent Contributor
Posts: 90

Re: character question

[ Edited ]

Thanks. I tried it and it is powerful. However, I only listed two sequences above; I actually have over 50 sequences to mark. Is there a way to apply the markup conditionally across the data? Thanks.

Super User
Posts: 10,018

Re: character question

Posted in reply to Niugg2010
This would be very convenient to do in SAS/IML. Could you post the output you want?
Alternatively, post the question in the IML forum.

Super User
Posts: 10,018

Re: character question

Posted in reply to Niugg2010
data a;
length f1 $ 200;
input f1;
datalines;
GAGCAAGCGCCATACTCCTGTGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGA
GAGCAAGCGCCATAGTCCTGTGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGA
;
run;
proc format;
value fmt
 1='red';
run;

proc iml;
use a nobs nobs;
read all var {f1};
close;
n=length(f1)[1];
temp=j(nobs,n,' ');
do i=1 to nobs;
 temp[i,]=substr(f1[i],1:n,1);
end;
want=(countunique(temp,'col')>1);
create want from want;
append from want;
close;
run;
proc report data=want nowd;
define col:/style={backgroundcolor=fmt.};
run;
Frequent Contributor
Posts: 90

Re: character question

Cool, thanks. I learned a lot. I have never used PROC IML.

Super Contributor
Posts: 298

Re: character question

Posted in reply to Niugg2010

The character-by-character SUBSTR() loop can be replaced by the COMPARE() function. It compares the two strings and returns the leftmost position at which they differ. If you need to find more than one differing position, you can call COMPARE() again on the part of each string to the right of the position returned. The benefit is that strings that are identical are skipped with a single call.
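A minimal sketch of that repeated call, assuming the first sequence is the reference and that differences are still marked in lowercase. The dataset name marked and the variables base, start, len and pos are names chosen for illustration. ABS() is applied because COMPARE() returns a signed value whose sign reflects sort order and whose magnitude is the position.

```sas
data a;
   length f1 $ 200;
   input f1;
   datalines;
GAGCAAGCGCCATACTCCTGTGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGA
GAGCAAGCGCCATAGTCCTGTGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGA
GAGCAAGCGCCATAGTCCTGTGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAAGTGAACGTGGA
;

data marked;
   set a;
   length base $ 200;
   retain base;
   if _n_ = 1 then base = f1;              /* assumption: first row is the reference */
   else do;
      start = 1;
      len = min(length(f1), length(base)); /* only the overlapping part is marked */
      do while (start <= len);
         /* position of the next difference, relative to START; 0 if none remain */
         pos = abs(compare(substr(f1, start), substr(base, start)));
         if pos = 0 then leave;
         pos = start + pos - 1;            /* convert to a position in the full string */
         if pos > len then leave;          /* difference lies past the shorter string */
         substr(f1, pos, 1) = lowcase(substr(f1, pos, 1));
         start = pos + 1;                  /* resume the search to the right */
      end;
   end;
   put f1 = ;
   drop start len pos;
run;
```

The loop makes one COMPARE() call per difference plus one final call, so rows identical to the reference cost a single call.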

Frequent Contributor
Posts: 90

Re: character question

I am not familiar with the COMPARE() function. Can you help me optimize my code with COMPARE()? Thanks.

Solution
‎11-04-2016 01:33 PM
Super Contributor
Posts: 298

Re: character question

Posted in reply to Niugg2010

The COMPARE() function compares two strings. It returns the position of the leftmost character at which they differ, and 0 (zero) when the two strings are the same. The sign of the result only indicates which string sorts first, so the code below takes the absolute value before using it as a position. Your two strings differ at position 15; I have added one more string, identical to the first, to show COMPARE() returning 0. Note that this marks only the first differing position in each string.

data a;
length f1 $ 66;
input f1;
datalines;
GAGCAAGCGCCATACTCCTGTGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGA
GAGCAAGCGCCATACTCCTGTGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGA
GAGCAAGCGCCATAGTCCTGTGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGA
;
run;

data _null_;
   retain old;

   set a;
   if _n_ = 1 then old = f1;           /* first string is the reference */
   else do;
      dif = abs(compare(f1, old));     /* position of first difference, 0 if none */
      put dif = ;
      if dif = 0 then put 'No Difference';
      else substr(f1, dif, 1) = lowcase(substr(f1, dif, 1));
      put f1 = ;
   end;
run;

 

Super User
Posts: 10,018

Re: character question

Posted in reply to Niugg2010

OK, if you really want a DATA step solution:

 

data a;
length f1 $ 200;
input f1;
datalines;
GAGCAAGCGCCATACTCCTGTGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGA
GAGCAAGCGCCATAGTCCTGTGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGA
GAGCAAGCGCCATAGTCCTGTGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAAGTGAACGTGGA
AAGCAAGCGCCATAGTCCTGTGGAGSAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGA
;
run;
proc format;
value fmt
 1='red';
run;
data _null_;
 set a;
 call symputx('n',length(f1));
 stop;
run;
data temp;
 set a;
 array x{&n} $ 1;
 do i=1 to &n;
  x{i}=char(f1,i);
 end;
 keep x:;
run;
proc transpose data=temp(obs=0) out=vnames;
var _all_;
run;
data _null_;
 set vnames end=last;
 if _n_=1 then call execute('proc sql;create table flag as select ');
 call execute(cat('count(distinct ',_name_,') as ',_name_));
 if last then call execute ('from temp;quit;');
  else call execute(',');
run;
proc transpose data=flag out=diff_temp;
var _all_;
run;
data diff_vname;
 set diff_temp;
 if col1 ne 1;
run;
data want;
if _n_=1 then do;
 if 0 then set diff_vname;
 declare hash h(dataset:'diff_vname');
 h.definekey('_name_');
 h.definedata('col1');
 h.definedone();
end;
call missing(of _all_);
 set vnames;
 rc=h.find();
run;
data _null_;
 set want end=last;
 if _n_=1 then call execute('proc report data=temp nowd;');
 call execute(cat('define ',_name_,'/display'));
 if not missing(col1) then call execute(' style={backgroundcolor=red}');
 call execute(';');
 if last then call execute('run;');
run;
Frequent Contributor
Posts: 90

Re: character question

Thanks. I got it.

☑ This topic is solved.
