I have two gene sequences
(1)GAGCAAGCGCCATACTCCTGTGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGA
(2)GAGCAAGCGCCATAGTCCTGTGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGA
I want to mark their differences.
Right now I mark the differences with lowcase() (see my code below). My question is how I can mark the differences in red instead.
By the way, I would appreciate it if someone could optimize my code.
Thanks.
***Code Start*******************************************
data a;
length f1 $ 200;
input f1;
datalines;
GAGCAAGCGCCATACTCCTGTGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGA
GAGCAAGCGCCATAGTCCTGTGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGA
;
data b;
set a;
retain base len_1;
if _n_=1 then
do;
base=f1;
len_1=length(base);
end;
f2=f1;
len_2=length(f2);
x=min(len_1, len_2);
do i=1 to x;
substr_1=substr(base,i,1);
substr_2=substr(f2,i,1);
if substr_1 ^=substr_2 then substr(f2,i,1)=lowcase(substr_2);
else;
end;
run;
proc print data=b ;
var f1 f2;
run;
***Code end***********************************
Color implies a display method that supports it. So the question for you is what output format you are looking for, since addressing individual characters will likely require different methods depending on the destination. Do you want HTML, RTF, PDF or something else for output?
RTF or PDF are both fine for me.
Do you mean using PROC TEMPLATE to control the output? As far as I know, PROC TEMPLATE can only style at the cell-value level, not individual characters within a cell.
That's why the target definition is important. The only way I see as likely to work is to build a string with embedded markup codes. A pseudo-code approach would yield a string that looks something like the following, where {font color: color value} is replaced by the raw codes of the markup destination.
{font color: default}ABCABCABC{font color:red}abc{font color:default}BDABDABDA
I am intentionally using letters that do not resemble your data in any detail.
ESCAPECHAR and the RAW function will let you insert the control strings once the values needed are determined.
I would recommend hard-coding a couple of examples to get the feel before trying to code conditionally based on the case of the letters. The latter shouldn't be too difficult once the correct code is determined.
Here's a really brief example of inserting codes to print; change the RTF file path to something you can use:
ods escapechar="^";
data junk;
x = 'Example of ^{raw \cf12 RAW} function';
y ="Example ^{style [foreground=red] of Super, Alpha ^{super ^{unicode ALPHA}
^{style [foreground=green] Nested}} Formatting} and Scoping";
run;
ods rtf file='D:\data\junk.rtf' style=meadow;
proc print data=junk;
run;
ods rtf close;
Thanks. I tried it and it is powerful. However, I only listed two sequences above; I actually have over 50 sequences to mark. Is there a way to add the markup conditionally based on the data? Thanks.
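One way to make the markup conditional is to build the marked-up string character by character in a DATA step. The following is only a sketch under a few assumptions: the first row is treated as the reference sequence, the names `marked`, `base` and `ch` and the output path are placeholders made up for illustration, and `^` is the escape character as in the example above. Each differing character is wrapped in its own inline style, so the worst-case length of `marked` is about 26 times the sequence length.

```sas
ods escapechar="^";

data a;
  length f1 $ 200;
  input f1;
  datalines;
GAGCAAGCGCCATACTCCTGTGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGA
GAGCAAGCGCCATAGTCCTGTGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGA
;
run;

data marked;
  set a;
  /* marked must hold up to 26 characters per position (hypothetical sizing) */
  length base $ 200 marked $ 6000 ch $ 1;
  retain base;
  if _n_ = 1 then base = f1;      /* assumption: first row is the reference */
  marked = '';
  do i = 1 to length(f1);
    ch = char(f1, i);
    /* wrap the character in inline style markup only where it differs */
    if ch ne char(base, i) then
      marked = cats(marked, '^{style [foreground=red]', ch, '}');
    else marked = cats(marked, ch);
  end;
  keep f1 marked;
run;

/* hypothetical path: change it to a location you can write to */
ods rtf file='D:\data\marked.rtf';
proc print data=marked noobs;
  var marked;
run;
ods rtf close;
```

Because every row is compared with the retained reference, the same step should handle 50 sequences as readily as two. If your sequences are longer than 200 characters, the lengths above need to grow accordingly.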
This would be very convenient to do in SAS/IML. It would help if you could post the output you want, or post the question in the IML forum.
data a;
length f1 $ 200;
input f1;
datalines;
GAGCAAGCGCCATACTCCTGTGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGA
GAGCAAGCGCCATAGTCCTGTGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGA
;
run;
proc format;
value fmt
1='red';
run;
proc iml;
use a nobs nobs;
read all var {f1};
close;
n=length(f1)[1];
temp=j(nobs,n,' ');
do i=1 to nobs;
temp[i,]=substr(f1[i],1:n,1);
end;
want=(countunique(temp,'col')>1);
create want from want;
append from want;
close;
quit;
proc report data=want nowd;
define col:/style={backgroundcolor=fmt.};
run;
Cool, thanks. I learned a lot. I have never used PROC IML.
The SUBSTR() loop can be replaced by the COMPARE() function. It compares two strings and returns the leftmost position where they differ. If you need to find more than one differing position, you can call COMPARE() again on the text to the right of the position returned. The benefit is that you can skip strings that are identical.
I am not familiar with the COMPARE() function. Can you help me optimize my code with COMPARE()? Thanks.
The COMPARE() function compares two strings. It returns the position of the leftmost character that does not match (negative if the first string sorts before the second), and 0 (zero) when the two strings are the same. Since you gave only two strings, which differ at position 15, I am adding one more string to show that COMPARE() returns 0 for identical strings.
data a;
length f1 $ 66;
input f1;
datalines;
GAGCAAGCGCCATACTCCTGTGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGA
GAGCAAGCGCCATACTCCTGTGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGA
GAGCAAGCGCCATAGTCCTGTGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGA
;
run;
data _null_;
retain old;
set a;
if _n_ = 1 then old = f1;
else do;
dif = abs(compare(f1, old)); /* ABS: COMPARE returns a negative position when f1 sorts before old */
put dif = ;
if dif = 0 then put 'No Difference';
if dif ^= 0 then substr(f1, dif, 1) = lowcase(substr(f1, dif, 1));
put f1 = ;
end;
run;
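The step above marks only the first mismatch in each row. To mark every mismatch, as suggested earlier in the thread, COMPARE() can be called repeatedly on the text to the right of the last position found. This is a minimal sketch, not a tested production solution: the variables `start` and `pos` are names introduced here for illustration, and the second data row is borrowed from elsewhere in this thread because it has more than one mismatch.

```sas
data a;
  length f1 $ 66;
  input f1;
  datalines;
GAGCAAGCGCCATACTCCTGTGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGA
GAGCAAGCGCCATAGTCCTGTGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAAGTGAACGTGGA
;
run;

data _null_;
  set a;
  length old $ 66;
  retain old;
  if _n_ = 1 then old = f1;          /* first row is the reference */
  else do;
    start = 1;
    do while (start <= length(f1));
      /* ABS() because COMPARE returns a negative position when its
         first argument sorts before the second */
      dif = abs(compare(substr(f1, start), substr(old, start)));
      if dif = 0 then leave;         /* the remainders are identical */
      pos = start + dif - 1;         /* convert back to a position in f1 */
      substr(f1, pos, 1) = lowcase(substr(f1, pos, 1));
      start = pos + 1;               /* resume just past the mismatch */
    end;
    put f1 = ;
  end;
run;
```

Rows identical to the reference leave the loop after a single COMPARE() call, which is the saving over a character-by-character SUBSTR() loop.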
OK, if you really want a DATA step solution:
data a;
length f1 $ 200;
input f1;
datalines;
GAGCAAGCGCCATACTCCTGTGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGA
GAGCAAGCGCCATAGTCCTGTGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGA
GAGCAAGCGCCATAGTCCTGTGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAAGTGAACGTGGA
AAGCAAGCGCCATAGTCCTGTGGAGSAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGA
;
run;
proc format;
value fmt
1='red';
run;
data _null_;
set a;
call symputx('n',length(f1));
stop;
run;
data temp;
set a;
array x{&n} $ 1;
do i=1 to &n;
x{i}=char(f1,i);
end;
keep x:;
run;
proc transpose data=temp(obs=0) out=vnames;
var _all_;
run;
data _null_;
set vnames end=last;
if _n_=1 then call execute('proc sql;create table flag as select ');
call execute(cat('count(distinct ',_name_,') as ',_name_));
if last then call execute ('from temp;quit;');
else call execute(',');
run;
proc transpose data=flag out=diff_temp;
var _all_;
run;
data diff_vname;
set diff_temp;
if col1 ne 1;
run;
data want;
if _n_=1 then do;
if 0 then set diff_vname;
declare hash h(dataset:'diff_vname');
h.definekey('_name_');
h.definedata('col1');
h.definedone();
end;
call missing(of _all_);
set vnames;
rc=h.find();
run;
data _null_;
set want end=last;
if _n_=1 then call execute('proc report data=temp nowd;');
call execute(cat('define ',_name_,'/display'));
if not missing(col1) then call execute(' style={backgroundcolor=red}');
call execute(';');
if last then call execute('run;');
run;
Thanks. I got it.