Solved: Re: character question

Niugg2010 · Posted 11-03-2016 12:32 PM

I have two gene sequences

(1)GAGCAAGCGCCATACTCCTGTGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGA

(2)GAGCAAGCGCCATAGTCCTGTGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGA

I want to mark their differences.

So far I mark the differences by converting them to lowercase (see my code below). My question is how I can mark the differences in red instead.

By the way, I would appreciate it if someone could optimize my code.

Thanks.

***Code Start*******************************************

data a;
length f1 $ 200;
input f1;
datalines;
GAGCAAGCGCCATACTCCTGTGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGA
GAGCAAGCGCCATAGTCCTGTGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGA
;

data b;
   set a;
   retain base len_1;
   /* Use the first sequence as the reference */
   if _n_=1 then do;
      base=f1;
      len_1=length(base);
   end;
   f2=f1;
   len_2=length(f2);
   x=min(len_1, len_2);
   /* Lowercase every character of f2 that differs from the reference */
   do i=1 to x;
      substr_1=substr(base,i,1);
      substr_2=substr(f2,i,1);
      if substr_1 ^= substr_2 then substr(f2,i,1)=lowcase(substr_2);
   end;
run;

proc print data=b ;
var f1 f2;
run;

***Code end***********************************

ballardw · Posted 11-03-2016 12:45 PM

Color implies a display method that supports it. So the question for you is what output format you are looking for, since addressing individual characters will likely require different methods. Do you want HTML, RTF, PDF, or something else for output?

Niugg2010 · Posted 11-03-2016 01:34 PM

RTF or PDF are both fine for me.

Do you mean using PROC TEMPLATE to control the output? As far as I know, PROC TEMPLATE can only control style at the cell-value level, not individual characters within a cell.

ballardw · Posted 11-03-2016 02:03 PM

@Niugg2010 wrote:

RTF or PDF are both fine for me.

Do you mean using PROC TEMPLATE to control the output? As far as I know, PROC TEMPLATE can only control style at the cell-value level, not individual characters within a cell.

That's why the target destination is important. The only approach I see as likely to work is to build a string with embedded markup codes. In pseudocode, the result is a string that looks something like the following, where {font color: color value} is replaced by the raw codes of the destination's markup.

{font color: default}ABCABCABC{font color:red}abc{font color:default}BDABDABDA

I intentionally used letters that do not resemble your data.

ESCAPECHAR and the RAW function will let you insert the control strings once the values needed are determined.

I would recommend hard-coding a couple of examples to get a feel for it before trying to code conditionally based on the case of the letters. The latter shouldn't be too difficult once the correct code is determined.

Here's a brief example of inserting codes for printing. Change the RTF file path to something you can use:

ods escapechar="^";
data junk;

x = 'Example of ^{raw \cf12 RAW} function';
y ="Example ^{style [foreground=red] of Super, Alpha ^{super ^{unicode ALPHA}
       ^{style [foreground=green] Nested}} Formatting} and Scoping";
run;

ods rtf file='D:\data\junk.rtf' style=meadow;
proc print data=junk;
run;
ods rtf close;

Niugg2010 · Posted 11-03-2016 02:06 PM

Thanks. I tried it, and it is powerful. However, I listed only two sequences above; I actually have over 50 sequences to mark. Is there a way to add conditions to handle the data? Thanks.
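
One possible way to add the conditions is to build the marked-up string in a DATA step, wrapping each character that differs from the first sequence in an inline style. This is only a sketch under assumptions: the data set name marked, the variable f2, its 6000-byte length, and the output file name are made up for illustration, and the inline-style syntax is the one from the example above, so it is worth trying on a couple of rows first.

```sas
ods escapechar="^";

data a;
   length f1 $ 200;
   input f1;
   datalines;
GAGCAAGCGCCATACTCCTGTGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGA
GAGCAAGCGCCATAGTCCTGTGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGA
;

data marked;                      /* hypothetical output data set */
   set a;
   /* f2 must hold the sequence plus about 25 extra bytes per marked character */
   length base $ 200 f2 $ 6000;
   retain base;
   if _n_ = 1 then base = f1;     /* first sequence is the reference */
   f2 = '';
   do i = 1 to length(f1);
      c = char(f1, i);
      /* Wrap a differing character in an inline style, copy the rest as is */
      if c ne char(base, i) then
         f2 = cats(f2, '^{style [foreground=red]', c, '}');
      else f2 = cats(f2, c);
   end;
   drop i c base;
run;

ods rtf file='marked.rtf';        /* placeholder path, change as needed */
proc print data=marked noobs;
   var f2;
run;
ods rtf close;
```

Because every row is compared with the first one, the same step handles 50 sequences as well as 2. It assumes the sequences contain no blanks, since CATS strips them.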

Ksharp · Posted 11-03-2016 10:51 PM

This would be very convenient to do in SAS/IML. Could you post the output you want?
Or post the question in the IML forum.

Ksharp · Posted 11-04-2016 12:39 AM

data a;
length f1 $ 200;
input f1;
datalines;
GAGCAAGCGCCATACTCCTGTGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGA
GAGCAAGCGCCATAGTCCTGTGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGA
;
run;
proc format;
value fmt
 1='red';
run;

proc iml;
use a nobs nobs;
read all var {f1};
close;
n=length(f1)[1];
temp=j(nobs,n,' ');
do i=1 to nobs;
 temp[i,]=substr(f1[i],1:n,1);
end;
want=(countunique(temp,'col')>1);
create want from want;
append from want;
close;
quit;
proc report data=want nowd;
define col:/style={backgroundcolor=fmt.};
run;

Niugg2010 · Posted 11-04-2016 07:23 AM

Cool. Thanks. I learned a lot. I have never used PROC IML.

KachiM · Posted 11-04-2016 02:18 AM

The character-by-character SUBSTR() loop can be replaced by the COMPARE() function. It compares the two strings and returns the leftmost position where they differ, or 0 when they are the same. The result is signed according to sort order, so take ABS() before using it as a position. If the strings may differ in more than one position, you can call COMPARE() again on the part of the strings to the right of the position returned. The benefit is that identical strings are skipped with a single call.
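
A minimal sketch of that repeated-COMPARE() idea, marking every differing position in lowercase. The names marked, f2, start, pos and idx are assumptions for illustration, not from the posts above, and the first sequence is used as the reference.

```sas
data a;
   length f1 $ 200;
   input f1;
   datalines;
GAGCAAGCGCCATACTCCTGTGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGA
GAGCAAGCGCCATAGTCCTGTGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGA
GAGCAAGCGCCATAGTCCTGTGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAAGTGAACGTGGA
;

data marked;                      /* hypothetical output data set */
   set a;
   length base $ 200;
   retain base;
   if _n_ = 1 then base = f1;     /* first sequence is the reference */
   f2 = f1;
   start = 1;
   do while (start <= length(f1));
      /* Compare only the parts to the right of the last difference.
         ABS() is needed because COMPARE signs its result by sort order. */
      pos = abs(compare(substr(f1, start), substr(base, start)));
      if pos = 0 then leave;      /* no further differences */
      idx = start + pos - 1;      /* position within the full string */
      substr(f2, idx, 1) = lowcase(substr(f2, idx, 1));
      start = idx + 1;
   end;
   drop start pos idx base;
run;

proc print data=marked noobs;
   var f1 f2;
run;
```

Each call jumps straight to the next mismatch, so a row identical to the reference costs a single COMPARE() call rather than one SUBSTR() pair per character.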

Niugg2010 · Posted 11-04-2016 07:24 AM

I am not familiar with the COMPARE() function. Can you help me optimize my code with COMPARE()? Thanks.

KachiM · Posted 11-04-2016 10:45 AM

The COMPARE() function compares two strings. It returns the position of the leftmost character that does not match, or 0 when the two strings are the same. You gave only two strings, which differ at position 15, so I am adding one more string to show COMPARE() returning 0.

data a;
length f1 $ 66;
input f1;
datalines;
GAGCAAGCGCCATACTCCTGTGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGA
GAGCAAGCGCCATACTCCTGTGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGA
GAGCAAGCGCCATAGTCCTGTGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGA
;
run;

data _null_;
   length old $ 66;
   retain old;

   set a;
   if _n_ = 1 then old = f1;
   else do;
      /* COMPARE signs its result by sort order; ABS() gives the position */
      dif = abs(compare(f1, old));
      put dif = ;
      if dif = 0 then put 'No Difference';
      else substr(f1, dif, 1) = lowcase(substr(f1, dif, 1));
      put f1 = ;
   end;
run;

Ksharp · Posted 11-04-2016 09:13 AM

OK, if you really want a DATA step solution:

data a;
length f1 $ 200;
input f1;
datalines;
GAGCAAGCGCCATACTCCTGTGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGA
GAGCAAGCGCCATAGTCCTGTGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGA
GAGCAAGCGCCATAGTCCTGTGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAAGTGAACGTGGA
AAGCAAGCGCCATAGTCCTGTGGAGSAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGA
;
run;
proc format;
value fmt
 1='red';
run;
data _null_;
 set a;
 call symputx('n',length(f1));
 stop;
run;
data temp;
 set a;
 array x{&n} $ 1;
 do i=1 to &n;
  x{i}=char(f1,i);
 end;
 keep x:;
run;
proc transpose data=temp(obs=0) out=vnames;
var _all_;
run;
data _null_;
 set vnames end=last;
 if _n_=1 then call execute('proc sql;create table flag as select ');
 call execute(cat('count(distinct ',_name_,') as ',_name_));
 if last then call execute ('from temp;quit;');
  else call execute(',');
run;
proc transpose data=flag out=diff_temp;
var _all_;
run;
data diff_vname;
 set diff_temp;
 if col1 ne 1;
run;
data want;
if _n_=1 then do;
 if 0 then set diff_vname;
 declare hash h(dataset:'diff_vname');
 h.definekey('_name_');
 h.definedata('col1');
 h.definedone();
end;
call missing(of _all_);
 set vnames;
 rc=h.find();
run;
data _null_;
 set want end=last;
 if _n_=1 then call execute('proc report data=temp nowd;');
 call execute(cat('define ',_name_,'/display'));
 if not missing(col1) then call execute(' style={backgroundcolor=red}');
 call execute(';');
 if last then call execute('run;');
run;

Niugg2010 · Posted 11-04-2016 09:19 AM

Thanks. I got it.
