Solved: Re: Comparing multiple character variables between IDs

MajaAGR · Posted 04-25-2021 06:56 AM

Hello!

Recently I was presented with this kind of data

data have ;                                                                                                                                                                                                                                                     
  input id (T1-T12) ($) ;                                                                                                                                                                                                                                        
  cards ;                                                                                                                                                                                                                                                       
1	E	A	E	B	A	A	E	A	C	B	E	F
2	F	A	E	F	E	E	E	E	A	E	E	E
3	A	C	C	E	E	E	A	F	D	D	E	F
4	A	F	F	D	B	B	C	A	A	A	C	C
5	B	B	C	C	D	A	A	A	B	E	E	F                                                                                                                                                                                                                                                
;                                                                                                                                                                                                                                                               
run ;

Here we have 5 person (animal) with 12 variables describing some its traits. At every trait we can have score from A to F.

This scores can be interpreted also as a string containing all scores.

What would be goal is to compare scores between animals and to say where are they the same.

For example: id1 vs id2 are same at: t2,t3,t7,t11

id1 vs id2 same at: t11,t12

etc..

In original data set I have and will have much more animals.

I appreciate all of inputs!

Cheers!

sbxkoenk · Posted 04-25-2021 08:24 AM

Hello,

I think this program gives you what you want:

id = id of animal x

cid = id of animal y (which is compared to animal x)

location_of_diff = Traits different for animal x and animal y

data have ;                                                                                                                                                                                                                                                     
  input id (T1-T12) ($) ;                                                                                                                                                                                                                                        
  cards ;                                                                                                                                                                                                                                                       
1	E	A	E	B	A	A	E	A	C	B	E	F
2	F	A	E	F	E	E	E	E	A	E	E	E
3	A	C	C	E	E	E	A	F	D	D	E	F
4	A	F	F	D	B	B	C	A	A	A	C	C
5	B	B	C	C	D	A	A	A	B	E	E	F
;                                                                                                                                                                                                                                                               
run ;

data have1(rename=(id=cid) drop=i T1-T12);
 set have;
 array origT{12} $  T1- T12;
 array copyT{12} $ cT1-cT12;
 do i=1 to dim(origT);
  copyT(i)=origT(i);
 end;
run;

data _NULL_; 
 if 0 then set have1 nobs=count; 
 call symput('numobs',strip(put(count,8.)));
 STOP;
run;
%PUT &=numobs;

data want(drop=j T1-T12 cT1-cT12);
 LENGTH id cid 8 location_of_diff $ 200;
 set have;
 array origT{12} $  T1- T12;
 array copyT{12} $ cT1-cT12;
 array equaT{12} $ eT1-eT12;
 do pointer=1 to &numobs.;
  set have1 point=pointer;
  location_of_diff='';
  do j=1 to dim(origT);
   if copyT(j)=origT(j) then equaT(j)='Y';
   else equaT(j)='N';
   if equaT(j)='N' then location_of_diff=strip(location_of_diff)!!strip(vname(origT(j)));
  end;
  if id ^= cid then output;
 end;
run;
/* end of program */

Note: if you have 10 000+ animals, it may be worthwhile (performance-wise) to turn the data set have1 into a hash table and doing a Hash Object Table Look-up.

I haven't put any comments, hoping that you can grasp the code without. If not, tell me!

Cheers,

Koen

View solution in original post

PeterClemmensen · Posted 04-25-2021 07:18 AM

What does your desired result look like given this example data ?

The DATA to DATA Step Macro
Blog: SASnrd

MajaAGR · Posted 04-25-2021 07:55 AM

Well,

I would like to have variable or variables saying which animals are compared, and one variable with locations of differences

Var1 Var2 diff

1 2 t2,t3,t7,t11

2 3 t11,t12

Or something similar.

it can also be

t1 same in 3 and 4

t2 same in 1 and 2

sbxkoenk · Posted 04-25-2021 08:24 AM

Hello,

I think this program gives you what you want:

id = id of animal x

cid = id of animal y (which is compared to animal x)

location_of_diff = Traits different for animal x and animal y

data have ;                                                                                                                                                                                                                                                     
  input id (T1-T12) ($) ;                                                                                                                                                                                                                                        
  cards ;                                                                                                                                                                                                                                                       
1	E	A	E	B	A	A	E	A	C	B	E	F
2	F	A	E	F	E	E	E	E	A	E	E	E
3	A	C	C	E	E	E	A	F	D	D	E	F
4	A	F	F	D	B	B	C	A	A	A	C	C
5	B	B	C	C	D	A	A	A	B	E	E	F
;                                                                                                                                                                                                                                                               
run ;

data have1(rename=(id=cid) drop=i T1-T12);
 set have;
 array origT{12} $  T1- T12;
 array copyT{12} $ cT1-cT12;
 do i=1 to dim(origT);
  copyT(i)=origT(i);
 end;
run;

data _NULL_; 
 if 0 then set have1 nobs=count; 
 call symput('numobs',strip(put(count,8.)));
 STOP;
run;
%PUT &=numobs;

data want(drop=j T1-T12 cT1-cT12);
 LENGTH id cid 8 location_of_diff $ 200;
 set have;
 array origT{12} $  T1- T12;
 array copyT{12} $ cT1-cT12;
 array equaT{12} $ eT1-eT12;
 do pointer=1 to &numobs.;
  set have1 point=pointer;
  location_of_diff='';
  do j=1 to dim(origT);
   if copyT(j)=origT(j) then equaT(j)='Y';
   else equaT(j)='N';
   if equaT(j)='N' then location_of_diff=strip(location_of_diff)!!strip(vname(origT(j)));
  end;
  if id ^= cid then output;
 end;
run;
/* end of program */

Note: if you have 10 000+ animals, it may be worthwhile (performance-wise) to turn the data set have1 into a hash table and doing a Hash Object Table Look-up.

I haven't put any comments, hoping that you can grasp the code without. If not, tell me!

Cheers,

Koen

MajaAGR · Posted 04-25-2021 08:34 AM

This looks great!
I will try grasp the code and change it to give me the list of similarities instead of differences.
Let me test it on a bigger data set and will let you know

Thanks!

sbxkoenk · Posted 04-25-2021 08:45 AM

Hello,

The key is you want to compare a record to all other records in the same dataset. And you want to repeat this for every record (comparing it to all other records).

That's a kind of Cartesian product.

There are multiple ways to solve this. Many people do this with PROC SQL.

I have done it with a data step:

I take a first record in HAVE and compare it to all records in HAVE1 (HAVE1 is a copy of HAVE with renamed variables).

Then I take a second record in HAVE and compare it to all records in HAVE1.

Until I reach the end of file (last record) of HAVE.

Cheers,

Koen

MajaAGR · Posted 04-25-2021 09:00 AM

yes, Cartesian product was the last thing I successfully made from the data. Later I stuck with comparisons, your Y and N. Now all works fine. Actually my data looks a bit different so first I need to transpose and rename and.. but at the end this works!

MajaAGR · Posted 04-25-2021 08:40 AM

Without issues I changed the code to show same traits instead of different ones.

All works perfectly!

Thanks again!

Ksharp · Posted 04-25-2021 08:49 AM

Do you have SAS/IML ?

Otherwise, that would cost you many code.

data have ;                                                                                                                                                                                                                                                     
  input id (T1-T12) ($) ;                                                                                                                                                                                                                                        
  cards ;                                                                                                                                                                                                                                                       
1	E	A	E	B	A	A	E	A	C	B	E	F
2	F	A	E	F	E	E	E	E	A	E	E	E
3	A	C	C	E	E	E	A	F	D	D	E	F
4	A	F	F	D	B	B	C	A	A	A	C	C
5	B	B	C	C	D	A	A	A	B	E	E	F                                                                                                                                                                                                                                                
;                                                                                                                                                                                                                                                               
run ;

proc iml;
use have;
read all var _char_ into t[c=vname];

do i=1 to nrow(t)-1;
 do j=i+1 to nrow(t);
   idx=loc(t[i,] ^= t[j,]) ;
   cat=rowcat(putc(t(vname[idx]),'$8.'));
   var1=var1//i;
   var2=var2//j;
   diff=diff//cat;
 end;
end;
create want var {var1 var2 diff};
append;
close;
quit;

proc print;run;

MajaAGR · Posted 04-25-2021 08:57 AM

This one works well too.
I have IML and will see if I can go for this solution. However, I need positions where they are the same.

Ksharp · Posted 04-25-2021 09:06 AM

Same postion is easy too.

Just change

idx=loc(t[i,] ^= t[j,]) ;

into

idx=loc(t[i,] = t[j,]) ;

Ksharp · Posted 04-25-2021 09:08 AM

Of course. For the data step just change 
    if c{var1} ne c{var2} then output;

into 
    if c{var1} = c{var2} then output;

Ksharp · Posted 04-25-2021 08:55 AM

data have ;                                                                                                                                                                                                                                                     
  input id (T1-T12) ($) ;                                                                                                                                                                                                                                        
  cards ;                                                                                                                                                                                                                                                       
1	E	A	E	B	A	A	E	A	C	B	E	F
2	F	A	E	F	E	E	E	E	A	E	E	E
3	A	C	C	E	E	E	A	F	D	D	E	F
4	A	F	F	D	B	B	C	A	A	A	C	C
5	B	B	C	C	D	A	A	A	B	E	E	F                                                                                                                                                                                                                                                
;                                                                                                                                                                                                                                                               
run ;

proc transpose data=have out=temp;
var t:;
run;

data temp1;
 set temp;
 array c{*} col:;
 do var1=1 to dim(c)-1;
  do var2=var1+1 to dim(c);
    if c{var1} ne c{var2} then output;
  end;
 end;
keep _name_ var1 var2;
run;
proc sort data=temp1 out=temp2;
by var1 var2 _name_;
run;
data want;
 do until(last.var2);
  set temp2;
  by var1 var2;
  length diff $ 400;
  diff=catx(',',diff,_name_);
 end;
 drop _name_;
run;

PeterClemmensen · Posted 04-25-2021 09:42 AM

My 2 cents.

Should be reasonably fast as well.

data have ;                                                                                                                                                                                                                                                     
  input id (T1-T12) ($) ;                                                                                                                                                                                                                                        
  cards ;                                                                                                                                                                                                                                                       
1	E	A	E	B	A	A	E	A	C	B	E	F
2	F	A	E	F	E	E	E	E	A	E	E	E
3	A	C	C	E	E	E	A	F	D	D	E	F
4	A	F	F	D	B	B	C	A	A	A	C	C
5	B	B	C	C	D	A	A	A	B	E	E	F
;                                     

data want(keep = id id2 same);

   set have;
   id2 = .;
   length same $ 1000;
   same = '';
   array t {*} $ t1  - t12;
   array tt{*} $ tt1 - tt12;

   do i = 1 to dim(tt);
      tt[i] = t[i];
   end;

   if _N_ = 1 then do;
      dcl hash h(dataset : "have(rename = id = id2)", ordered : "Y");
      h.definekey("id2");
      h.definedata("id2");
      do i = 1 to dim(t);
         h.definedata(vname(t[i]));
      end;
      h.definedone();
      dcl hiter hi("h");
   end;

   hi.setcur(key : id);

   do while (hi.next() = 0);
      do i = 1 to dim(t);
         if t[i] = tt[i] then same = catx(',', same, vname(t[i]));
      end;
      output;
   end;

run;

The DATA to DATA Step Macro
Blog: SASnrd

PeterClemmensen · Posted 04-25-2021 09:44 AM

Small correction.

data have ;                                                                                                                                                                                                                                                     
  input id (T1-T12) ($) ;                                                                                                                                                                                                                                        
  cards ;                                                                                                                                                                                                                                                       
1	E	A	E	B	A	A	E	A	C	B	E	F
2	F	A	E	F	E	E	E	E	A	E	E	E
3	A	C	C	E	E	E	A	F	D	D	E	F
4	A	F	F	D	B	B	C	A	A	A	C	C
5	B	B	C	C	D	A	A	A	B	E	E	F
;                                     

data want(keep = id id2 same);

   set have;
   id2 = .;
   length same $ 1000;
   same = '';
   array t {*}  $ t1  - t12;
   array tt{12} $ _temporary_;

   do i = 1 to dim(tt);
      tt[i] = t[i];
   end;

   if _N_ = 1 then do;
      dcl hash h(dataset : "have(rename = id = id2)", ordered : "Y");
      h.definekey("id2");
      h.definedata("id2");
      do i = 1 to dim(t);
         h.definedata(vname(t[i]));
      end;
      h.definedone();
      dcl hiter hi("h");
   end;

   hi.setcur(key : id);

   do while (hi.next() = 0);
      do i = 1 to dim(t);
         if t[i] = tt[i] then same = catx(',', same, vname(t[i]));
      end;
      output;
   end;

run;

The DATA to DATA Step Macro
Blog: SASnrd

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away