BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
MajaAGR
Obsidian | Level 7

Hello!

Recently I was presented with this kind of data

data have ;                                                                                                                                                                                                                                                     
  input id (T1-T12) ($) ;                                                                                                                                                                                                                                        
  cards ;                                                                                                                                                                                                                                                       
1	E	A	E	B	A	A	E	A	C	B	E	F
2	F	A	E	F	E	E	E	E	A	E	E	E
3	A	C	C	E	E	E	A	F	D	D	E	F
4	A	F	F	D	B	B	C	A	A	A	C	C
5	B	B	C	C	D	A	A	A	B	E	E	F                                                                                                                                                                                                                                                
;                                                                                                                                                                                                                                                               
run ;

Here we have 5 person (animal) with 12 variables describing some its traits. At every trait we can have score from A to F.

This scores can be interpreted also as a string containing all scores. 

What would be goal is to compare scores between animals and to say where are they the same.

For example: id1 vs id2 are same at: t2,t3,t7,t11

id1 vs id2 same at: t11,t12

etc..

In original data set I have and will have much more animals. 

I appreciate all of inputs!

Cheers!

1 ACCEPTED SOLUTION

Accepted Solutions
sbxkoenk
SAS Super FREQ

Hello,

 

I think this program gives you what you want:

id = id of animal x

cid = id of animal y (which is compared to animal x)

location_of_diff = Traits different for animal x and animal y

 

data have ;                                                                                                                                                                                                                                                     
  input id (T1-T12) ($) ;                                                                                                                                                                                                                                        
  cards ;                                                                                                                                                                                                                                                       
1	E	A	E	B	A	A	E	A	C	B	E	F
2	F	A	E	F	E	E	E	E	A	E	E	E
3	A	C	C	E	E	E	A	F	D	D	E	F
4	A	F	F	D	B	B	C	A	A	A	C	C
5	B	B	C	C	D	A	A	A	B	E	E	F
;                                                                                                                                                                                                                                                               
run ;

data have1(rename=(id=cid) drop=i T1-T12);
 set have;
 array origT{12} $  T1- T12;
 array copyT{12} $ cT1-cT12;
 do i=1 to dim(origT);
  copyT(i)=origT(i);
 end;
run;

data _NULL_; 
 if 0 then set have1 nobs=count; 
 call symput('numobs',strip(put(count,8.)));
 STOP;
run;
%PUT &=numobs;

data want(drop=j T1-T12 cT1-cT12);
 LENGTH id cid 8 location_of_diff $ 200;
 set have;
 array origT{12} $  T1- T12;
 array copyT{12} $ cT1-cT12;
 array equaT{12} $ eT1-eT12;
 do pointer=1 to &numobs.;
  set have1 point=pointer;
  location_of_diff='';
  do j=1 to dim(origT);
   if copyT(j)=origT(j) then equaT(j)='Y';
   else equaT(j)='N';
   if equaT(j)='N' then location_of_diff=strip(location_of_diff)!!strip(vname(origT(j)));
  end;
  if id ^= cid then output;
 end;
run;
/* end of program */

Note: if you have 10 000+ animals, it may be worthwhile (performance-wise) to turn the data set have1 into a hash table and doing a Hash Object Table Look-up.

I haven't put any comments, hoping that you can grasp the code without. If not, tell me!

 

Cheers,

Koen

View solution in original post

14 REPLIES 14
PeterClemmensen
Tourmaline | Level 20

What does your desired result look like given this example data ?

 

 

MajaAGR
Obsidian | Level 7

Well,

I would like to have variable or variables saying which animals are compared, and one variable with locations of differences

Var1 Var2 diff

1 2 t2,t3,t7,t11

2 3 t11,t12

 

Or something similar. 

it can also be

t1 same in 3 and 4

t2 same in 1 and 2

sbxkoenk
SAS Super FREQ

Hello,

 

I think this program gives you what you want:

id = id of animal x

cid = id of animal y (which is compared to animal x)

location_of_diff = Traits different for animal x and animal y

 

data have ;                                                                                                                                                                                                                                                     
  input id (T1-T12) ($) ;                                                                                                                                                                                                                                        
  cards ;                                                                                                                                                                                                                                                       
1	E	A	E	B	A	A	E	A	C	B	E	F
2	F	A	E	F	E	E	E	E	A	E	E	E
3	A	C	C	E	E	E	A	F	D	D	E	F
4	A	F	F	D	B	B	C	A	A	A	C	C
5	B	B	C	C	D	A	A	A	B	E	E	F
;                                                                                                                                                                                                                                                               
run ;

data have1(rename=(id=cid) drop=i T1-T12);
 set have;
 array origT{12} $  T1- T12;
 array copyT{12} $ cT1-cT12;
 do i=1 to dim(origT);
  copyT(i)=origT(i);
 end;
run;

data _NULL_; 
 if 0 then set have1 nobs=count; 
 call symput('numobs',strip(put(count,8.)));
 STOP;
run;
%PUT &=numobs;

data want(drop=j T1-T12 cT1-cT12);
 LENGTH id cid 8 location_of_diff $ 200;
 set have;
 array origT{12} $  T1- T12;
 array copyT{12} $ cT1-cT12;
 array equaT{12} $ eT1-eT12;
 do pointer=1 to &numobs.;
  set have1 point=pointer;
  location_of_diff='';
  do j=1 to dim(origT);
   if copyT(j)=origT(j) then equaT(j)='Y';
   else equaT(j)='N';
   if equaT(j)='N' then location_of_diff=strip(location_of_diff)!!strip(vname(origT(j)));
  end;
  if id ^= cid then output;
 end;
run;
/* end of program */

Note: if you have 10 000+ animals, it may be worthwhile (performance-wise) to turn the data set have1 into a hash table and doing a Hash Object Table Look-up.

I haven't put any comments, hoping that you can grasp the code without. If not, tell me!

 

Cheers,

Koen

MajaAGR
Obsidian | Level 7
This looks great!
I will try grasp the code and change it to give me the list of similarities instead of differences.
Let me test it on a bigger data set and will let you know

Thanks!
sbxkoenk
SAS Super FREQ

Hello,

 

The key is you want to compare a record to all other records in the same dataset. And you want to repeat this for every record (comparing it to all other records).

That's a kind of Cartesian product.

There are multiple ways to solve this. Many people do this with PROC SQL.

I have done it with a data step:

I take a first record in HAVE and compare it to all records in HAVE1 (HAVE1 is a copy of HAVE with renamed variables).

Then I take a second record in HAVE and compare it to all records in HAVE1.

Until I reach the end of file (last record) of HAVE.

 

Cheers,

Koen

MajaAGR
Obsidian | Level 7
yes, Cartesian product was the last thing I successfully made from the data. Later I stuck with comparisons, your Y and N. Now all works fine. Actually my data looks a bit different so first I need to transpose and rename and.. but at the end this works!
MajaAGR
Obsidian | Level 7

Without issues I changed the code to show same traits instead of different ones.

All works perfectly!

 

Thanks again!

 

Ksharp
Super User

Do you have SAS/IML ?

Otherwise, that would cost you many code.

 

data have ;                                                                                                                                                                                                                                                     
  input id (T1-T12) ($) ;                                                                                                                                                                                                                                        
  cards ;                                                                                                                                                                                                                                                       
1	E	A	E	B	A	A	E	A	C	B	E	F
2	F	A	E	F	E	E	E	E	A	E	E	E
3	A	C	C	E	E	E	A	F	D	D	E	F
4	A	F	F	D	B	B	C	A	A	A	C	C
5	B	B	C	C	D	A	A	A	B	E	E	F                                                                                                                                                                                                                                                
;                                                                                                                                                                                                                                                               
run ;

proc iml;
use have;
read all var _char_ into t[c=vname];

do i=1 to nrow(t)-1;
 do j=i+1 to nrow(t);
   idx=loc(t[i,] ^= t[j,]) ;
   cat=rowcat(putc(t(vname[idx]),'$8.'));
   var1=var1//i;
   var2=var2//j;
   diff=diff//cat;
 end;
end;
create want var {var1 var2 diff};
append;
close;
quit;

proc print;run;

image.png

MajaAGR
Obsidian | Level 7
This one works well too.
I have IML and will see if I can go for this solution. However, I need positions where they are the same.
Ksharp
Super User

Same postion is easy too.

Just change 

idx=loc(t[i,] ^= t[j,]) ;

 

into 

idx=loc(t[i,] = t[j,]) ;

 

image.png

Ksharp
Super User
Of course. For the data step just change 
    if c{var1} ne c{var2} then output;

into 
    if c{var1} = c{var2} then output;

image.png

Ksharp
Super User
data have ;                                                                                                                                                                                                                                                     
  input id (T1-T12) ($) ;                                                                                                                                                                                                                                        
  cards ;                                                                                                                                                                                                                                                       
1	E	A	E	B	A	A	E	A	C	B	E	F
2	F	A	E	F	E	E	E	E	A	E	E	E
3	A	C	C	E	E	E	A	F	D	D	E	F
4	A	F	F	D	B	B	C	A	A	A	C	C
5	B	B	C	C	D	A	A	A	B	E	E	F                                                                                                                                                                                                                                                
;                                                                                                                                                                                                                                                               
run ;

proc transpose data=have out=temp;
var t:;
run;

data temp1;
 set temp;
 array c{*} col:;
 do var1=1 to dim(c)-1;
  do var2=var1+1 to dim(c);
    if c{var1} ne c{var2} then output;
  end;
 end;
keep _name_ var1 var2;
run;
proc sort data=temp1 out=temp2;
by var1 var2 _name_;
run;
data want;
 do until(last.var2);
  set temp2;
  by var1 var2;
  length diff $ 400;
  diff=catx(',',diff,_name_);
 end;
 drop _name_;
run;
PeterClemmensen
Tourmaline | Level 20

My 2 cents.

 

Should be reasonably fast as well.

 

data have ;                                                                                                                                                                                                                                                     
  input id (T1-T12) ($) ;                                                                                                                                                                                                                                        
  cards ;                                                                                                                                                                                                                                                       
1	E	A	E	B	A	A	E	A	C	B	E	F
2	F	A	E	F	E	E	E	E	A	E	E	E
3	A	C	C	E	E	E	A	F	D	D	E	F
4	A	F	F	D	B	B	C	A	A	A	C	C
5	B	B	C	C	D	A	A	A	B	E	E	F
;                                     

data want(keep = id id2 same);

   set have;
   id2 = .;
   length same $ 1000;
   same = '';
   array t {*} $ t1  - t12;
   array tt{*} $ tt1 - tt12;

   do i = 1 to dim(tt);
      tt[i] = t[i];
   end;

   if _N_ = 1 then do;
      dcl hash h(dataset : "have(rename = id = id2)", ordered : "Y");
      h.definekey("id2");
      h.definedata("id2");
      do i = 1 to dim(t);
         h.definedata(vname(t[i]));
      end;
      h.definedone();
      dcl hiter hi("h");
   end;

   hi.setcur(key : id);

   do while (hi.next() = 0);
      do i = 1 to dim(t);
         if t[i] = tt[i] then same = catx(',', same, vname(t[i]));
      end;
      output;
   end;

run;
PeterClemmensen
Tourmaline | Level 20

Small correction.

data have ;                                                                                                                                                                                                                                                     
  input id (T1-T12) ($) ;                                                                                                                                                                                                                                        
  cards ;                                                                                                                                                                                                                                                       
1	E	A	E	B	A	A	E	A	C	B	E	F
2	F	A	E	F	E	E	E	E	A	E	E	E
3	A	C	C	E	E	E	A	F	D	D	E	F
4	A	F	F	D	B	B	C	A	A	A	C	C
5	B	B	C	C	D	A	A	A	B	E	E	F
;                                     

data want(keep = id id2 same);

   set have;
   id2 = .;
   length same $ 1000;
   same = '';
   array t {*}  $ t1  - t12;
   array tt{12} $ _temporary_;

   do i = 1 to dim(tt);
      tt[i] = t[i];
   end;

   if _N_ = 1 then do;
      dcl hash h(dataset : "have(rename = id = id2)", ordered : "Y");
      h.definekey("id2");
      h.definedata("id2");
      do i = 1 to dim(t);
         h.definedata(vname(t[i]));
      end;
      h.definedone();
      dcl hiter hi("h");
   end;

   hi.setcur(key : id);

   do while (hi.next() = 0);
      do i = 1 to dim(t);
         if t[i] = tt[i] then same = catx(',', same, vname(t[i]));
      end;
      output;
   end;

run;

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 14 replies
  • 1620 views
  • 10 likes
  • 4 in conversation