SAS Programming

DATA Step, Macro, Functions and more
BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
MajaAGR
Obsidian | Level 7

Hello!

Recently I was presented with this kind of data

data have ;                                                                                                                                                                                                                                                     
  input id (T1-T12) ($) ;                                                                                                                                                                                                                                        
  cards ;                                                                                                                                                                                                                                                       
1	E	A	E	B	A	A	E	A	C	B	E	F
2	F	A	E	F	E	E	E	E	A	E	E	E
3	A	C	C	E	E	E	A	F	D	D	E	F
4	A	F	F	D	B	B	C	A	A	A	C	C
5	B	B	C	C	D	A	A	A	B	E	E	F                                                                                                                                                                                                                                                
;                                                                                                                                                                                                                                                               
run ;

Here we have 5 person (animal) with 12 variables describing some its traits. At every trait we can have score from A to F.

This scores can be interpreted also as a string containing all scores. 

What would be goal is to compare scores between animals and to say where are they the same.

For example: id1 vs id2 are same at: t2,t3,t7,t11

id1 vs id2 same at: t11,t12

etc..

In original data set I have and will have much more animals. 

I appreciate all of inputs!

Cheers!

1 ACCEPTED SOLUTION

Accepted Solutions
sbxkoenk
SAS Super FREQ

Hello,

 

I think this program gives you what you want:

id = id of animal x

cid = id of animal y (which is compared to animal x)

location_of_diff = Traits different for animal x and animal y

 

data have ;                                                                                                                                                                                                                                                     
  input id (T1-T12) ($) ;                                                                                                                                                                                                                                        
  cards ;                                                                                                                                                                                                                                                       
1	E	A	E	B	A	A	E	A	C	B	E	F
2	F	A	E	F	E	E	E	E	A	E	E	E
3	A	C	C	E	E	E	A	F	D	D	E	F
4	A	F	F	D	B	B	C	A	A	A	C	C
5	B	B	C	C	D	A	A	A	B	E	E	F
;                                                                                                                                                                                                                                                               
run ;

data have1(rename=(id=cid) drop=i T1-T12);
 set have;
 array origT{12} $  T1- T12;
 array copyT{12} $ cT1-cT12;
 do i=1 to dim(origT);
  copyT(i)=origT(i);
 end;
run;

data _NULL_; 
 if 0 then set have1 nobs=count; 
 call symput('numobs',strip(put(count,8.)));
 STOP;
run;
%PUT &=numobs;

data want(drop=j T1-T12 cT1-cT12);
 LENGTH id cid 8 location_of_diff $ 200;
 set have;
 array origT{12} $  T1- T12;
 array copyT{12} $ cT1-cT12;
 array equaT{12} $ eT1-eT12;
 do pointer=1 to &numobs.;
  set have1 point=pointer;
  location_of_diff='';
  do j=1 to dim(origT);
   if copyT(j)=origT(j) then equaT(j)='Y';
   else equaT(j)='N';
   if equaT(j)='N' then location_of_diff=strip(location_of_diff)!!strip(vname(origT(j)));
  end;
  if id ^= cid then output;
 end;
run;
/* end of program */

Note: if you have 10 000+ animals, it may be worthwhile (performance-wise) to turn the data set have1 into a hash table and doing a Hash Object Table Look-up.

I haven't put any comments, hoping that you can grasp the code without. If not, tell me!

 

Cheers,

Koen

View solution in original post

14 REPLIES 14
PeterClemmensen
Tourmaline | Level 20

What does your desired result look like given this example data ?

 

 

MajaAGR
Obsidian | Level 7

Well,

I would like to have variable or variables saying which animals are compared, and one variable with locations of differences

Var1 Var2 diff

1 2 t2,t3,t7,t11

2 3 t11,t12

 

Or something similar. 

it can also be

t1 same in 3 and 4

t2 same in 1 and 2

sbxkoenk
SAS Super FREQ

Hello,

 

I think this program gives you what you want:

id = id of animal x

cid = id of animal y (which is compared to animal x)

location_of_diff = Traits different for animal x and animal y

 

data have ;                                                                                                                                                                                                                                                     
  input id (T1-T12) ($) ;                                                                                                                                                                                                                                        
  cards ;                                                                                                                                                                                                                                                       
1	E	A	E	B	A	A	E	A	C	B	E	F
2	F	A	E	F	E	E	E	E	A	E	E	E
3	A	C	C	E	E	E	A	F	D	D	E	F
4	A	F	F	D	B	B	C	A	A	A	C	C
5	B	B	C	C	D	A	A	A	B	E	E	F
;                                                                                                                                                                                                                                                               
run ;

data have1(rename=(id=cid) drop=i T1-T12);
 set have;
 array origT{12} $  T1- T12;
 array copyT{12} $ cT1-cT12;
 do i=1 to dim(origT);
  copyT(i)=origT(i);
 end;
run;

data _NULL_; 
 if 0 then set have1 nobs=count; 
 call symput('numobs',strip(put(count,8.)));
 STOP;
run;
%PUT &=numobs;

data want(drop=j T1-T12 cT1-cT12);
 LENGTH id cid 8 location_of_diff $ 200;
 set have;
 array origT{12} $  T1- T12;
 array copyT{12} $ cT1-cT12;
 array equaT{12} $ eT1-eT12;
 do pointer=1 to &numobs.;
  set have1 point=pointer;
  location_of_diff='';
  do j=1 to dim(origT);
   if copyT(j)=origT(j) then equaT(j)='Y';
   else equaT(j)='N';
   if equaT(j)='N' then location_of_diff=strip(location_of_diff)!!strip(vname(origT(j)));
  end;
  if id ^= cid then output;
 end;
run;
/* end of program */

Note: if you have 10 000+ animals, it may be worthwhile (performance-wise) to turn the data set have1 into a hash table and doing a Hash Object Table Look-up.

I haven't put any comments, hoping that you can grasp the code without. If not, tell me!

 

Cheers,

Koen

MajaAGR
Obsidian | Level 7
This looks great!
I will try grasp the code and change it to give me the list of similarities instead of differences.
Let me test it on a bigger data set and will let you know

Thanks!
sbxkoenk
SAS Super FREQ

Hello,

 

The key is you want to compare a record to all other records in the same dataset. And you want to repeat this for every record (comparing it to all other records).

That's a kind of Cartesian product.

There are multiple ways to solve this. Many people do this with PROC SQL.

I have done it with a data step:

I take a first record in HAVE and compare it to all records in HAVE1 (HAVE1 is a copy of HAVE with renamed variables).

Then I take a second record in HAVE and compare it to all records in HAVE1.

Until I reach the end of file (last record) of HAVE.

 

Cheers,

Koen

MajaAGR
Obsidian | Level 7
yes, Cartesian product was the last thing I successfully made from the data. Later I stuck with comparisons, your Y and N. Now all works fine. Actually my data looks a bit different so first I need to transpose and rename and.. but at the end this works!
MajaAGR
Obsidian | Level 7

Without issues I changed the code to show same traits instead of different ones.

All works perfectly!

 

Thanks again!

 

Ksharp
Super User

Do you have SAS/IML ?

Otherwise, that would cost you many code.

 

data have ;                                                                                                                                                                                                                                                     
  input id (T1-T12) ($) ;                                                                                                                                                                                                                                        
  cards ;                                                                                                                                                                                                                                                       
1	E	A	E	B	A	A	E	A	C	B	E	F
2	F	A	E	F	E	E	E	E	A	E	E	E
3	A	C	C	E	E	E	A	F	D	D	E	F
4	A	F	F	D	B	B	C	A	A	A	C	C
5	B	B	C	C	D	A	A	A	B	E	E	F                                                                                                                                                                                                                                                
;                                                                                                                                                                                                                                                               
run ;

proc iml;
use have;
read all var _char_ into t[c=vname];

do i=1 to nrow(t)-1;
 do j=i+1 to nrow(t);
   idx=loc(t[i,] ^= t[j,]) ;
   cat=rowcat(putc(t(vname[idx]),'$8.'));
   var1=var1//i;
   var2=var2//j;
   diff=diff//cat;
 end;
end;
create want var {var1 var2 diff};
append;
close;
quit;

proc print;run;

image.png

MajaAGR
Obsidian | Level 7
This one works well too.
I have IML and will see if I can go for this solution. However, I need positions where they are the same.
Ksharp
Super User

Same postion is easy too.

Just change 

idx=loc(t[i,] ^= t[j,]) ;

 

into 

idx=loc(t[i,] = t[j,]) ;

 

image.png

Ksharp
Super User
Of course. For the data step just change 
    if c{var1} ne c{var2} then output;

into 
    if c{var1} = c{var2} then output;

image.png

Ksharp
Super User
data have ;                                                                                                                                                                                                                                                     
  input id (T1-T12) ($) ;                                                                                                                                                                                                                                        
  cards ;                                                                                                                                                                                                                                                       
1	E	A	E	B	A	A	E	A	C	B	E	F
2	F	A	E	F	E	E	E	E	A	E	E	E
3	A	C	C	E	E	E	A	F	D	D	E	F
4	A	F	F	D	B	B	C	A	A	A	C	C
5	B	B	C	C	D	A	A	A	B	E	E	F                                                                                                                                                                                                                                                
;                                                                                                                                                                                                                                                               
run ;

proc transpose data=have out=temp;
var t:;
run;

data temp1;
 set temp;
 array c{*} col:;
 do var1=1 to dim(c)-1;
  do var2=var1+1 to dim(c);
    if c{var1} ne c{var2} then output;
  end;
 end;
keep _name_ var1 var2;
run;
proc sort data=temp1 out=temp2;
by var1 var2 _name_;
run;
data want;
 do until(last.var2);
  set temp2;
  by var1 var2;
  length diff $ 400;
  diff=catx(',',diff,_name_);
 end;
 drop _name_;
run;
PeterClemmensen
Tourmaline | Level 20

My 2 cents.

 

Should be reasonably fast as well.

 

data have ;                                                                                                                                                                                                                                                     
  input id (T1-T12) ($) ;                                                                                                                                                                                                                                        
  cards ;                                                                                                                                                                                                                                                       
1	E	A	E	B	A	A	E	A	C	B	E	F
2	F	A	E	F	E	E	E	E	A	E	E	E
3	A	C	C	E	E	E	A	F	D	D	E	F
4	A	F	F	D	B	B	C	A	A	A	C	C
5	B	B	C	C	D	A	A	A	B	E	E	F
;                                     

data want(keep = id id2 same);

   set have;
   id2 = .;
   length same $ 1000;
   same = '';
   array t {*} $ t1  - t12;
   array tt{*} $ tt1 - tt12;

   do i = 1 to dim(tt);
      tt[i] = t[i];
   end;

   if _N_ = 1 then do;
      dcl hash h(dataset : "have(rename = id = id2)", ordered : "Y");
      h.definekey("id2");
      h.definedata("id2");
      do i = 1 to dim(t);
         h.definedata(vname(t[i]));
      end;
      h.definedone();
      dcl hiter hi("h");
   end;

   hi.setcur(key : id);

   do while (hi.next() = 0);
      do i = 1 to dim(t);
         if t[i] = tt[i] then same = catx(',', same, vname(t[i]));
      end;
      output;
   end;

run;
PeterClemmensen
Tourmaline | Level 20

Small correction.

data have ;                                                                                                                                                                                                                                                     
  input id (T1-T12) ($) ;                                                                                                                                                                                                                                        
  cards ;                                                                                                                                                                                                                                                       
1	E	A	E	B	A	A	E	A	C	B	E	F
2	F	A	E	F	E	E	E	E	A	E	E	E
3	A	C	C	E	E	E	A	F	D	D	E	F
4	A	F	F	D	B	B	C	A	A	A	C	C
5	B	B	C	C	D	A	A	A	B	E	E	F
;                                     

data want(keep = id id2 same);

   set have;
   id2 = .;
   length same $ 1000;
   same = '';
   array t {*}  $ t1  - t12;
   array tt{12} $ _temporary_;

   do i = 1 to dim(tt);
      tt[i] = t[i];
   end;

   if _N_ = 1 then do;
      dcl hash h(dataset : "have(rename = id = id2)", ordered : "Y");
      h.definekey("id2");
      h.definedata("id2");
      do i = 1 to dim(t);
         h.definedata(vname(t[i]));
      end;
      h.definedone();
      dcl hiter hi("h");
   end;

   hi.setcur(key : id);

   do while (hi.next() = 0);
      do i = 1 to dim(t);
         if t[i] = tt[i] then same = catx(',', same, vname(t[i]));
      end;
      output;
   end;

run;

sas-innovate-wordmark-2025-midnight.png

Register Today!

Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9. Sign up by March 14 for just $795.


Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 14 replies
  • 1894 views
  • 10 likes
  • 4 in conversation