Hello!
Recently I was presented with this kind of data
data have ;
input id (T1-T12) ($) ;
cards ;
1 E A E B A A E A C B E F
2 F A E F E E E E A E E E
3 A C C E E E A F D D E F
4 A F F D B B C A A A C C
5 B B C C D A A A B E E F
;
run ;
Here we have 5 person (animal) with 12 variables describing some its traits. At every trait we can have score from A to F.
This scores can be interpreted also as a string containing all scores.
What would be goal is to compare scores between animals and to say where are they the same.
For example: id1 vs id2 are same at: t2,t3,t7,t11
id1 vs id2 same at: t11,t12
etc..
In original data set I have and will have much more animals.
I appreciate all of inputs!
Cheers!
Hello,
I think this program gives you what you want:
id = id of animal x
cid = id of animal y (which is compared to animal x)
location_of_diff = Traits different for animal x and animal y
data have ;
input id (T1-T12) ($) ;
cards ;
1 E A E B A A E A C B E F
2 F A E F E E E E A E E E
3 A C C E E E A F D D E F
4 A F F D B B C A A A C C
5 B B C C D A A A B E E F
;
run ;
data have1(rename=(id=cid) drop=i T1-T12);
set have;
array origT{12} $ T1- T12;
array copyT{12} $ cT1-cT12;
do i=1 to dim(origT);
copyT(i)=origT(i);
end;
run;
data _NULL_;
if 0 then set have1 nobs=count;
call symput('numobs',strip(put(count,8.)));
STOP;
run;
%PUT &=numobs;
data want(drop=j T1-T12 cT1-cT12);
LENGTH id cid 8 location_of_diff $ 200;
set have;
array origT{12} $ T1- T12;
array copyT{12} $ cT1-cT12;
array equaT{12} $ eT1-eT12;
do pointer=1 to &numobs.;
set have1 point=pointer;
location_of_diff='';
do j=1 to dim(origT);
if copyT(j)=origT(j) then equaT(j)='Y';
else equaT(j)='N';
if equaT(j)='N' then location_of_diff=strip(location_of_diff)!!strip(vname(origT(j)));
end;
if id ^= cid then output;
end;
run;
/* end of program */
Note: if you have 10 000+ animals, it may be worthwhile (performance-wise) to turn the data set have1 into a hash table and doing a Hash Object Table Look-up.
I haven't put any comments, hoping that you can grasp the code without. If not, tell me!
Cheers,
Koen
What does your desired result look like given this example data ?
Well,
I would like to have variable or variables saying which animals are compared, and one variable with locations of differences
Var1 Var2 diff
1 2 t2,t3,t7,t11
2 3 t11,t12
Or something similar.
it can also be
t1 same in 3 and 4
t2 same in 1 and 2
Hello,
I think this program gives you what you want:
id = id of animal x
cid = id of animal y (which is compared to animal x)
location_of_diff = Traits different for animal x and animal y
data have ;
input id (T1-T12) ($) ;
cards ;
1 E A E B A A E A C B E F
2 F A E F E E E E A E E E
3 A C C E E E A F D D E F
4 A F F D B B C A A A C C
5 B B C C D A A A B E E F
;
run ;
data have1(rename=(id=cid) drop=i T1-T12);
set have;
array origT{12} $ T1- T12;
array copyT{12} $ cT1-cT12;
do i=1 to dim(origT);
copyT(i)=origT(i);
end;
run;
data _NULL_;
if 0 then set have1 nobs=count;
call symput('numobs',strip(put(count,8.)));
STOP;
run;
%PUT &=numobs;
data want(drop=j T1-T12 cT1-cT12);
LENGTH id cid 8 location_of_diff $ 200;
set have;
array origT{12} $ T1- T12;
array copyT{12} $ cT1-cT12;
array equaT{12} $ eT1-eT12;
do pointer=1 to &numobs.;
set have1 point=pointer;
location_of_diff='';
do j=1 to dim(origT);
if copyT(j)=origT(j) then equaT(j)='Y';
else equaT(j)='N';
if equaT(j)='N' then location_of_diff=strip(location_of_diff)!!strip(vname(origT(j)));
end;
if id ^= cid then output;
end;
run;
/* end of program */
Note: if you have 10 000+ animals, it may be worthwhile (performance-wise) to turn the data set have1 into a hash table and doing a Hash Object Table Look-up.
I haven't put any comments, hoping that you can grasp the code without. If not, tell me!
Cheers,
Koen
Hello,
The key is you want to compare a record to all other records in the same dataset. And you want to repeat this for every record (comparing it to all other records).
That's a kind of Cartesian product.
There are multiple ways to solve this. Many people do this with PROC SQL.
I have done it with a data step:
I take a first record in HAVE and compare it to all records in HAVE1 (HAVE1 is a copy of HAVE with renamed variables).
Then I take a second record in HAVE and compare it to all records in HAVE1.
Until I reach the end of file (last record) of HAVE.
Cheers,
Koen
Without issues I changed the code to show same traits instead of different ones.
All works perfectly!
Thanks again!
Do you have SAS/IML ?
Otherwise, that would cost you many code.
data have ;
input id (T1-T12) ($) ;
cards ;
1 E A E B A A E A C B E F
2 F A E F E E E E A E E E
3 A C C E E E A F D D E F
4 A F F D B B C A A A C C
5 B B C C D A A A B E E F
;
run ;
proc iml;
use have;
read all var _char_ into t[c=vname];
do i=1 to nrow(t)-1;
do j=i+1 to nrow(t);
idx=loc(t[i,] ^= t[j,]) ;
cat=rowcat(putc(t(vname[idx]),'$8.'));
var1=var1//i;
var2=var2//j;
diff=diff//cat;
end;
end;
create want var {var1 var2 diff};
append;
close;
quit;
proc print;run;
Same postion is easy too.
Just change
idx=loc(t[i,] ^= t[j,]) ;
into
idx=loc(t[i,] = t[j,]) ;
Of course. For the data step just change if c{var1} ne c{var2} then output; into if c{var1} = c{var2} then output;
data have ;
input id (T1-T12) ($) ;
cards ;
1 E A E B A A E A C B E F
2 F A E F E E E E A E E E
3 A C C E E E A F D D E F
4 A F F D B B C A A A C C
5 B B C C D A A A B E E F
;
run ;
proc transpose data=have out=temp;
var t:;
run;
data temp1;
set temp;
array c{*} col:;
do var1=1 to dim(c)-1;
do var2=var1+1 to dim(c);
if c{var1} ne c{var2} then output;
end;
end;
keep _name_ var1 var2;
run;
proc sort data=temp1 out=temp2;
by var1 var2 _name_;
run;
data want;
do until(last.var2);
set temp2;
by var1 var2;
length diff $ 400;
diff=catx(',',diff,_name_);
end;
drop _name_;
run;
My 2 cents.
Should be reasonably fast as well.
data have ;
input id (T1-T12) ($) ;
cards ;
1 E A E B A A E A C B E F
2 F A E F E E E E A E E E
3 A C C E E E A F D D E F
4 A F F D B B C A A A C C
5 B B C C D A A A B E E F
;
data want(keep = id id2 same);
set have;
id2 = .;
length same $ 1000;
same = '';
array t {*} $ t1 - t12;
array tt{*} $ tt1 - tt12;
do i = 1 to dim(tt);
tt[i] = t[i];
end;
if _N_ = 1 then do;
dcl hash h(dataset : "have(rename = id = id2)", ordered : "Y");
h.definekey("id2");
h.definedata("id2");
do i = 1 to dim(t);
h.definedata(vname(t[i]));
end;
h.definedone();
dcl hiter hi("h");
end;
hi.setcur(key : id);
do while (hi.next() = 0);
do i = 1 to dim(t);
if t[i] = tt[i] then same = catx(',', same, vname(t[i]));
end;
output;
end;
run;
Small correction.
data have ;
input id (T1-T12) ($) ;
cards ;
1 E A E B A A E A C B E F
2 F A E F E E E E A E E E
3 A C C E E E A F D D E F
4 A F F D B B C A A A C C
5 B B C C D A A A B E E F
;
data want(keep = id id2 same);
set have;
id2 = .;
length same $ 1000;
same = '';
array t {*} $ t1 - t12;
array tt{12} $ _temporary_;
do i = 1 to dim(tt);
tt[i] = t[i];
end;
if _N_ = 1 then do;
dcl hash h(dataset : "have(rename = id = id2)", ordered : "Y");
h.definekey("id2");
h.definedata("id2");
do i = 1 to dim(t);
h.definedata(vname(t[i]));
end;
h.definedone();
dcl hiter hi("h");
end;
hi.setcur(key : id);
do while (hi.next() = 0);
do i = 1 to dim(t);
if t[i] = tt[i] then same = catx(',', same, vname(t[i]));
end;
output;
end;
run;
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.