Hello!
Recently I was presented with this kind of data
data have ;
input id (T1-T12) ($) ;
cards ;
1 E A E B A A E A C B E F
2 F A E F E E E E A E E E
3 A C C E E E A F D D E F
4 A F F D B B C A A A C C
5 B B C C D A A A B E E F
;
run ;
Here we have 5 person (animal) with 12 variables describing some its traits. At every trait we can have score from A to F.
This scores can be interpreted also as a string containing all scores.
What would be goal is to compare scores between animals and to say where are they the same.
For example: id1 vs id2 are same at: t2,t3,t7,t11
id1 vs id2 same at: t11,t12
etc..
In original data set I have and will have much more animals.
I appreciate all of inputs!
Cheers!
Hello,
I think this program gives you what you want:
id = id of animal x
cid = id of animal y (which is compared to animal x)
location_of_diff = Traits different for animal x and animal y
data have ;
input id (T1-T12) ($) ;
cards ;
1 E A E B A A E A C B E F
2 F A E F E E E E A E E E
3 A C C E E E A F D D E F
4 A F F D B B C A A A C C
5 B B C C D A A A B E E F
;
run ;
data have1(rename=(id=cid) drop=i T1-T12);
set have;
array origT{12} $ T1- T12;
array copyT{12} $ cT1-cT12;
do i=1 to dim(origT);
copyT(i)=origT(i);
end;
run;
data _NULL_;
if 0 then set have1 nobs=count;
call symput('numobs',strip(put(count,8.)));
STOP;
run;
%PUT &=numobs;
data want(drop=j T1-T12 cT1-cT12);
LENGTH id cid 8 location_of_diff $ 200;
set have;
array origT{12} $ T1- T12;
array copyT{12} $ cT1-cT12;
array equaT{12} $ eT1-eT12;
do pointer=1 to &numobs.;
set have1 point=pointer;
location_of_diff='';
do j=1 to dim(origT);
if copyT(j)=origT(j) then equaT(j)='Y';
else equaT(j)='N';
if equaT(j)='N' then location_of_diff=strip(location_of_diff)!!strip(vname(origT(j)));
end;
if id ^= cid then output;
end;
run;
/* end of program */
Note: if you have 10 000+ animals, it may be worthwhile (performance-wise) to turn the data set have1 into a hash table and doing a Hash Object Table Look-up.
I haven't put any comments, hoping that you can grasp the code without. If not, tell me!
Cheers,
Koen
What does your desired result look like given this example data ?
Well,
I would like to have variable or variables saying which animals are compared, and one variable with locations of differences
Var1 Var2 diff
1 2 t2,t3,t7,t11
2 3 t11,t12
Or something similar.
it can also be
t1 same in 3 and 4
t2 same in 1 and 2
Hello,
I think this program gives you what you want:
id = id of animal x
cid = id of animal y (which is compared to animal x)
location_of_diff = Traits different for animal x and animal y
data have ;
input id (T1-T12) ($) ;
cards ;
1 E A E B A A E A C B E F
2 F A E F E E E E A E E E
3 A C C E E E A F D D E F
4 A F F D B B C A A A C C
5 B B C C D A A A B E E F
;
run ;
data have1(rename=(id=cid) drop=i T1-T12);
set have;
array origT{12} $ T1- T12;
array copyT{12} $ cT1-cT12;
do i=1 to dim(origT);
copyT(i)=origT(i);
end;
run;
data _NULL_;
if 0 then set have1 nobs=count;
call symput('numobs',strip(put(count,8.)));
STOP;
run;
%PUT &=numobs;
data want(drop=j T1-T12 cT1-cT12);
LENGTH id cid 8 location_of_diff $ 200;
set have;
array origT{12} $ T1- T12;
array copyT{12} $ cT1-cT12;
array equaT{12} $ eT1-eT12;
do pointer=1 to &numobs.;
set have1 point=pointer;
location_of_diff='';
do j=1 to dim(origT);
if copyT(j)=origT(j) then equaT(j)='Y';
else equaT(j)='N';
if equaT(j)='N' then location_of_diff=strip(location_of_diff)!!strip(vname(origT(j)));
end;
if id ^= cid then output;
end;
run;
/* end of program */
Note: if you have 10 000+ animals, it may be worthwhile (performance-wise) to turn the data set have1 into a hash table and doing a Hash Object Table Look-up.
I haven't put any comments, hoping that you can grasp the code without. If not, tell me!
Cheers,
Koen
Hello,
The key is you want to compare a record to all other records in the same dataset. And you want to repeat this for every record (comparing it to all other records).
That's a kind of Cartesian product.
There are multiple ways to solve this. Many people do this with PROC SQL.
I have done it with a data step:
I take a first record in HAVE and compare it to all records in HAVE1 (HAVE1 is a copy of HAVE with renamed variables).
Then I take a second record in HAVE and compare it to all records in HAVE1.
Until I reach the end of file (last record) of HAVE.
Cheers,
Koen
Without issues I changed the code to show same traits instead of different ones.
All works perfectly!
Thanks again!
Do you have SAS/IML ?
Otherwise, that would cost you many code.
data have ;
input id (T1-T12) ($) ;
cards ;
1 E A E B A A E A C B E F
2 F A E F E E E E A E E E
3 A C C E E E A F D D E F
4 A F F D B B C A A A C C
5 B B C C D A A A B E E F
;
run ;
proc iml;
use have;
read all var _char_ into t[c=vname];
do i=1 to nrow(t)-1;
do j=i+1 to nrow(t);
idx=loc(t[i,] ^= t[j,]) ;
cat=rowcat(putc(t(vname[idx]),'$8.'));
var1=var1//i;
var2=var2//j;
diff=diff//cat;
end;
end;
create want var {var1 var2 diff};
append;
close;
quit;
proc print;run;
Same postion is easy too.
Just change
idx=loc(t[i,] ^= t[j,]) ;
into
idx=loc(t[i,] = t[j,]) ;
Of course. For the data step just change if c{var1} ne c{var2} then output; into if c{var1} = c{var2} then output;
data have ;
input id (T1-T12) ($) ;
cards ;
1 E A E B A A E A C B E F
2 F A E F E E E E A E E E
3 A C C E E E A F D D E F
4 A F F D B B C A A A C C
5 B B C C D A A A B E E F
;
run ;
proc transpose data=have out=temp;
var t:;
run;
data temp1;
set temp;
array c{*} col:;
do var1=1 to dim(c)-1;
do var2=var1+1 to dim(c);
if c{var1} ne c{var2} then output;
end;
end;
keep _name_ var1 var2;
run;
proc sort data=temp1 out=temp2;
by var1 var2 _name_;
run;
data want;
do until(last.var2);
set temp2;
by var1 var2;
length diff $ 400;
diff=catx(',',diff,_name_);
end;
drop _name_;
run;
My 2 cents.
Should be reasonably fast as well.
data have ;
input id (T1-T12) ($) ;
cards ;
1 E A E B A A E A C B E F
2 F A E F E E E E A E E E
3 A C C E E E A F D D E F
4 A F F D B B C A A A C C
5 B B C C D A A A B E E F
;
data want(keep = id id2 same);
set have;
id2 = .;
length same $ 1000;
same = '';
array t {*} $ t1 - t12;
array tt{*} $ tt1 - tt12;
do i = 1 to dim(tt);
tt[i] = t[i];
end;
if _N_ = 1 then do;
dcl hash h(dataset : "have(rename = id = id2)", ordered : "Y");
h.definekey("id2");
h.definedata("id2");
do i = 1 to dim(t);
h.definedata(vname(t[i]));
end;
h.definedone();
dcl hiter hi("h");
end;
hi.setcur(key : id);
do while (hi.next() = 0);
do i = 1 to dim(t);
if t[i] = tt[i] then same = catx(',', same, vname(t[i]));
end;
output;
end;
run;
Small correction.
data have ;
input id (T1-T12) ($) ;
cards ;
1 E A E B A A E A C B E F
2 F A E F E E E E A E E E
3 A C C E E E A F D D E F
4 A F F D B B C A A A C C
5 B B C C D A A A B E E F
;
data want(keep = id id2 same);
set have;
id2 = .;
length same $ 1000;
same = '';
array t {*} $ t1 - t12;
array tt{12} $ _temporary_;
do i = 1 to dim(tt);
tt[i] = t[i];
end;
if _N_ = 1 then do;
dcl hash h(dataset : "have(rename = id = id2)", ordered : "Y");
h.definekey("id2");
h.definedata("id2");
do i = 1 to dim(t);
h.definedata(vname(t[i]));
end;
h.definedone();
dcl hiter hi("h");
end;
hi.setcur(key : id);
do while (hi.next() = 0);
do i = 1 to dim(t);
if t[i] = tt[i] then same = catx(',', same, vname(t[i]));
end;
output;
end;
run;
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.