Obsidian | Level 7

## Comparing multiple character variables between IDs

Hello!

Recently I was presented with this kind of data

``````data have ;
input id (T1-T12) (\$) ;
cards ;
1	E	A	E	B	A	A	E	A	C	B	E	F
2	F	A	E	F	E	E	E	E	A	E	E	E
3	A	C	C	E	E	E	A	F	D	D	E	F
4	A	F	F	D	B	B	C	A	A	A	C	C
5	B	B	C	C	D	A	A	A	B	E	E	F
;
run ;``````

Here we have 5 person (animal) with 12 variables describing some its traits. At every trait we can have score from A to F.

This scores can be interpreted also as a string containing all scores.

What would be goal is to compare scores between animals and to say where are they the same.

For example: id1 vs id2 are same at: t2,t3,t7,t11

id1 vs id2 same at: t11,t12

etc..

In original data set I have and will have much more animals.

I appreciate all of inputs!

Cheers!

1 ACCEPTED SOLUTION

Accepted Solutions
SAS Super FREQ

## Re: Comparing multiple character variables between IDs

Hello,

I think this program gives you what you want:

id = id of animal x

cid = id of animal y (which is compared to animal x)

location_of_diff = Traits different for animal x and animal y

``````data have ;
input id (T1-T12) (\$) ;
cards ;
1	E	A	E	B	A	A	E	A	C	B	E	F
2	F	A	E	F	E	E	E	E	A	E	E	E
3	A	C	C	E	E	E	A	F	D	D	E	F
4	A	F	F	D	B	B	C	A	A	A	C	C
5	B	B	C	C	D	A	A	A	B	E	E	F
;
run ;

data have1(rename=(id=cid) drop=i T1-T12);
set have;
array origT{12} \$  T1- T12;
array copyT{12} \$ cT1-cT12;
do i=1 to dim(origT);
copyT(i)=origT(i);
end;
run;

data _NULL_;
if 0 then set have1 nobs=count;
call symput('numobs',strip(put(count,8.)));
STOP;
run;
%PUT &=numobs;

data want(drop=j T1-T12 cT1-cT12);
LENGTH id cid 8 location_of_diff \$ 200;
set have;
array origT{12} \$  T1- T12;
array copyT{12} \$ cT1-cT12;
array equaT{12} \$ eT1-eT12;
do pointer=1 to &numobs.;
set have1 point=pointer;
location_of_diff='';
do j=1 to dim(origT);
if copyT(j)=origT(j) then equaT(j)='Y';
else equaT(j)='N';
if equaT(j)='N' then location_of_diff=strip(location_of_diff)!!strip(vname(origT(j)));
end;
if id ^= cid then output;
end;
run;
/* end of program */``````

Note: if you have 10 000+ animals, it may be worthwhile (performance-wise) to turn the data set have1 into a hash table and doing a Hash Object Table Look-up.

I haven't put any comments, hoping that you can grasp the code without. If not, tell me!

Cheers,

Koen

14 REPLIES 14
Tourmaline | Level 20

## Re: Comparing multiple character variables between IDs

What does your desired result look like given this example data ?

Obsidian | Level 7

## Re: Comparing multiple character variables between IDs

Well,

I would like to have variable or variables saying which animals are compared, and one variable with locations of differences

Var1 Var2 diff

1 2 t2,t3,t7,t11

2 3 t11,t12

Or something similar.

it can also be

t1 same in 3 and 4

t2 same in 1 and 2

SAS Super FREQ

## Re: Comparing multiple character variables between IDs

Hello,

I think this program gives you what you want:

id = id of animal x

cid = id of animal y (which is compared to animal x)

location_of_diff = Traits different for animal x and animal y

``````data have ;
input id (T1-T12) (\$) ;
cards ;
1	E	A	E	B	A	A	E	A	C	B	E	F
2	F	A	E	F	E	E	E	E	A	E	E	E
3	A	C	C	E	E	E	A	F	D	D	E	F
4	A	F	F	D	B	B	C	A	A	A	C	C
5	B	B	C	C	D	A	A	A	B	E	E	F
;
run ;

data have1(rename=(id=cid) drop=i T1-T12);
set have;
array origT{12} \$  T1- T12;
array copyT{12} \$ cT1-cT12;
do i=1 to dim(origT);
copyT(i)=origT(i);
end;
run;

data _NULL_;
if 0 then set have1 nobs=count;
call symput('numobs',strip(put(count,8.)));
STOP;
run;
%PUT &=numobs;

data want(drop=j T1-T12 cT1-cT12);
LENGTH id cid 8 location_of_diff \$ 200;
set have;
array origT{12} \$  T1- T12;
array copyT{12} \$ cT1-cT12;
array equaT{12} \$ eT1-eT12;
do pointer=1 to &numobs.;
set have1 point=pointer;
location_of_diff='';
do j=1 to dim(origT);
if copyT(j)=origT(j) then equaT(j)='Y';
else equaT(j)='N';
if equaT(j)='N' then location_of_diff=strip(location_of_diff)!!strip(vname(origT(j)));
end;
if id ^= cid then output;
end;
run;
/* end of program */``````

Note: if you have 10 000+ animals, it may be worthwhile (performance-wise) to turn the data set have1 into a hash table and doing a Hash Object Table Look-up.

I haven't put any comments, hoping that you can grasp the code without. If not, tell me!

Cheers,

Koen

Obsidian | Level 7

## Re: Comparing multiple character variables between IDs

This looks great!
I will try grasp the code and change it to give me the list of similarities instead of differences.
Let me test it on a bigger data set and will let you know

Thanks!
SAS Super FREQ

## Re: Comparing multiple character variables between IDs

Hello,

The key is you want to compare a record to all other records in the same dataset. And you want to repeat this for every record (comparing it to all other records).

That's a kind of Cartesian product.

There are multiple ways to solve this. Many people do this with PROC SQL.

I have done it with a data step:

I take a first record in HAVE and compare it to all records in HAVE1 (HAVE1 is a copy of HAVE with renamed variables).

Then I take a second record in HAVE and compare it to all records in HAVE1.

Until I reach the end of file (last record) of HAVE.

Cheers,

Koen

Obsidian | Level 7

## Re: Comparing multiple character variables between IDs

yes, Cartesian product was the last thing I successfully made from the data. Later I stuck with comparisons, your Y and N. Now all works fine. Actually my data looks a bit different so first I need to transpose and rename and.. but at the end this works!
Obsidian | Level 7

## Re: Comparing multiple character variables between IDs

Without issues I changed the code to show same traits instead of different ones.

All works perfectly!

Thanks again!

Super User

## Re: Comparing multiple character variables between IDs

Do you have SAS/IML ?

Otherwise, that would cost you many code.

``````data have ;
input id (T1-T12) (\$) ;
cards ;
1	E	A	E	B	A	A	E	A	C	B	E	F
2	F	A	E	F	E	E	E	E	A	E	E	E
3	A	C	C	E	E	E	A	F	D	D	E	F
4	A	F	F	D	B	B	C	A	A	A	C	C
5	B	B	C	C	D	A	A	A	B	E	E	F
;
run ;

proc iml;
use have;
read all var _char_ into t[c=vname];

do i=1 to nrow(t)-1;
do j=i+1 to nrow(t);
idx=loc(t[i,] ^= t[j,]) ;
cat=rowcat(putc(t(vname[idx]),'\$8.'));
var1=var1//i;
var2=var2//j;
diff=diff//cat;
end;
end;
create want var {var1 var2 diff};
append;
close;
quit;

proc print;run;``````

Obsidian | Level 7

## Re: Comparing multiple character variables between IDs

This one works well too.
I have IML and will see if I can go for this solution. However, I need positions where they are the same.
Super User

## Re: Comparing multiple character variables between IDs

Same postion is easy too.

Just change

idx=loc(t[i,] ^= t[j,]) ;

into

idx=loc(t[i,] = t[j,]) ;

Super User

## Re: Comparing multiple character variables between IDs

```Of course. For the data step just change
if c{var1} ne c{var2} then output;

into
if c{var1} = c{var2} then output;
```

Super User

## Re: Comparing multiple character variables between IDs

``````data have ;
input id (T1-T12) (\$) ;
cards ;
1	E	A	E	B	A	A	E	A	C	B	E	F
2	F	A	E	F	E	E	E	E	A	E	E	E
3	A	C	C	E	E	E	A	F	D	D	E	F
4	A	F	F	D	B	B	C	A	A	A	C	C
5	B	B	C	C	D	A	A	A	B	E	E	F
;
run ;

proc transpose data=have out=temp;
var t:;
run;

data temp1;
set temp;
array c{*} col:;
do var1=1 to dim(c)-1;
do var2=var1+1 to dim(c);
if c{var1} ne c{var2} then output;
end;
end;
keep _name_ var1 var2;
run;
proc sort data=temp1 out=temp2;
by var1 var2 _name_;
run;
data want;
do until(last.var2);
set temp2;
by var1 var2;
length diff \$ 400;
diff=catx(',',diff,_name_);
end;
drop _name_;
run;``````
Tourmaline | Level 20

## Re: Comparing multiple character variables between IDs

My 2 cents.

Should be reasonably fast as well.

``````data have ;
input id (T1-T12) (\$) ;
cards ;
1	E	A	E	B	A	A	E	A	C	B	E	F
2	F	A	E	F	E	E	E	E	A	E	E	E
3	A	C	C	E	E	E	A	F	D	D	E	F
4	A	F	F	D	B	B	C	A	A	A	C	C
5	B	B	C	C	D	A	A	A	B	E	E	F
;

data want(keep = id id2 same);

set have;
id2 = .;
length same \$ 1000;
same = '';
array t {*} \$ t1  - t12;
array tt{*} \$ tt1 - tt12;

do i = 1 to dim(tt);
tt[i] = t[i];
end;

if _N_ = 1 then do;
dcl hash h(dataset : "have(rename = id = id2)", ordered : "Y");
h.definekey("id2");
h.definedata("id2");
do i = 1 to dim(t);
h.definedata(vname(t[i]));
end;
h.definedone();
dcl hiter hi("h");
end;

hi.setcur(key : id);

do while (hi.next() = 0);
do i = 1 to dim(t);
if t[i] = tt[i] then same = catx(',', same, vname(t[i]));
end;
output;
end;

run;
``````
Tourmaline | Level 20

## Re: Comparing multiple character variables between IDs

Small correction.

``````data have ;
input id (T1-T12) (\$) ;
cards ;
1	E	A	E	B	A	A	E	A	C	B	E	F
2	F	A	E	F	E	E	E	E	A	E	E	E
3	A	C	C	E	E	E	A	F	D	D	E	F
4	A	F	F	D	B	B	C	A	A	A	C	C
5	B	B	C	C	D	A	A	A	B	E	E	F
;

data want(keep = id id2 same);

set have;
id2 = .;
length same \$ 1000;
same = '';
array t {*}  \$ t1  - t12;
array tt{12} \$ _temporary_;

do i = 1 to dim(tt);
tt[i] = t[i];
end;

if _N_ = 1 then do;
dcl hash h(dataset : "have(rename = id = id2)", ordered : "Y");
h.definekey("id2");
h.definedata("id2");
do i = 1 to dim(t);
h.definedata(vname(t[i]));
end;
h.definedone();
dcl hiter hi("h");
end;

hi.setcur(key : id);

do while (hi.next() = 0);
do i = 1 to dim(t);
if t[i] = tt[i] then same = catx(',', same, vname(t[i]));
end;
output;
end;

run;
``````

``` ```

Discussion stats
• 14 replies
• 865 views
• 10 likes
• 4 in conversation