BookmarkSubscribeRSS Feed
Obsidian | Level 7

How can I create a variable indicating which values are the same?

Is there an easy way to create a variable called match which will list all the IDs that are the same? The data is in the format below. Any help would be greatly appreciated.

ID_A      ID_B     ID_C     ID_D      MATCH

1               1           1              1            A,B,C,D

2              2           2             3           A,B,C

3              4          4            4            B,C,D

1               1            3           5             A.B

5 REPLIES 5
Super User

Re: How can I create a variable indicating which values are the same?

What if you have a combination of

`1   1   3   3`

?

Super User

Re: How can I create a variable indicating which values are the same?

Steps to follow:

Create a variable to uniquely identify the observations (e.g. derived from _N_).

Transpose the ID: variables, use the above variable as BY.

Sort by this unique id, COL1 and _NAME_;

Then

``````data want;
set long;
by unique_id col1;
length match \$10;
retain match count;
if first.col1
then do;
match = scan(_name_,2,"_");
count = 1;
end;
else do;
match = catx(",",match,scan(_name_,2,"_"));
count + 1;
end;
if last.col1 and count > 1;
drop count _name_ /* col1 */;
run;``````
Tourmaline | Level 20

Re: How can I create a variable indicating which values are the same?

Like this?

``````data SAMPLE;
A=1; B=1; C=1; D=3; output;
A=1; B=2; C=2; D=2; output;
A=1; B=2; C=3; D=4; output;
run;

data WANT;
set SAMPLE;
_POS = prxmatch('/(.)\1/', cats(A,B,C,D));          * Find the position of a repeated value;
if _POS then do;                                    * If found then analyse;
_CHAR = char(cats(A,B,C,D), _POS);                * Extract repeated value;
MATCH = catx(',', ifc(put(A,1.)=_CHAR,'A','')
, ifc(put(B,1.)=_CHAR,'B','')
, ifc(put(C,1.)=_CHAR,'C','')
, ifc(put(D,1.)=_CHAR,'D','')
);                                    * Extract occurrences;
end;
drop _:;
run;``````

This creates this table:

A B C D MATCH
1 1 1 3 A,B,C
1 2 2 2 B,C,D
1 2 3 4

Tourmaline | Level 20

Re: How can I create a variable indicating which values are the same?

If the values are longer than a digit, or not contiguous:

``````data SAMPLE;
A=11; B=11; C=11; D= 3; output;
A= 1; B=22; C= 4; D=22; output;
A= 1; B= 2; C= 3; D= 4; output;
run;

data WANT;
set SAMPLE;
call sort (of A--D);
_STR   = catx('|', of A--D, .);
_VALUE = prxchange('s/.*((\d+)\|)\1.*/\2/',1, _STR);
set SAMPLE;
if _VALUE ne _STR then do;
MATCH = catx(',', ifc(put(A,8. -l)=_VALUE, 'A', '')
, ifc(put(B,8. -l)=_VALUE, 'B', '')
, ifc(put(C,8. -l)=_VALUE, 'C', '')
, ifc(put(D,8. -l)=_VALUE, 'D', '')
);
end;
drop _: ;
run;
``````
A B C D MATCH
11 11 11 3 A,B,C
1 22 4 22 B,D
1 2 3 4

The regular expression

``.*((\d+)\|)\1.*/\2/``

means:

/ start match pattern

.* any character(s), then

start a group, then

(\d+)  another ,group with one digit or more, then

\|  a pipe character, then

) close group, then

/1  repeat the group just matched (so we want the same value twice), then

.* any character(s)

/ end match pattern, start replace pattern

\1  replace matched characters with 2nd (inner) group: (\d+)

/ end of replace pattern

You can shorten to something more compact if you want:

``````data WANT;
set SAMPLE;
call sort (of A--D);
_VALUE = prxchange('s/.*((\d+)\|)\1.*/\2/',1, catx('|', of A--D, .));
set SAMPLE;
MATCH = catx(',', ifc(put(A, 8. -l)=_VALUE, 'A', '')
, ifc(put(B, 8. -l)=_VALUE, 'B', '')
, ifc(put(C, 8. -l)=_VALUE, 'C', '')
, ifc(put(D, 8. -l)=_VALUE, 'D', '')
);
drop _: ;
run;``````

or

``````data WANT;
set SAMPLE;
call sort(of A--D);
_VALUE = input(prxchange('s/.*((\d+)\|)\1.*/\2/', 1, catx('|', of A--D, .)), ?? 8.);
set SAMPLE;
MATCH = catx(',', ifc(A=_VALUE, 'A', '')
, ifc(B=_VALUE, 'B', '')
, ifc(C=_VALUE, 'C', '')
, ifc(D=_VALUE, 'D', '')
);
drop _: ;
run; ``````

Meteorite | Level 14

Re: How can I create a variable indicating which values are the same?

Assuming data like this (I added the ROW_ID variable to identify rows):

``````data have;
row_id=_N_;
input ID_A ID_B ID_C ID_D;
cards;
1 1 1 1
2 2 2 3
3 4 4 4
1 1 3 5
2 2 6 6
1 2 3 4
;run;``````

I would first convert to a long format and sort:

``````proc transpose data=have out=long;
var ID_:;
by row_id;
run;

proc sort data=long;
by row_id col1;
run;``````

Then it is easy to find the matches:

``````data match;
do until(last.col1);
set long;
by row_id col1;
length match \$10;
call catx(',',match,substr(_name_,4));
end;
if index(match,','); /* not single values */
keep row_id match;
run;``````

And finally merge with the original data:

``````data want;
merge have match;
by row_id;
run;``````

The answer to @Kurt_Bremser 's question is in this case that if there are two matching pairs, we will create an output row for each.

Discussion stats
• 5 replies
• 403 views
• 0 likes
• 4 in conversation