Is there an easy way to create a variable called match which will list all the IDs that are the same? The data is in the format below. Any help would be greatly appreciated.
ID_A ID_B ID_C ID_D MATCH
1 1 1 1 A,B,C,D
2 2 2 3 A,B,C
3 4 4 4 B,C,D
1 1 3 5 A.B
What if you have a combination of
1 1 3 3
?
Steps to follow:
Create a variable to uniquely identify the observations (e.g. derived from _N_).
Transpose the ID: variables, use the above variable as BY.
Sort by this unique id, COL1 and _NAME_;
Then
data want;
set long;
by unique_id col1;
length match $10;
retain match count;
if first.col1
then do;
match = scan(_name_,2,"_");
count = 1;
end;
else do;
match = catx(",",match,scan(_name_,2,"_"));
count + 1;
end;
if last.col1 and count > 1;
drop count _name_ /* col1 */;
run;
Like this?
data SAMPLE;
A=1; B=1; C=1; D=3; output;
A=1; B=2; C=2; D=2; output;
A=1; B=2; C=3; D=4; output;
run;
data WANT;
set SAMPLE;
_POS = prxmatch('/(.)\1/', cats(A,B,C,D)); * Find the position of a repeated value;
if _POS then do; * If found then analyse;
_CHAR = char(cats(A,B,C,D), _POS); * Extract repeated value;
MATCH = catx(',', ifc(put(A,1.)=_CHAR,'A','')
, ifc(put(B,1.)=_CHAR,'B','')
, ifc(put(C,1.)=_CHAR,'C','')
, ifc(put(D,1.)=_CHAR,'D','')
); * Extract occurrences;
end;
drop _:;
run;
This creates this table:
A | B | C | D | MATCH |
---|---|---|---|---|
1 | 1 | 1 | 3 | A,B,C |
1 | 2 | 2 | 2 | B,C,D |
1 | 2 | 3 | 4 |
If the values are longer than a digit, or not contiguous:
data SAMPLE;
A=11; B=11; C=11; D= 3; output;
A= 1; B=22; C= 4; D=22; output;
A= 1; B= 2; C= 3; D= 4; output;
run;
data WANT;
set SAMPLE;
call sort (of A--D);
_STR = catx('|', of A--D, .);
_VALUE = prxchange('s/.*((\d+)\|)\1.*/\2/',1, _STR);
set SAMPLE;
if _VALUE ne _STR then do;
MATCH = catx(',', ifc(put(A,8. -l)=_VALUE, 'A', '')
, ifc(put(B,8. -l)=_VALUE, 'B', '')
, ifc(put(C,8. -l)=_VALUE, 'C', '')
, ifc(put(D,8. -l)=_VALUE, 'D', '')
);
end;
drop _: ;
run;
A | B | C | D | MATCH |
---|---|---|---|---|
11 | 11 | 11 | 3 | A,B,C |
1 | 22 | 4 | 22 | B,D |
1 | 2 | 3 | 4 |
The regular expression
.*((\d+)\|)\1.*/\2/
means:
/ start match pattern
.* any character(s), then
( start a group, then
(\d+) another ,group with one digit or more, then
\| a pipe character, then
) close group, then
/1 repeat the group just matched (so we want the same value twice), then
.* any character(s)
/ end match pattern, start replace pattern
\1 replace matched characters with 2nd (inner) group: (\d+)
/ end of replace pattern
You can shorten to something more compact if you want:
data WANT;
set SAMPLE;
call sort (of A--D);
_VALUE = prxchange('s/.*((\d+)\|)\1.*/\2/',1, catx('|', of A--D, .));
set SAMPLE;
MATCH = catx(',', ifc(put(A, 8. -l)=_VALUE, 'A', '')
, ifc(put(B, 8. -l)=_VALUE, 'B', '')
, ifc(put(C, 8. -l)=_VALUE, 'C', '')
, ifc(put(D, 8. -l)=_VALUE, 'D', '')
);
drop _: ;
run;
or
data WANT;
set SAMPLE;
call sort(of A--D);
_VALUE = input(prxchange('s/.*((\d+)\|)\1.*/\2/', 1, catx('|', of A--D, .)), ?? 8.);
set SAMPLE;
MATCH = catx(',', ifc(A=_VALUE, 'A', '')
, ifc(B=_VALUE, 'B', '')
, ifc(C=_VALUE, 'C', '')
, ifc(D=_VALUE, 'D', '')
);
drop _: ;
run;
Assuming data like this (I added the ROW_ID variable to identify rows):
data have;
row_id=_N_;
input ID_A ID_B ID_C ID_D;
cards;
1 1 1 1
2 2 2 3
3 4 4 4
1 1 3 5
2 2 6 6
1 2 3 4
;run;
I would first convert to a long format and sort:
proc transpose data=have out=long;
var ID_:;
by row_id;
run;
proc sort data=long;
by row_id col1;
run;
Then it is easy to find the matches:
data match;
do until(last.col1);
set long;
by row_id col1;
length match $10;
call catx(',',match,substr(_name_,4));
end;
if index(match,','); /* not single values */
keep row_id match;
run;
And finally merge with the original data:
data want;
merge have match;
by row_id;
run;
The answer to @Kurt_Bremser 's question is in this case that if there are two matching pairs, we will create an output row for each.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.