BookmarkSubscribeRSS Feed
kyle234
Obsidian | Level 7

Is there an easy way to create a variable called match which will list all the IDs that are the same? The data is in the format below. Any help would be greatly appreciated.

 

 

ID_A      ID_B     ID_C     ID_D      MATCH

1               1           1              1            A,B,C,D

2              2           2             3           A,B,C

3              4          4            4            B,C,D

1               1            3           5             A.B

5 REPLIES 5
Kurt_Bremser
Super User

Steps to follow:

Create a variable to uniquely identify the observations (e.g. derived from _N_).

Transpose the ID: variables, use the above variable as BY.

Sort by this unique id, COL1 and _NAME_;

Then

data want;
set long;
by unique_id col1;
length match $10;
retain match count;
if first.col1
then do;
  match = scan(_name_,2,"_");
  count = 1;
end;
else do;
  match = catx(",",match,scan(_name_,2,"_"));
  count + 1;
end;
if last.col1 and count > 1;
drop count _name_ /* col1 */;
run;
ChrisNZ
Tourmaline | Level 20

Like this?

 

data SAMPLE;
   A=1; B=1; C=1; D=3; output;
   A=1; B=2; C=2; D=2; output;
   A=1; B=2; C=3; D=4; output;
run;

data WANT;
  set SAMPLE;
  _POS = prxmatch('/(.)\1/', cats(A,B,C,D));          * Find the position of a repeated value;
  if _POS then do;                                    * If found then analyse;
    _CHAR = char(cats(A,B,C,D), _POS);                * Extract repeated value;
    MATCH = catx(',', ifc(put(A,1.)=_CHAR,'A','')
                    , ifc(put(B,1.)=_CHAR,'B','')
                    , ifc(put(C,1.)=_CHAR,'C','')
                    , ifc(put(D,1.)=_CHAR,'D','')
                );                                    * Extract occurrences;
  end; 
  drop _:;
run;

This creates this table:

A B C D MATCH
1 1 1 3 A,B,C
1 2 2 2 B,C,D
1 2 3 4  

 

 

ChrisNZ
Tourmaline | Level 20

If the values are longer than a digit, or not contiguous:

data SAMPLE;
   A=11; B=11; C=11; D= 3; output;
   A= 1; B=22; C= 4; D=22; output;
   A= 1; B= 2; C= 3; D= 4; output;
run;

data WANT;
  set SAMPLE;   
  call sort (of A--D); 
  _STR   = catx('|', of A--D, .);
  _VALUE = prxchange('s/.*((\d+)\|)\1.*/\2/',1, _STR);
  set SAMPLE;   
  if _VALUE ne _STR then do;
    MATCH = catx(',', ifc(put(A,8. -l)=_VALUE, 'A', '')
                    , ifc(put(B,8. -l)=_VALUE, 'B', '')
                    , ifc(put(C,8. -l)=_VALUE, 'C', '')
                    , ifc(put(D,8. -l)=_VALUE, 'D', '')
                );
  end; 
  drop _: ;
run; 
   
A B C D MATCH
11 11 11 3 A,B,C
1 22 4 22 B,D
1 2 3 4  

 

The regular expression

.*((\d+)\|)\1.*/\2/

means:

/ start match pattern

.* any character(s), then

    start a group, then

  (\d+)  another ,group with one digit or more, then

  \|  a pipe character, then

) close group, then

/1  repeat the group just matched (so we want the same value twice), then

.* any character(s)

/ end match pattern, start replace pattern

\1  replace matched characters with 2nd (inner) group: (\d+)

 / end of replace pattern

 

 

You can shorten to something more compact if you want:

data WANT;
  set SAMPLE;   
  call sort (of A--D); 
  _VALUE = prxchange('s/.*((\d+)\|)\1.*/\2/',1, catx('|', of A--D, .));
  set SAMPLE;   
  MATCH = catx(',', ifc(put(A, 8. -l)=_VALUE, 'A', '')
                  , ifc(put(B, 8. -l)=_VALUE, 'B', '')
	              , ifc(put(C, 8. -l)=_VALUE, 'C', '')
	              , ifc(put(D, 8. -l)=_VALUE, 'D', '')
	           );
  drop _: ;
run;

or

data WANT;
  set SAMPLE;   
  call sort(of A--D); 
  _VALUE = input(prxchange('s/.*((\d+)\|)\1.*/\2/', 1, catx('|', of A--D, .)), ?? 8.);
  set SAMPLE;   
  MATCH = catx(',', ifc(A=_VALUE, 'A', '')
                  , ifc(B=_VALUE, 'B', '')
                  , ifc(C=_VALUE, 'C', '')
                  , ifc(D=_VALUE, 'D', '')
              );
   drop _: ;
run; 

 

 

s_lassen
Meteorite | Level 14

Assuming data like this (I added the ROW_ID variable to identify rows):

data have;
  row_id=_N_;
  input ID_A ID_B ID_C ID_D;
cards;
1 1 1 1
2 2 2 3
3 4 4 4
1 1 3 5
2 2 6 6
1 2 3 4
;run;

I would first convert to a long format and sort:

proc transpose data=have out=long;
  var ID_:;
  by row_id;
run;

proc sort data=long;
  by row_id col1;
run;

 

Then it is easy to find the matches:

data match;
  do until(last.col1);
    set long;
    by row_id col1;
    length match $10;
    call catx(',',match,substr(_name_,4));
    end;
  if index(match,','); /* not single values */
  keep row_id match;
run;

And finally merge with the original data:

data want;
  merge have match;
  by row_id;
run;

The answer to @Kurt_Bremser 's question is in this case that if there are two matching pairs, we will create an output row for each.

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 5 replies
  • 361 views
  • 0 likes
  • 4 in conversation