How can I create a variable indicating which values are the same?

kyle234

Is there an easy way to create a variable called match which will list all the IDs that are the same? The data is in the format below. Any help would be greatly appreciated.

ID_A ID_B ID_C ID_D MATCH

1 1 1 1 A,B,C,D

2 2 2 3 A,B,C

3 4 4 4 B,C,D

1 1 3 5 A.B

Kurt_Bremser

What if you have a combination of

1   1   3   3

?

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Kurt_Bremser

Steps to follow:

Create a variable to uniquely identify the observations (e.g. derived from _N_).

Transpose the ID: variables, use the above variable as BY.

Sort by this unique id, COL1 and _NAME_;

Then

data want;
set long;
by unique_id col1;
length match $10;
retain match count;
if first.col1
then do;
  match = scan(_name_,2,"_");
  count = 1;
end;
else do;
  match = catx(",",match,scan(_name_,2,"_"));
  count + 1;
end;
if last.col1 and count > 1;
drop count _name_ /* col1 */;
run;

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

ChrisNZ

Like this?

data SAMPLE;
   A=1; B=1; C=1; D=3; output;
   A=1; B=2; C=2; D=2; output;
   A=1; B=2; C=3; D=4; output;
run;

data WANT;
  set SAMPLE;
  _POS = prxmatch('/(.)\1/', cats(A,B,C,D));          * Find the position of a repeated value;
  if _POS then do;                                    * If found then analyse;
    _CHAR = char(cats(A,B,C,D), _POS);                * Extract repeated value;
    MATCH = catx(',', ifc(put(A,1.)=_CHAR,'A','')
                    , ifc(put(B,1.)=_CHAR,'B','')
                    , ifc(put(C,1.)=_CHAR,'C','')
                    , ifc(put(D,1.)=_CHAR,'D','')
                );                                    * Extract occurrences;
  end; 
  drop _:;
run;

This creates this table:

A	B	C	D	MATCH
1	1	1	3	A,B,C
1	2	2	2	B,C,D
1	2	3	4

High-Performance SAS Coding - Third Edition

ChrisNZ

If the values are longer than a digit, or not contiguous:

data SAMPLE;
   A=11; B=11; C=11; D= 3; output;
   A= 1; B=22; C= 4; D=22; output;
   A= 1; B= 2; C= 3; D= 4; output;
run;

data WANT;
  set SAMPLE;   
  call sort (of A--D); 
  _STR   = catx('|', of A--D, .);
  _VALUE = prxchange('s/.*((\d+)\|)\1.*/\2/',1, _STR);
  set SAMPLE;   
  if _VALUE ne _STR then do;
    MATCH = catx(',', ifc(put(A,8. -l)=_VALUE, 'A', '')
                    , ifc(put(B,8. -l)=_VALUE, 'B', '')
                    , ifc(put(C,8. -l)=_VALUE, 'C', '')
                    , ifc(put(D,8. -l)=_VALUE, 'D', '')
                );
  end; 
  drop _: ;
run;

A	B	C	D	MATCH
11	11	11	3	A,B,C
1	22	4	22	B,D
1	2	3	4

The regular expression

.*((\d+)\|)\1.*/\2/

means:

/ start match pattern

.* any character(s), then

( start a group, then

(\d+) another ,group with one digit or more, then

\| a pipe character, then

) close group, then

/1 repeat the group just matched (so we want the same value twice), then

.* any character(s)

/ end match pattern, start replace pattern

\1 replace matched characters with 2nd (inner) group: (\d+)

/ end of replace pattern

You can shorten to something more compact if you want:

data WANT;
  set SAMPLE;   
  call sort (of A--D); 
  _VALUE = prxchange('s/.*((\d+)\|)\1.*/\2/',1, catx('|', of A--D, .));
  set SAMPLE;   
  MATCH = catx(',', ifc(put(A, 8. -l)=_VALUE, 'A', '')
                  , ifc(put(B, 8. -l)=_VALUE, 'B', '')
	              , ifc(put(C, 8. -l)=_VALUE, 'C', '')
	              , ifc(put(D, 8. -l)=_VALUE, 'D', '')
	           );
  drop _: ;
run;

or

data WANT;
  set SAMPLE;   
  call sort(of A--D); 
  _VALUE = input(prxchange('s/.*((\d+)\|)\1.*/\2/', 1, catx('|', of A--D, .)), ?? 8.);
  set SAMPLE;   
  MATCH = catx(',', ifc(A=_VALUE, 'A', '')
                  , ifc(B=_VALUE, 'B', '')
                  , ifc(C=_VALUE, 'C', '')
                  , ifc(D=_VALUE, 'D', '')
              );
   drop _: ;
run;

High-Performance SAS Coding - Third Edition

s_lassen

Assuming data like this (I added the ROW_ID variable to identify rows):

data have;
  row_id=_N_;
  input ID_A ID_B ID_C ID_D;
cards;
1 1 1 1
2 2 2 3
3 4 4 4
1 1 3 5
2 2 6 6
1 2 3 4
;run;

I would first convert to a long format and sort:

proc transpose data=have out=long;
  var ID_:;
  by row_id;
run;

proc sort data=long;
  by row_id col1;
run;

Then it is easy to find the matches:

data match;
  do until(last.col1);
    set long;
    by row_id col1;
    length match $10;
    call catx(',',match,substr(_name_,4));
    end;
  if index(match,','); /* not single values */
  keep row_id match;
run;

And finally merge with the original data:

data want;
  merge have match;
  by row_id;
run;

The answer to @Kurt_Bremser 's question is in this case that if there are two matching pairs, we will create an output row for each.

How can I create a variable indicating which values are the same?

Re: How can I create a variable indicating which values are the same?

Re: How can I create a variable indicating which values are the same?

Re: How can I create a variable indicating which values are the same?

Re: How can I create a variable indicating which values are the same?

Re: How can I create a variable indicating which values are the same?

How can I create a variable indicating which values are the same?

Re: How can I create a variable indicating which values are the same?

Re: How can I create a variable indicating which values are the same?

Re: How can I create a variable indicating which values are the same?

Re: How can I create a variable indicating which values are the same?

Re: How can I create a variable indicating which values are the same?

Click image to register for webinar

Classroom Training Available!