Identifying similar ID numbers

sophia_SAS · Posted 03-13-2012 08:19 AM

Hi SAS experts,

Please advise on a SAS procedure for a large dataset that will let me identify subjects whose ID numbers are similar but not identical (i.e., the IDs match on every digit except the last 2).

For example, a study has the following 8 subject ID numbers:

888709

234294

888710

098762

546849

888721

234276

888733

The SAS procedure should be able to identify the following matched groups:

Group 1 -- 888709, 888710, 888721, 888733 (same 8887 string)

Group 2 -- 234294, 234276 (same 2342 string)

ID numbers 098762, 546849 do not have matches.

Thanks,

SS

Reeza · Posted 03-13-2012 09:46 AM

Assuming ID numbers are character:

*Create the group of 4 characters;

data want;

set have;

first_four=substr(id, 1, 4);

run;

*sort it by the group;

proc sort data=want; by first_four; run;

*Identify each group uniquely;

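data group;
  set want;
  by first_four;  /* BY is required; it creates first.first_four */
  retain group 0;
  /* start a new group number at the first row of each prefix */
  if first.first_four then group+1;
run;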
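The steps above can be sketched end to end on the eight IDs from the question. This is only an illustration under some assumptions: the dataset names flagged and the variable has_match are made up here, and the IDs are read as character so that the leading zero in 098762 is kept. The has_match flag is one possible way to mark IDs that have no partner.

```sas
/* Sketch only: flagged and has_match are illustrative names */
data have;
  /* read as character so the leading zero in 098762 is preserved */
  input id $6.;
  datalines;
888709
234294
888710
098762
546849
888721
234276
888733
;
run;

/* first four characters of each ID */
data want;
  set have;
  first_four = substr(id, 1, 4);
run;

proc sort data=want;
  by first_four id;
run;

data flagged;
  set want;
  by first_four;
  /* new group number at the first row of each prefix */
  if first.first_four then group + 1;
  /* a prefix with a single row is both first and last: no match */
  has_match = not (first.first_four and last.first_four);
run;

proc print data=flagged;
run;
```

Group numbers follow the sorted prefix order, so the 8887 IDs do not end up as group 1 as in the question. IDs 098762 and 546849 are the only rows in their prefix and get has_match = 0.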

sophia_SAS · Posted 03-13-2012 09:59 AM

Thanks Reeza. I'm a bit confused by the last lines of the code.

I can't figure out how the last block of code assigns the grouped (matched?) values:

data group;

set want;

retain group 0;

if first.first_four then group+1;

else group;

run;

Thanks.

Linlin · Posted 03-13-2012 10:17 AM

Is this example helpful?

data have;

input id $ @@;

cards;

a b c d a b c d d e

;

proc sort;

by id;

proc print;

run;

data grouped;

set have;

by id;

if first.id then group+1;

run;

proc print;

run;

Obs    id    group
  1    a       1
  2    a       1
  3    b       2
  4    b       2
  5    c       3
  6    c       3
  7    d       4
  8    d       4
  9    d       4
 10    e       5
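To see why the counter only moves at group boundaries, it can help to copy the automatic first.id and last.id variables into ordinary variables, since SAS does not write them to the output dataset. A minimal sketch on the same toy data follows; the names inspect, is_first and is_last are made up for this illustration.

```sas
/* Sketch only: inspect, is_first and is_last are illustrative names */
data have;
  input id $ @@;
  cards;
a b c d a b c d d e
;
run;

proc sort data=have;
  by id;
run;

data inspect;
  set have;
  by id;
  /* first.id and last.id exist only during the step, so copy them */
  is_first = first.id;
  is_last  = last.id;
  /* the sum statement keeps group across rows and starts it at 0 */
  if first.id then group + 1;
run;

proc print data=inspect;
run;
```

In the printed result, is_first is 1 only on the first row of each id value, which is exactly where group increases. The single e row has both is_first and is_last equal to 1, which is how an unmatched value can be recognized.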