Solved: Re: Assign unique identifier by multiple variables

Cruise · Posted 03-18-2020 02:18 PM

Hi Folks:

I have an unidentified data (no data on how many times each patient was hospitalized). Therefore, I'd like to understand the extent of patients/data rows that share the same birth_y, birth_m, sex, zip, discharge_y and discharge_m. Given my research question is concerned with a rapidly fatal (max survival time ~ 6month) rare disease (~8 per 100,000 people), one could relatively safely assume that it's unlikely that the two individuals diagnosed with this rare medical condition is to occur to have the same birth_y, birth_m, sex, zip, discharge_y and discharge_m. This gives me a hope that I could create a synthetic unique individual identifier based on these variables. I'm aware of proc sort nodupkey by listing these variables to de-duplicate the data. But I have to assess the reliability of this assumption before I get to the point of de-duplication.

Do you know how to create unique identifier based on the multiple variables?

A patient hospitalized twice a year could take different discharge_y and discharge_m. But this could be solved later based on this initial screening.

Thanks for your time in advance.

See mock data below, if that helps.

data have; 
input birth_y birth_m sex zip discharge_year discharge_month; 
cards;
1980 2 1 12202 1991 3
1982 2 1 12202 1991 3
1970 6 2 12307 1971 8
1965 7 2 12907 1968 9
;

FreelanceReinh · Posted 03-18-2020 03:58 PM

Hi @Cruise,

As a start, I would count combinations like this:

proc freq data=have noprint;
tables birth_y*birth_m*sex*zip / out=cnt;
run;

proc freq data=cnt;
tables count;
run;

This will give you the numbers of unique, duplicate, triplicate, ... combinations of birth_y, birth_m, sex and zip. Ideally, it turns out that 100% of these combinations are unique. Otherwise, add the discharge year and month to the first TABLES specification (...*discharge_year*discharge_month) and rerun the two PROC FREQ steps. Then you can decide whether discharge year and month need to be included in concatenations (as have been suggested) or other methods to possibly obtain a unique identifier.

View solution in original post

JerryV · Posted 03-18-2020 03:26 PM

Would something like this work?

Cruise · Posted 03-18-2020 08:53 PM

It worked no problem. I'll frequent this approach you shared. Thanks a lot.

FreelanceReinh · Posted 03-18-2020 03:58 PM

Hi @Cruise,

As a start, I would count combinations like this:

proc freq data=have noprint;
tables birth_y*birth_m*sex*zip / out=cnt;
run;

proc freq data=cnt;
tables count;
run;

This will give you the numbers of unique, duplicate, triplicate, ... combinations of birth_y, birth_m, sex and zip. Ideally, it turns out that 100% of these combinations are unique. Otherwise, add the discharge year and month to the first TABLES specification (...*discharge_year*discharge_month) and rerun the two PROC FREQ steps. Then you can decide whether discharge year and month need to be included in concatenations (as have been suggested) or other methods to possibly obtain a unique identifier.

Cruise · Posted 03-18-2020 08:51 PM

@FreelanceReinh

Thanks a lot Reinhard!

I tried multiple comb of variables. However, it appears that de-duplicating my hospital discharge data by the two separate comb of 6 and 5 variables brings the N of unique records closer to the Reference N (incidence data from registry a.k.a gold-standard). PFI=permanent facility ID. dis_y=discharge year. I'm inclined to use 5-variable combo because patients could use multiple facilities. Pls, let me know if any comments.

/*6 variables: birth_y birth_m sex zip dis_y pfi*/

proc freq data=have noprint;
tables birth_y*birth_m*sex*zip*dis_y*pfi/ out=cnt6var;
where DX1_ID=1; 
run;
proc freq data=cnt6var;
tables count;
run;
proc sort data=have nodupkey out=dedup6var; /*34,320 vs 31,541*/
by birth_y birth_m sex zip dis_y pfi;
where DX1_ID=1; 
run; 

/*5 variables: birth_y birth_m sex zip dis_y*/ 

proc freq data=have noprint;
tables birth_y*birth_m*sex*zip*dis_y/ out=cnt5var;
run;
proc freq data=cnt5var;
tables count;
run;
proc sort data=have nodupkey out=dedup5var; /*31,448 vs 31,541*/ 
by birth_y birth_m sex zip dis_y;
run;

FreelanceReinh · Posted 03-19-2020 06:10 AM

@Cruise wrote:

I tried multiple comb of variables. However, it appears that de-duplicating my hospital discharge data by the two separate comb of 6 and 5 variables brings the N of unique records closer to the Reference N (incidence data from registry a.k.a gold-standard). PFI=permanent facility ID. dis_y=discharge year. I'm inclined to use 5-variable combo because patients could use multiple facilities. Pls, let me know if any comments.

Sounds reasonable to me (without background knowledge). Also given that the 31,448 combinations of birth_y, birth_m, sex, zip and dis_y are within 0.5% of the "Reference N," I would assume that these five variables form a suitable identifier.

Cruise · Posted 03-19-2020 09:27 AM

Thank you. I have to work out the cases with the same 4 variables (birth_y, birth_m, sex and ZIP) but different discharge years.

Shmuel · Posted 03-18-2020 04:00 PM

You can create a unique ID either by concatenation of all selected variables, in a fixed character format and length or by assigning a sequential number:

/* option 1 */
data want;
 set have;
       ID = cats(birth_y, birth_m, sex, ..... );
run;

/*option 2 */
data want;
 set have;
     by birth_y birth_m sex .....;   /* if need sort data have */
          retain ID 0;
          if first.<last variable in BY list> then ID+1;
run;

Cruise · Posted 03-18-2020 08:54 PM

Thanks a lot for this intuitive approach ! It helps.

Cruise · Posted 03-19-2020 07:13 PM

I ended up using Option #2.

Assign unique identifier by multiple variables

Re: Synthetic unique identifier in an un-identified longitudinal data

Re: Assign unique identifier by multiple variables

Re: Assign unique identifier by multiple variables

Re: Synthetic unique identifier in an un-identified longitudinal data

Re: Synthetic unique identifier in an un-identified longitudinal data

Re: Synthetic unique identifier in an un-identified longitudinal data

Re: Synthetic unique identifier in an un-identified longitudinal data

Re: Assign unique identifier by multiple variables

Re: Assign unique identifier by multiple variables

Re: Assign unique identifier by multiple variables

SAS Innovate 2025: Call for Content

Classroom Training Available!