Hi Folks:
I have an unidentified data (no data on how many times each patient was hospitalized). Therefore, I'd like to understand the extent of patients/data rows that share the same birth_y, birth_m, sex, zip, discharge_y and discharge_m. Given my research question is concerned with a rapidly fatal (max survival time ~ 6month) rare disease (~8 per 100,000 people), one could relatively safely assume that it's unlikely that the two individuals diagnosed with this rare medical condition is to occur to have the same birth_y, birth_m, sex, zip, discharge_y and discharge_m. This gives me a hope that I could create a synthetic unique individual identifier based on these variables. I'm aware of proc sort nodupkey by listing these variables to de-duplicate the data. But I have to assess the reliability of this assumption before I get to the point of de-duplication.
Do you know how to create unique identifier based on the multiple variables?
A patient hospitalized twice a year could take different discharge_y and discharge_m. But this could be solved later based on this initial screening.
Thanks for your time in advance.
See mock data below, if that helps.
data have;
input birth_y birth_m sex zip discharge_year discharge_month;
cards;
1980 2 1 12202 1991 3
1982 2 1 12202 1991 3
1970 6 2 12307 1971 8
1965 7 2 12907 1968 9
;
Hi @Cruise,
As a start, I would count combinations like this:
proc freq data=have noprint;
tables birth_y*birth_m*sex*zip / out=cnt;
run;
proc freq data=cnt;
tables count;
run;
This will give you the numbers of unique, duplicate, triplicate, ... combinations of birth_y, birth_m, sex and zip. Ideally, it turns out that 100% of these combinations are unique. Otherwise, add the discharge year and month to the first TABLES specification (...*discharge_year*discharge_month) and rerun the two PROC FREQ steps. Then you can decide whether discharge year and month need to be included in concatenations (as have been suggested) or other methods to possibly obtain a unique identifier.
Would something like this work?
Hi @Cruise,
As a start, I would count combinations like this:
proc freq data=have noprint;
tables birth_y*birth_m*sex*zip / out=cnt;
run;
proc freq data=cnt;
tables count;
run;
This will give you the numbers of unique, duplicate, triplicate, ... combinations of birth_y, birth_m, sex and zip. Ideally, it turns out that 100% of these combinations are unique. Otherwise, add the discharge year and month to the first TABLES specification (...*discharge_year*discharge_month) and rerun the two PROC FREQ steps. Then you can decide whether discharge year and month need to be included in concatenations (as have been suggested) or other methods to possibly obtain a unique identifier.
Thanks a lot Reinhard!
I tried multiple comb of variables. However, it appears that de-duplicating my hospital discharge data by the two separate comb of 6 and 5 variables brings the N of unique records closer to the Reference N (incidence data from registry a.k.a gold-standard). PFI=permanent facility ID. dis_y=discharge year. I'm inclined to use 5-variable combo because patients could use multiple facilities. Pls, let me know if any comments.
/*6 variables: birth_y birth_m sex zip dis_y pfi*/
proc freq data=have noprint;
tables birth_y*birth_m*sex*zip*dis_y*pfi/ out=cnt6var;
where DX1_ID=1;
run;
proc freq data=cnt6var;
tables count;
run;
proc sort data=have nodupkey out=dedup6var; /*34,320 vs 31,541*/
by birth_y birth_m sex zip dis_y pfi;
where DX1_ID=1;
run;
/*5 variables: birth_y birth_m sex zip dis_y*/
proc freq data=have noprint;
tables birth_y*birth_m*sex*zip*dis_y/ out=cnt5var;
run;
proc freq data=cnt5var;
tables count;
run;
proc sort data=have nodupkey out=dedup5var; /*31,448 vs 31,541*/
by birth_y birth_m sex zip dis_y;
run;
@Cruise wrote:
I tried multiple comb of variables. However, it appears that de-duplicating my hospital discharge data by the two separate comb of 6 and 5 variables brings the N of unique records closer to the Reference N (incidence data from registry a.k.a gold-standard). PFI=permanent facility ID. dis_y=discharge year. I'm inclined to use 5-variable combo because patients could use multiple facilities. Pls, let me know if any comments.
Sounds reasonable to me (without background knowledge). Also given that the 31,448 combinations of birth_y, birth_m, sex, zip and dis_y are within 0.5% of the "Reference N," I would assume that these five variables form a suitable identifier.
You can create a unique ID either by concatenation of all selected variables, in a fixed character format and length or by assigning a sequential number:
/* option 1 */
data want;
set have;
ID = cats(birth_y, birth_m, sex, ..... );
run;
/*option 2 */
data want;
set have;
by birth_y birth_m sex .....; /* if need sort data have */
retain ID 0;
if first.<last variable in BY list> then ID+1;
run;
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.