Hi Folks:
I have a de-identified dataset with no patient identifier, so I cannot tell how many times each patient was hospitalized. Therefore, I'd like to understand how many patients/data rows share the same birth_y, birth_m, sex, zip, discharge_y and discharge_m. My research question concerns a rapidly fatal (maximum survival time ~6 months), rare (~8 per 100,000 people) disease, so one could relatively safely assume that two individuals diagnosed with this condition are unlikely to share the same birth_y, birth_m, sex, zip, discharge_y and discharge_m. This gives me hope that I could create a synthetic unique individual identifier based on these variables. I'm aware that proc sort with the nodupkey option, with these variables listed on the BY statement, would de-duplicate the data. But I need to assess the reliability of this assumption before I get to the point of de-duplication.
Do you know how to create a unique identifier based on multiple variables?
A patient hospitalized more than once would have different discharge_y and discharge_m values across admissions, but I can deal with that later, after this initial screening.
Thanks in advance for your time.
See mock data below, if that helps.
data have;
input birth_y birth_m sex zip discharge_year discharge_month;
cards;
1980 2 1 12202 1991 3
1982 2 1 12202 1991 3
1970 6 2 12307 1971 8
1965 7 2 12907 1968 9
;
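To make the request concrete, here is a minimal sketch of the screening I have in mind, assuming the mock dataset have above and its variable names (the discharge variables are called discharge_year and discharge_month there). The names key_counts, n_rows, with_id and patient_id are made up for illustration, and this is only one possible approach, not necessarily the best one.

```sas
/* Step 1: count how many rows share each combination of the six key variables */
proc sql;
  create table key_counts as
  select birth_y, birth_m, sex, zip, discharge_year, discharge_month,
         count(*) as n_rows
  from have
  group by birth_y, birth_m, sex, zip, discharge_year, discharge_month;
quit;

/* Step 2: distribution of group sizes; n_rows = 1 means the combination is unique */
proc freq data=key_counts;
  tables n_rows;
run;

/* Step 3: give every distinct combination a sequential ID (hypothetical patient_id) */
proc sort data=have out=with_id;
  by birth_y birth_m sex zip discharge_year discharge_month;
run;

data with_id;
  set with_id;
  by birth_y birth_m sex zip discharge_year discharge_month;
  /* first.discharge_month is 1 on the first row of each new key combination */
  if first.discharge_month then patient_id + 1;
run;
```

If nearly all combinations in key_counts have n_rows = 1, the uniqueness assumption looks reasonable for this data. Because the discharge variables are part of the key, repeat admissions of the same patient in different months would still receive different patient_id values.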