BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Cruise
Ammonite | Level 13

Hi Folks:

 

    I have an unidentified data (no data on how many times each patient was hospitalized). Therefore, I'd like to understand the extent of patients/data rows that share the same birth_y, birth_m, sex, zip, discharge_y and discharge_m. Given my research question is concerned with a rapidly fatal (max survival time ~ 6month) rare disease (~8 per 100,000 people), one could relatively safely assume that it's unlikely that the two individuals diagnosed with this rare medical condition is to occur to have the same birth_y, birth_m, sex, zip, discharge_y and discharge_m. This gives me a hope that I could create a synthetic unique individual identifier based on these variables. I'm aware of proc sort nodupkey by listing these variables to de-duplicate the data. But I have to assess the reliability of this assumption before I get to the point of de-duplication. 


    Do you know how to create unique identifier based on the multiple variables?

A patient hospitalized twice a year could take different discharge_y and discharge_m. But this could be solved later based on this initial screening.

 

Thanks for your time in advance.

See mock data below, if that helps.

 

data have; 
input birth_y birth_m sex zip discharge_year discharge_month; 
cards;
1980 2 1 12202 1991 3
1982 2 1 12202 1991 3
1970 6 2 12307 1971 8
1965 7 2 12907 1968 9
;

 

1 ACCEPTED SOLUTION

Accepted Solutions
FreelanceReinh
Jade | Level 19

Hi @Cruise,

 

As a start, I would count combinations like this:

proc freq data=have noprint;
tables birth_y*birth_m*sex*zip / out=cnt;
run;

proc freq data=cnt;
tables count;
run;

This will give you the numbers of unique, duplicate, triplicate, ... combinations of birth_y, birth_m, sex and zip. Ideally, it turns out that 100% of these combinations are unique. Otherwise, add the discharge year and month to the first TABLES specification (...*discharge_year*discharge_month) and rerun the two PROC FREQ steps. Then you can decide whether discharge year and month need to be included in concatenations (as have been suggested) or other methods to possibly obtain a unique identifier.

View solution in original post

9 REPLIES 9
JerryV
SAS Employee

Would something like this work?

Cruise
Ammonite | Level 13
It worked no problem. I'll frequent this approach you shared. Thanks a lot.
FreelanceReinh
Jade | Level 19

Hi @Cruise,

 

As a start, I would count combinations like this:

proc freq data=have noprint;
tables birth_y*birth_m*sex*zip / out=cnt;
run;

proc freq data=cnt;
tables count;
run;

This will give you the numbers of unique, duplicate, triplicate, ... combinations of birth_y, birth_m, sex and zip. Ideally, it turns out that 100% of these combinations are unique. Otherwise, add the discharge year and month to the first TABLES specification (...*discharge_year*discharge_month) and rerun the two PROC FREQ steps. Then you can decide whether discharge year and month need to be included in concatenations (as have been suggested) or other methods to possibly obtain a unique identifier.

Cruise
Ammonite | Level 13

@FreelanceReinh 

Thanks a lot Reinhard! 

I tried multiple comb of variables. However, it appears that de-duplicating my hospital discharge data by the two separate comb of 6 and 5 variables brings the N of unique records closer to the Reference N (incidence data from registry a.k.a gold-standard). PFI=permanent facility ID. dis_y=discharge year. I'm inclined to use 5-variable combo because patients could use multiple facilities. Pls, let me know if any comments.  

 

proc freq tables.png

 

/*6 variables: birth_y birth_m sex zip dis_y pfi*/

proc freq data=have noprint;
tables birth_y*birth_m*sex*zip*dis_y*pfi/ out=cnt6var;
where DX1_ID=1; 
run;
proc freq data=cnt6var;
tables count;
run;
proc sort data=have nodupkey out=dedup6var; /*34,320 vs 31,541*/
by birth_y birth_m sex zip dis_y pfi;
where DX1_ID=1; 
run; 

/*5 variables: birth_y birth_m sex zip dis_y*/ 

proc freq data=have noprint;
tables birth_y*birth_m*sex*zip*dis_y/ out=cnt5var;
run;
proc freq data=cnt5var;
tables count;
run;
proc sort data=have nodupkey out=dedup5var; /*31,448 vs 31,541*/ 
by birth_y birth_m sex zip dis_y;
run;
FreelanceReinh
Jade | Level 19

@Cruise wrote:

I tried multiple comb of variables. However, it appears that de-duplicating my hospital discharge data by the two separate comb of 6 and 5 variables brings the N of unique records closer to the Reference N (incidence data from registry a.k.a gold-standard). PFI=permanent facility ID. dis_y=discharge year. I'm inclined to use 5-variable combo because patients could use multiple facilities. Pls, let me know if any comments. 


Sounds reasonable to me (without background knowledge). Also given that the 31,448 combinations of birth_y, birth_m, sex, zip and dis_y are within 0.5% of the "Reference N," I would assume that these five variables form a suitable identifier.

Cruise
Ammonite | Level 13
Thank you. I have to work out the cases with the same 4 variables (birth_y, birth_m, sex and ZIP) but different discharge years.
Shmuel
Garnet | Level 18

You can create a unique ID either by concatenation of all selected variables, in a fixed character format and length or by assigning a sequential number:

/* option 1 */
data want;
 set have;
       ID = cats(birth_y, birth_m, sex, ..... );
run;

/*option 2 */
data want;
 set have;
     by birth_y birth_m sex .....;   /* if need sort data have */
          retain ID 0;
          if first.<last variable in BY list> then ID+1;
run;
Cruise
Ammonite | Level 13
Thanks a lot for this intuitive approach ! It helps.
Cruise
Ammonite | Level 13
I ended up using Option #2.

SAS Innovate 2025: Call for Content

Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!

Submit your idea!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 9 replies
  • 2367 views
  • 4 likes
  • 4 in conversation