Hi Folks:
I'm trying to assign ID to identify unique individual patients to the best of the information in hand. So far, data rows matching on the personal variables such as: birth year, birth month, sex, race, ethnicity and 5-digit ZIP take same ID. However, I noticed that some records had hospital admissions after 'death' registered. My question is how to account for 'death' in the process of assigning ID to the rows? Right now, i.e., it's not logical that ID=1 patient has hospital admission (DIS_Y) in 2004 after death (DISP=20) registered in 1997. After accounting for the death-logic, only first two rows would take the same ID. Same for ID=2 as his/her death happened (DISP=20) in 1997 this record takes an independent ID from the following two rows as shown in NEW_ID.
Thanks for your time in advance.
Please see below is the short mock data. More complete data can be found as an attachment to the post.
DATA BE_LOGICAL;
input ID count birth_y birth_m SEX $ ZIP RACE ETHNIC DISP dis_y NEW_ID;
cards;
1 3 1909 04 F 10002 02 1 06 1996 1
1 3 1909 04 F 10002 02 1 20 1997 1
1 3 1909 04 F 10002 02 1 06 2004 2
2 3 1916 11 F 10458 02 2 20 1997 3
2 3 1916 11 F 10458 02 2 01 2005 4
2 3 1916 11 F 10458 02 2 01 2006 4
3 4 1917 04 F 14450 01 2 20 1996 5
3 4 1917 04 F 14450 01 2 03 1999 6
3 4 1917 04 F 14450 01 2 01 2007 6
3 4 1917 04 F 14450 01 2 03 2008 6
4 3 1917 12 F 10025 02 2 05 1999 7
4 3 1917 12 F 10025 02 2 06 2002 7
4 3 1917 12 F 10025 02 2 20 2003 7
5 3 1919 09 F 11561 01 2 06 1998 8
5 3 1919 09 F 11561 01 2 06 1999 8
5 3 1919 09 F 11561 01 2 20 2006 8
6 4 1920 02 F 10023 01 2 01 1999 9
6 4 1920 02 F 10023 01 2 06 2000 9
6 4 1920 02 F 10023 01 2 20 2003 9
6 4 1920 02 F 10023 01 2 06 2006 10
;
proc print; run;
/*DISP=20 indicates death event, DIS_Y=discharge year*/
You need to have a way to increment ID whenever:
DATA BE_LOGICAL;
input ID count birth_y birth_m SEX $ ZIP RACE ETHNIC DISP dis_y expected_NEW_ID;
cards;
1 3 1909 04 F 10002 02 1 06 1996 1
1 3 1909 04 F 10002 02 1 20 1997 1
1 3 1909 04 F 10002 02 1 06 2004 2
2 3 1916 11 F 10458 02 2 20 1997 3
2 3 1916 11 F 10458 02 2 01 2005 4
2 3 1916 11 F 10458 02 2 01 2006 4
3 4 1917 04 F 14450 01 2 20 1996 5
3 4 1917 04 F 14450 01 2 03 1999 6
3 4 1917 04 F 14450 01 2 01 2007 6
3 4 1917 04 F 14450 01 2 03 2008 6
4 3 1917 12 F 10025 02 2 05 1999 7
4 3 1917 12 F 10025 02 2 06 2002 7
4 3 1917 12 F 10025 02 2 20 2003 7
5 3 1919 09 F 11561 01 2 06 1998 8
5 3 1919 09 F 11561 01 2 06 1999 8
5 3 1919 09 F 11561 01 2 20 2006 8
6 4 1920 02 F 10023 01 2 01 1999 9
6 4 1920 02 F 10023 01 2 06 2000 9
6 4 1920 02 F 10023 01 2 20 2003 9
6 4 1920 02 F 10023 01 2 06 2006 10
;
data want;
set be_logical;
by birth_y birth_m SEX ZIP RACE ETHNIC dis_y;
if first.ethnic=1 or (first.ethnic=0 and lag(disp)=20)
then new_id+1;
run;
Note the test "if first.ethnic=1" sets the dummy var first.ethnic to 1 whenever ETHNIC or any by-variable to its left changes. BTW, the BY statement tells sas the expected sort order of the data. That's why I put DIS_Y in the by-list even though I don't do a test on first.dis_y. It's there merelyh to force sas to stop if data are not in chronological order.
The other half of the IF tests checks on whether the preceding record was a death even though the current record presents no changes in BIRTH_Y through ETHNIC.
I renamed your variables to expected_new_id, so you can compare it to the calculated new_id in this program.
Here's a way to add an "id subscript" to each set of records that follow the most recent death (maybe you could have an ID with 3 sets of records encompassing 2 disp=20 death records:
DATA BE_LOGICAL;
input ID count birth_y birth_m SEX $ ZIP RACE ETHNIC DISP dis_y;
cards;
589 3 1909 04 F 10002 02 1 06 1996
589 3 1909 04 F 10002 02 1 20 1997
589 3 1909 04 F 10002 02 1 06 2004
2938 3 1916 11 F 10458 02 2 20 1997
2938 3 1916 11 F 10458 02 2 01 2005
2938 3 1916 11 F 10458 02 2 01 2006
3167 4 1917 04 F 14450 01 2 20 1996
3167 4 1917 04 F 14450 01 2 03 1999
3167 4 1917 04 F 14450 01 2 01 2007
3167 4 1917 04 F 14450 01 2 03 2008
3470 3 1917 12 F 10025 02 2 05 1999
3470 3 1917 12 F 10025 02 2 06 2002
3470 3 1917 12 F 10025 02 2 20 2003
4383 3 1919 09 F 11561 01 2 06 1998
4383 3 1919 09 F 11561 01 2 06 1999
4383 3 1919 09 F 11561 01 2 20 2006
4612 4 1920 02 F 10023 01 2 01 1999
4612 4 1920 02 F 10023 01 2 06 2000
4612 4 1920 02 F 10023 01 2 20 2003
4612 4 1920 02 F 10023 01 2 06 2006
;
data want (drop=id_subscript);
set be_logical;
by id dis_y;
if first.id then id_subscript=0;
id=id+id_subscript;
output;
if disp=20 then id_subscript+ 0.1;
run;
Note this assumes records sorted by ID/DIS_Y.
Thank you so much. I used +1 instead +0.1 to keep my ID integer. Do you have any objection to that?
please ignore my previous comment. If I understood you approach correctly, right now, subscript IDs are marking where death is not accounted in the assignment of ID. I wonder how to assign ID through 'death' (DISP=20) and stops there to assign new sequential ID to the next record(s).
I was just showing a way where likely mis-matched records can be kept in an identifiable cluster for possibly re-assignment. And the program started with your preliminary results assigning a constant id to a cluster of loosely matched records.
After all consider two people (person A and person B) have the same birth year, birth month, sex, race, ethnicity and 5-digit ZIP, i.e. a single ID in your starting example. Now what if both A and B have records over the first 4 years, and then B dies in year 5 followed by a number of records belonging exclusively to A? This program will mis-assign some of the early records to B.
But by incrementing only by 0.1 instead of by 1.0, you can investigate all instances in which there are mutliple "sub-clusters" belonging to the same primary cluster, just by keeping all records with the same integer component, but varying "id subscript".
Got it. I edited my original post a little bit. Would you please help me create NEW_ID variable shown in my edited post? I'm in a steep learning curve. If possible, only if possible, please let me know your approach to NEW_ID assuming that accounting for 'death' events in the process of assigning ID was the only wrinkle to worry about.
You need to have a way to increment ID whenever:
DATA BE_LOGICAL;
input ID count birth_y birth_m SEX $ ZIP RACE ETHNIC DISP dis_y expected_NEW_ID;
cards;
1 3 1909 04 F 10002 02 1 06 1996 1
1 3 1909 04 F 10002 02 1 20 1997 1
1 3 1909 04 F 10002 02 1 06 2004 2
2 3 1916 11 F 10458 02 2 20 1997 3
2 3 1916 11 F 10458 02 2 01 2005 4
2 3 1916 11 F 10458 02 2 01 2006 4
3 4 1917 04 F 14450 01 2 20 1996 5
3 4 1917 04 F 14450 01 2 03 1999 6
3 4 1917 04 F 14450 01 2 01 2007 6
3 4 1917 04 F 14450 01 2 03 2008 6
4 3 1917 12 F 10025 02 2 05 1999 7
4 3 1917 12 F 10025 02 2 06 2002 7
4 3 1917 12 F 10025 02 2 20 2003 7
5 3 1919 09 F 11561 01 2 06 1998 8
5 3 1919 09 F 11561 01 2 06 1999 8
5 3 1919 09 F 11561 01 2 20 2006 8
6 4 1920 02 F 10023 01 2 01 1999 9
6 4 1920 02 F 10023 01 2 06 2000 9
6 4 1920 02 F 10023 01 2 20 2003 9
6 4 1920 02 F 10023 01 2 06 2006 10
;
data want;
set be_logical;
by birth_y birth_m SEX ZIP RACE ETHNIC dis_y;
if first.ethnic=1 or (first.ethnic=0 and lag(disp)=20)
then new_id+1;
run;
Note the test "if first.ethnic=1" sets the dummy var first.ethnic to 1 whenever ETHNIC or any by-variable to its left changes. BTW, the BY statement tells sas the expected sort order of the data. That's why I put DIS_Y in the by-list even though I don't do a test on first.dis_y. It's there merelyh to force sas to stop if data are not in chronological order.
The other half of the IF tests checks on whether the preceding record was a death even though the current record presents no changes in BIRTH_Y through ETHNIC.
I renamed your variables to expected_new_id, so you can compare it to the calculated new_id in this program.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.