Solved: Assign Identifier accounting the end of the series of rows by the spec...

Cruise · Posted 03-21-2020 03:11 PM

Hi Folks:

I'm trying to assign ID to identify unique individual patients to the best of the information in hand. So far, data rows matching on the personal variables such as: birth year, birth month, sex, race, ethnicity and 5-digit ZIP take same ID. However, I noticed that some records had hospital admissions after 'death' registered. My question is how to account for 'death' in the process of assigning ID to the rows? Right now, i.e., it's not logical that ID=1 patient has hospital admission (DIS_Y) in 2004 after death (DISP=20) registered in 1997. After accounting for the death-logic, only first two rows would take the same ID. Same for ID=2 as his/her death happened (DISP=20) in 1997 this record takes an independent ID from the following two rows as shown in NEW_ID.

Thanks for your time in advance.

Please see below is the short mock data. More complete data can be found as an attachment to the post.

DATA BE_LOGICAL; 
input ID count birth_y birth_m SEX $ ZIP RACE ETHNIC DISP dis_y NEW_ID; 
cards;
1 3 1909 04 F 10002 02 1 06 1996 1 
1 3 1909 04 F 10002 02 1 20 1997 1 
1 3 1909 04 F 10002 02 1 06 2004 2 
2 3 1916 11 F 10458 02 2 20 1997 3 
2 3 1916 11 F 10458 02 2 01 2005 4 
2 3 1916 11 F 10458 02 2 01 2006 4 
3 4 1917 04 F 14450 01 2 20 1996 5 
3 4 1917 04 F 14450 01 2 03 1999 6 
3 4 1917 04 F 14450 01 2 01 2007 6 
3 4 1917 04 F 14450 01 2 03 2008 6 
4 3 1917 12 F 10025 02 2 05 1999 7
4 3 1917 12 F 10025 02 2 06 2002 7
4 3 1917 12 F 10025 02 2 20 2003 7 
5 3 1919 09 F 11561 01 2 06 1998 8 
5 3 1919 09 F 11561 01 2 06 1999 8
5 3 1919 09 F 11561 01 2 20 2006 8
6 4 1920 02 F 10023 01 2 01 1999 9
6 4 1920 02 F 10023 01 2 06 2000 9
6 4 1920 02 F 10023 01 2 20 2003 9
6 4 1920 02 F 10023 01 2 06 2006 10
;

proc print; run;
/*DISP=20 indicates death event, DIS_Y=discharge year*/

mkeintz · Posted 03-21-2020 07:33 PM

You need to have a way to increment ID whenever:

Some value in the variables birth_y birth_m SEX ZIP RACE ETHNIC change
Or there is no such change, but the preceding record was a death (DISP=20).

DATA BE_LOGICAL; 
input ID count birth_y birth_m SEX $ ZIP RACE ETHNIC DISP dis_y expected_NEW_ID; 
cards;
1 3 1909 04 F 10002 02 1 06 1996 1 
1 3 1909 04 F 10002 02 1 20 1997 1 
1 3 1909 04 F 10002 02 1 06 2004 2 
2 3 1916 11 F 10458 02 2 20 1997 3 
2 3 1916 11 F 10458 02 2 01 2005 4 
2 3 1916 11 F 10458 02 2 01 2006 4 
3 4 1917 04 F 14450 01 2 20 1996 5 
3 4 1917 04 F 14450 01 2 03 1999 6 
3 4 1917 04 F 14450 01 2 01 2007 6 
3 4 1917 04 F 14450 01 2 03 2008 6 
4 3 1917 12 F 10025 02 2 05 1999 7
4 3 1917 12 F 10025 02 2 06 2002 7
4 3 1917 12 F 10025 02 2 20 2003 7 
5 3 1919 09 F 11561 01 2 06 1998 8 
5 3 1919 09 F 11561 01 2 06 1999 8
5 3 1919 09 F 11561 01 2 20 2006 8
6 4 1920 02 F 10023 01 2 01 1999 9
6 4 1920 02 F 10023 01 2 06 2000 9
6 4 1920 02 F 10023 01 2 20 2003 9
6 4 1920 02 F 10023 01 2 06 2006 10
;
data want;
  set be_logical;
  by birth_y birth_m SEX ZIP RACE ETHNIC dis_y;
  if first.ethnic=1  or (first.ethnic=0 and lag(disp)=20) 
     then new_id+1;
run;

Note the test "if first.ethnic=1" sets the dummy var first.ethnic to 1 whenever ETHNIC or any by-variable to its left changes. BTW, the BY statement tells sas the expected sort order of the data. That's why I put DIS_Y in the by-list even though I don't do a test on first.dis_y. It's there merelyh to force sas to stop if data are not in chronological order.

The other half of the IF tests checks on whether the preceding record was a death even though the current record presents no changes in BIRTH_Y through ETHNIC.

I renamed your variables to expected_new_id, so you can compare it to the calculated new_id in this program.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

mkeintz · Posted 03-21-2020 03:24 PM

Here's a way to add an "id subscript" to each set of records that follow the most recent death (maybe you could have an ID with 3 sets of records encompassing 2 disp=20 death records:

DATA BE_LOGICAL; 
input ID count birth_y birth_m SEX $ ZIP RACE ETHNIC DISP dis_y; 
cards;
589 3 1909 04 F 10002 02 1 06 1996 
589 3 1909 04 F 10002 02 1 20 1997 
589 3 1909 04 F 10002 02 1 06 2004 
2938 3 1916 11 F 10458 02 2 20 1997 
2938 3 1916 11 F 10458 02 2 01 2005 
2938 3 1916 11 F 10458 02 2 01 2006 
3167 4 1917 04 F 14450 01 2 20 1996 
3167 4 1917 04 F 14450 01 2 03 1999 
3167 4 1917 04 F 14450 01 2 01 2007 
3167 4 1917 04 F 14450 01 2 03 2008 
3470 3 1917 12 F 10025 02 2 05 1999 
3470 3 1917 12 F 10025 02 2 06 2002 
3470 3 1917 12 F 10025 02 2 20 2003 
4383 3 1919 09 F 11561 01 2 06 1998 
4383 3 1919 09 F 11561 01 2 06 1999 
4383 3 1919 09 F 11561 01 2 20 2006 
4612 4 1920 02 F 10023 01 2 01 1999 
4612 4 1920 02 F 10023 01 2 06 2000 
4612 4 1920 02 F 10023 01 2 20 2003 
4612 4 1920 02 F 10023 01 2 06 2006 
;

data want (drop=id_subscript);
  set be_logical;
  by id  dis_y;

  if first.id then id_subscript=0;
  id=id+id_subscript;
  output;
  if disp=20 then id_subscript+ 0.1;
run;

Note this assumes records sorted by ID/DIS_Y.

ID is always adjusted by the id_subscript value (initialized to zero for each starting id)
The OUTPUT has to take place prior to adjusting the subscript an account of a death record. The death record should be the last record for a given adjusted id.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Cruise · Posted 03-21-2020 03:35 PM

Thank you so much. I used +1 instead +0.1 to keep my ID integer. Do you have any objection to that?

Cruise · Posted 03-21-2020 04:03 PM

@mkeintz

please ignore my previous comment. If I understood you approach correctly, right now, subscript IDs are marking where death is not accounted in the assignment of ID. I wonder how to assign ID through 'death' (DISP=20) and stops there to assign new sequential ID to the next record(s).

mkeintz · Posted 03-21-2020 05:11 PM

I was just showing a way where likely mis-matched records can be kept in an identifiable cluster for possibly re-assignment. And the program started with your preliminary results assigning a constant id to a cluster of loosely matched records.

After all consider two people (person A and person B) have the same birth year, birth month, sex, race, ethnicity and 5-digit ZIP, i.e. a single ID in your starting example. Now what if both A and B have records over the first 4 years, and then B dies in year 5 followed by a number of records belonging exclusively to A? This program will mis-assign some of the early records to B.

But by incrementing only by 0.1 instead of by 1.0, you can investigate all instances in which there are mutliple "sub-clusters" belonging to the same primary cluster, just by keeping all records with the same integer component, but varying "id subscript".

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Cruise · Posted 03-21-2020 06:45 PM

@mkeintz

Got it. I edited my original post a little bit. Would you please help me create NEW_ID variable shown in my edited post? I'm in a steep learning curve. If possible, only if possible, please let me know your approach to NEW_ID assuming that accounting for 'death' events in the process of assigning ID was the only wrinkle to worry about.

https://communities.sas.com/t5/SAS-Programming/Assign-Identifier-accounting-the-end-of-the-series-of...

mkeintz · Posted 03-21-2020 07:33 PM

You need to have a way to increment ID whenever:

Some value in the variables birth_y birth_m SEX ZIP RACE ETHNIC change
Or there is no such change, but the preceding record was a death (DISP=20).

DATA BE_LOGICAL; 
input ID count birth_y birth_m SEX $ ZIP RACE ETHNIC DISP dis_y expected_NEW_ID; 
cards;
1 3 1909 04 F 10002 02 1 06 1996 1 
1 3 1909 04 F 10002 02 1 20 1997 1 
1 3 1909 04 F 10002 02 1 06 2004 2 
2 3 1916 11 F 10458 02 2 20 1997 3 
2 3 1916 11 F 10458 02 2 01 2005 4 
2 3 1916 11 F 10458 02 2 01 2006 4 
3 4 1917 04 F 14450 01 2 20 1996 5 
3 4 1917 04 F 14450 01 2 03 1999 6 
3 4 1917 04 F 14450 01 2 01 2007 6 
3 4 1917 04 F 14450 01 2 03 2008 6 
4 3 1917 12 F 10025 02 2 05 1999 7
4 3 1917 12 F 10025 02 2 06 2002 7
4 3 1917 12 F 10025 02 2 20 2003 7 
5 3 1919 09 F 11561 01 2 06 1998 8 
5 3 1919 09 F 11561 01 2 06 1999 8
5 3 1919 09 F 11561 01 2 20 2006 8
6 4 1920 02 F 10023 01 2 01 1999 9
6 4 1920 02 F 10023 01 2 06 2000 9
6 4 1920 02 F 10023 01 2 20 2003 9
6 4 1920 02 F 10023 01 2 06 2006 10
;
data want;
  set be_logical;
  by birth_y birth_m SEX ZIP RACE ETHNIC dis_y;
  if first.ethnic=1  or (first.ethnic=0 and lag(disp)=20) 
     then new_id+1;
run;

Note the test "if first.ethnic=1" sets the dummy var first.ethnic to 1 whenever ETHNIC or any by-variable to its left changes. BTW, the BY statement tells sas the expected sort order of the data. That's why I put DIS_Y in the by-list even though I don't do a test on first.dis_y. It's there merelyh to force sas to stop if data are not in chronological order.

The other half of the IF tests checks on whether the preceding record was a death even though the current record presents no changes in BIRTH_Y through ETHNIC.

I renamed your variables to expected_new_id, so you can compare it to the calculated new_id in this program.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Cruise · Posted 03-23-2020 08:20 AM

I follow the logic. I appreciate how you took advantage of 'lag' function here. Amazing.

Assign Identifier accounting the end of the series of rows by the specific event : death

Re: Assign Identifier accounting the end of the series of rows by the specific event : death

Re: Assign Identifier accounting the end of the series of rows by the specific event : death

Re: Assign Identifier accounting the end of the series of rows by the specific event : death

Re: Assign Identifier accounting the end of the series of rows by the specific event : death

Re: Assign Identifier accounting the end of the series of rows by the specific event : death

Re: Assign Identifier accounting the end of the series of rows by the specific event : death

Re: Assign Identifier accounting the end of the series of rows by the specific event : death

Re: Assign Identifier accounting the end of the series of rows by the specific event : death

Catch up on SAS Innovate 2026

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away