Solved: Combining duplicate id's based on different variables

nharuka · Posted 06-28-2022 07:47 PM

I have multiple observations per person. What I'm trying to do is if COMPDT = '.' for all rows for that same ID, then I just want to keep one of the obs(they're duplicates in this case). However, if COMPFL='Y' then I want ESDT=COMPDT and ETDT=ETDT where

DSSEQ =1 for that id. How can I do this?

Current data

ID    DSSEQ    ESDT         ETDT          COMPFL   COMPDT
001     1      01Jan2016    01Jan2016       N        .
001     2      01Jan2016    01Jan2016       N        .
002     4      07JUL2014    07JUL2014       N        .  
002     2      13NOV2012    13NOV2012       Y      13NOV2012
002     3      06DEC2018    06DEC2018       N        .
002     1      14FEB2020    14FEB2020       N        .
003     2       25OCT2018   25OCT2018       N        .       
003     1       25OCT2018   25OCT2018       N        .

Wanted data

ID    DSSEQ    ESDT         ETDT          
001     1      01Jan2016    01Jan2016              
002     2      13NOV2012    14FEB2020            
003     1       25OCT2018   25OCT2018

mkeintz · Posted 06-29-2022 01:08 AM

Assuming that every ID has exactly one DSSEQ=1 record, and that any ID which has a non-missing COMPDT has only one such record, then a self-MERGE with a selective KEEP parameter will work just fine:

data want (drop=compfl compdt);
  merge have (where=(dsseq=1))
        have (keep=id dsseq compdt where=(compdt^=.));
  by id;
  if compdt^=. then esdt=compdt;
run;

This also assume data are sorted by ID.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

mkeintz · Posted 06-29-2022 01:08 AM

Assuming that every ID has exactly one DSSEQ=1 record, and that any ID which has a non-missing COMPDT has only one such record, then a self-MERGE with a selective KEEP parameter will work just fine:

data want (drop=compfl compdt);
  merge have (where=(dsseq=1))
        have (keep=id dsseq compdt where=(compdt^=.));
  by id;
  if compdt^=. then esdt=compdt;
run;

This also assume data are sorted by ID.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Combining duplicate id's based on different variables

Re: Combining duplicate id's based on different variables

Re: Combining duplicate id's based on different variables

Combining duplicate id's based on different variables

Re: Combining duplicate id's based on different variables

Re: Combining duplicate id's based on different variables

SAS Innovate 2025: Call for Content

Click image to register for webinar

Classroom Training Available!