03-27-2018 12:21 PM
I have a dataset with 65,000+ separate hospitalization records. I need to delete any entries that indicate a repeated hospital visit by the same person, so that the resulting file has only one entry per person (it will be used to match against another data file later in the project).
Each event in the dataset has a unique hospital ID, and I do not have access to SSN. Therefore, I will depend on a combination of first/last name and DOB to identify repeated admissions. Is there a straightforward way to do this? Thanks!
03-27-2018 12:30 PM
PROC SORT with the NODUPKEY option, using first/last name and DOB as the BY variables.
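A minimal sketch of deduplicating with PROC SORT and NODUPKEY, which keeps the first record in each BY group. The dataset name (table1), the variable names (first_name, last_name, birth_date), and the output names are assumptions for illustration, not the poster's actual names:

```sas
/* Assumed names: table1, first_name, last_name, birth_date (hypothetical). */
/* OUT= writes to a new dataset so the original discharge data is not overwritten. */
/* DUPOUT= captures the records that were dropped, so they can be reviewed. */
proc sort data=table1
          out=one_per_person
          nodupkey
          dupout=repeat_visits;
    by last_name first_name birth_date;
run;
```

Note that NODUPKEY compares BY values exactly, so differences in case or spelling of a name would be treated as different people.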
However, I would be very cautious with this. Removing repeats is a strange request for health care data analysis. Usually the repeated records are summarized in some manner, i.e. by counting the number of admissions, the number of 30-day readmissions, and other metrics; a straight delete seems dangerous. This comes from almost a decade of working with health data.
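As a rough sketch of the "summarize instead of delete" idea, the query below counts admissions per person rather than dropping repeats. The names table1, first_name, last_name, and birth_date are assumed for illustration, and 30-day readmission is not computed here because it would need admission and discharge date variables not described in this thread:

```sas
/* Assumed names: table1, first_name, last_name, birth_date (hypothetical). */
/* One output row per person, with the number of hospitalization records. */
proc sql;
    create table admissions_per_person as
    select first_name,
           last_name,
           birth_date,
           count(*) as n_admissions
    from table1
    group by first_name, last_name, birth_date;
quit;
```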
03-27-2018 12:37 PM
For this task I just need a list of anyone who has had at least one hospitalization in my original discharge dataset, for linkage purposes. The full set of hospitalizations will be used for any analysis.
03-27-2018 12:38 PM
If you only need the IDs, then use something like:
proc sql;
    create table id_list as
    select distinct first_name, last_name, birth_date, sex
    from table1;
quit;