Re: Merging datasets

Merging datasets

Re: Merging datasets

Re: Merging datasets

SAS Innovate 2025: Call for Content

Click image to register for webinar

Classroom Training Available!

takpdpb7 · Posted 05-20-2020 12:35 PM

Hello!

I am trying to merge two vital records datasets (that have maternal DOB and infant DOB) with a linelist of individuals (both male & female) who have a certain disease that have a unique ID #. Within the line-list of individuals, there is a variable for pregnancy status.

I am trying to ascertain pregnancy status by matching women on maternal DOB & name.

I have a code that I ran, but for some reason, it is repeating the same unique ID number for multiple individuals and the unique linelist has name repeating.

libname vitals "PATH";

data births2018 (keep=SFN_NUM CHILD_GNAME CHILD_LNAME IDOB MOTHER_GNAME MOTHER_LNAME MOTHER_LNAME_PRIOR MDOB RES_CITY YEAR);
set VITALS 2018;
run;

data births2019 (keep=SFN_NUM CHILD_GNAME CHILD_LNAME IDOB MOTHER_GNAME MOTHER_LNAME MOTHER_LNAME_PRIOR MDOB RES_CITY YEAR);
set VITALS2019;
run;

data births2018_2019 (drop=MDOB);
set births2018 births2019;
date = input(MDOB,mmddyy10.);
run;

data allbirths (drop=date RES_CITY);
length MOTHER_GNAME MOTHER_LNAME $13;
set births2018_2019;
MDOB=date;
VitalsDOB=date;
VitalsFN=MOTHER_GNAME;
VitalsLN=MOTHER_LNAME;
Vitalscity=RES_CITY; 
run;

proc import out=BAB 
datafile="PATH"
dbms=excel
replace;
run;

proc contents data=BAB; run; 

*this removes male cases*;
data BAB1 (keep= CaseID GENDER  Disease_Classification_Status Disease MOTHER_GNAME MOTHER_LNAME MDOB PREGNANCY_STATUS SYMPTOM_ONSET_DATE Event_Date OFFICIAL_CITY);
retain CaseID  GENDER Disease_Classification_Status Disease MOTHER_GNAME MOTHER_LNAME MDOB OFFICIAL_CITY SYMPTOM_ONSET_DATE Event_Date PREGNANCY_STATUS;
set BAB;
MDOB=BIRTH_DATE;
MOTHER_GNAME=First_Name;
MOTHER_LNAME=Last_Name;

MOTHER_GNAME=upcase(MOTHER_GNAME); 
MOTHER_LNAME=upcase(MOTHER_LNAME); 
if GENDER='Male' then delete;
run;

data BAB2 (drop= OFFICIAL_CITY GENDER) ;
set BAB1;
MavenFN=MOTHER_GNAME;
MavenLN=MOTHER_LNAME;
MavenCity=OFFICIAL_CITY;
MavenDOB=MDOB;
run;

proc sort data=BAB2; by MOTHER_GNAME MOTHER_LNAME MDOB;run;
proc sort data=allbirths; by MOTHER_GNAME MOTHER_LNAME MDOB; run;

*Match based on First Name, Last name and DOB; 
data Names;
merge BAB2 allbirths;
by MOTHER_GNAME MOTHER_LNAME MDOB;
if CaseID="" then delete;
if SFN_NUM=" " then delete;
run;

proc sort data=BAB2; by MOTHER_GNAME MOTHER_LNAME MDOB;run;
proc sort data=allbirths; by MOTHER_GNAME MOTHER_LNAME MDOB; run;

*Match based on First AND Last name; 
data Names;
merge BAB2 allbirths;
by MOTHER_GNAME MOTHER_LNAME MDOB;
if CaseID="" then delete;
if SFN_NUM=" " then delete;
run;

proc sort data=BAB2; by MDOB MOTHER_GNAME;run;
proc sort data=allbirths; by MDOB MOTHER_GNAME; run;

*Match based on DOB and First Name;
data together1;
merge BAB2 allbirths;
by MDOB MOTHER_GNAME;
if CaseID="" then delete;
if SFN_NUM=" " then delete;
run;

proc sort data=BAB2; by MDOB MOTHER_LNAME;run;
proc sort data=allbirths; by MDOB MOTHER_LNAME; run;

*Match based on DOB and Last Name; 
data together2; 
merge BAB2 allbirths;
by MDOB MOTHER_LNAME;
if CaseID="" then delete;
if SFN_NUM=" " then delete;
run;

proc sort data=BAB2; by MDOB;run;
proc sort data=allbirths; by MDOB; run;

*Match based on DOB and Mother's Prior Last Name; 
data DOB; 
merge BAB2 allbirths;
by MDOB;
if MOTHER_LNAME ne MOTHER_LNAME_PRIOR then delete;
if CaseID="" then delete;
if SFN_NUM=" " then delete;
run;

data final1;
set Names together1 together2 DOB;
run;

*There are a few individuals who had multiple pregnancy outcomes*;
data final2 (Drop= MOTHER_LNAME MOTHER_GNAME MDOB SFN_NUM);
Retain CaseID Disease_Classification_Status Disease MavenFN VitalsFN MavenLN VitalsLN MOTHER_LNAME_PRIOR MavenDOB VitalsDOB MavenCity Vitalscity 
SYMPTOM_ONSET_DATE Event_Date IDOB YEAR CHILD_GNAME CHILD_LNAME;
set final1;
if MavenFN ne VitalsFN then delete;
run;

The pertinent variable from the linelist are maternal DOB, maternal name, DOB, and pregnancy status.

Sample of what the vital records data looks like:

Reeza · Posted 05-20-2020 12:43 PM

I have a code that I ran, but for some reason, it is repeating the same unique ID number for multiple individuals and the unique linelist has name repeating.

This means that in one of your merges, you have a data set that has multiples in each data set. When you go through your log, look at the number of records of each step and see where it's getting duplicated. Then you can figure out where you need to filter it down and where the issue is. We cannot run your code without the data so if you can reduce this to simpler problem we can definitely help you out.

Spoiler

@takpdpb7 wrote:

Hello!

I am trying to merge two vital records datasets (that have maternal DOB and infant DOB) with a linelist of individuals (both male & female) who have a certain disease that have a unique ID #. Within the line-list of individuals, there is a variable for pregnancy status.

I am trying to ascertain pregnancy status by matching women on maternal DOB & name.

I have a code that I ran, but for some reason, it is repeating the same unique ID number for multiple individuals and the unique linelist has name repeating.

libname vitals "PATH";

data births2018 (keep=SFN_NUM CHILD_GNAME CHILD_LNAME IDOB MOTHER_GNAME MOTHER_LNAME MOTHER_LNAME_PRIOR MDOB RES_CITY YEAR);
set VITALS 2018;
run;

data births2019 (keep=SFN_NUM CHILD_GNAME CHILD_LNAME IDOB MOTHER_GNAME MOTHER_LNAME MOTHER_LNAME_PRIOR MDOB RES_CITY YEAR);
set VITALS2019;
run;

data births2018_2019 (drop=MDOB);
set births2018 births2019;
date = input(MDOB,mmddyy10.);
run;

data allbirths (drop=date RES_CITY);
length MOTHER_GNAME MOTHER_LNAME $13;
set births2018_2019;
MDOB=date;
VitalsDOB=date;
VitalsFN=MOTHER_GNAME;
VitalsLN=MOTHER_LNAME;
Vitalscity=RES_CITY; 
run;

proc import out=BAB 
datafile="PATH"
dbms=excel
replace;
run;

proc contents data=BAB; run; 

*this removes male cases*;
data BAB1 (keep= CaseID GENDER  Disease_Classification_Status Disease MOTHER_GNAME MOTHER_LNAME MDOB PREGNANCY_STATUS SYMPTOM_ONSET_DATE Event_Date OFFICIAL_CITY);
retain CaseID  GENDER Disease_Classification_Status Disease MOTHER_GNAME MOTHER_LNAME MDOB OFFICIAL_CITY SYMPTOM_ONSET_DATE Event_Date PREGNANCY_STATUS;
set BAB;
MDOB=BIRTH_DATE;
MOTHER_GNAME=First_Name;
MOTHER_LNAME=Last_Name;

MOTHER_GNAME=upcase(MOTHER_GNAME); 
MOTHER_LNAME=upcase(MOTHER_LNAME); 
if GENDER='Male' then delete;
run;

data BAB2 (drop= OFFICIAL_CITY GENDER) ;
set BAB1;
MavenFN=MOTHER_GNAME;
MavenLN=MOTHER_LNAME;
MavenCity=OFFICIAL_CITY;
MavenDOB=MDOB;
run;

proc sort data=BAB2; by MOTHER_GNAME MOTHER_LNAME MDOB;run;
proc sort data=allbirths; by MOTHER_GNAME MOTHER_LNAME MDOB; run;

*Match based on First Name, Last name and DOB; 
data Names;
merge BAB2 allbirths;
by MOTHER_GNAME MOTHER_LNAME MDOB;
if CaseID="" then delete;
if SFN_NUM=" " then delete;
run;

proc sort data=BAB2; by MOTHER_GNAME MOTHER_LNAME MDOB;run;
proc sort data=allbirths; by MOTHER_GNAME MOTHER_LNAME MDOB; run;

*Match based on First AND Last name; 
data Names;
merge BAB2 allbirths;
by MOTHER_GNAME MOTHER_LNAME MDOB;
if CaseID="" then delete;
if SFN_NUM=" " then delete;
run;

proc sort data=BAB2; by MDOB MOTHER_GNAME;run;
proc sort data=allbirths; by MDOB MOTHER_GNAME; run;

*Match based on DOB and First Name;
data together1;
merge BAB2 allbirths;
by MDOB MOTHER_GNAME;
if CaseID="" then delete;
if SFN_NUM=" " then delete;
run;

proc sort data=BAB2; by MDOB MOTHER_LNAME;run;
proc sort data=allbirths; by MDOB MOTHER_LNAME; run;

*Match based on DOB and Last Name; 
data together2; 
merge BAB2 allbirths;
by MDOB MOTHER_LNAME;
if CaseID="" then delete;
if SFN_NUM=" " then delete;
run;

proc sort data=BAB2; by MDOB;run;
proc sort data=allbirths; by MDOB; run;

*Match based on DOB and Mother's Prior Last Name; 
data DOB; 
merge BAB2 allbirths;
by MDOB;
if MOTHER_LNAME ne MOTHER_LNAME_PRIOR then delete;
if CaseID="" then delete;
if SFN_NUM=" " then delete;
run;

data final1;
set Names together1 together2 DOB;
run;

*There are a few individuals who had multiple pregnancy outcomes*;
data final2 (Drop= MOTHER_LNAME MOTHER_GNAME MDOB SFN_NUM);
Retain CaseID Disease_Classification_Status Disease MavenFN VitalsFN MavenLN VitalsLN MOTHER_LNAME_PRIOR MavenDOB VitalsDOB MavenCity Vitalscity 
SYMPTOM_ONSET_DATE Event_Date IDOB YEAR CHILD_GNAME CHILD_LNAME;
set final1;
if MavenFN ne VitalsFN then delete;
run;

The pertinent variable from the linelist are maternal DOB, maternal name, DOB, and pregnancy status.

Sample of what the vital records data looks like:

wrote: Hello! I am trying to merge two vital records datasets (that have maternal DOB and infant DOB) with a linelist of individuals (both male & female) who have a certain disease that have a unique ID #. Within the line-list of individuals, there is a variable for pregnancy status. I am trying to ascertain pregnancy status by matching women on maternal DOB & name. I have a code that I ran, but for some reason, it is repeating the same unique ID number for multiple individuals and the unique linelist has name repeating. libname vitals "PATH"; data births2018 (keep=SFN_NUM CHILD_GNAME CHILD_LNAME IDOB MOTHER_GNAME MOTHER_LNAME MOTHER_LNAME_PRIOR MDOB RES_CITY YEAR); set VITALS 2018; run; data births2019 (keep=SFN_NUM CHILD_GNAME CHILD_LNAME IDOB MOTHER_GNAME MOTHER_LNAME MOTHER_LNAME_PRIOR MDOB RES_CITY YEAR); set VITALS2019; run; data births2018_2019 (drop=MDOB); set births2018 births2019; date = input(MDOB,mmddyy10.); run; data allbirths (drop=date RES_CITY); length MOTHER_GNAME MOTHER_LNAME $13; set births2018_2019; MDOB=date; VitalsDOB=date; VitalsFN=MOTHER_GNAME; VitalsLN=MOTHER_LNAME; Vitalscity=RES_CITY; run; proc import out=BAB datafile="PATH" dbms=excel replace; run; proc contents data=BAB; run; *this removes male cases*; data BAB1 (keep= CaseID GENDER Disease_Classification_Status Disease MOTHER_GNAME MOTHER_LNAME MDOB PREGNANCY_STATUS SYMPTOM_ONSET_DATE Event_Date OFFICIAL_CITY); retain CaseID GENDER Disease_Classification_Status Disease MOTHER_GNAME MOTHER_LNAME MDOB OFFICIAL_CITY SYMPTOM_ONSET_DATE Event_Date PREGNANCY_STATUS; set BAB; MDOB=BIRTH_DATE; MOTHER_GNAME=First_Name; MOTHER_LNAME=Last_Name; MOTHER_GNAME=upcase(MOTHER_GNAME); MOTHER_LNAME=upcase(MOTHER_LNAME); if GENDER='Male' then delete; run; data BAB2 (drop= OFFICIAL_CITY GENDER) ; set BAB1; MavenFN=MOTHER_GNAME; MavenLN=MOTHER_LNAME; MavenCity=OFFICIAL_CITY; MavenDOB=MDOB; run; proc sort data=BAB2; by MOTHER_GNAME MOTHER_LNAME MDOB;run; proc sort data=allbirths; by MOTHER_GNAME MOTHER_LNAME MDOB; run; *Match based on First Name, Last name and DOB; data Names; merge BAB2 allbirths; by MOTHER_GNAME MOTHER_LNAME MDOB; if CaseID="" then delete; if SFN_NUM=" " then delete; run; proc sort data=BAB2; by MOTHER_GNAME MOTHER_LNAME MDOB;run; proc sort data=allbirths; by MOTHER_GNAME MOTHER_LNAME MDOB; run; *Match based on First AND Last name; data Names; merge BAB2 allbirths; by MOTHER_GNAME MOTHER_LNAME MDOB; if CaseID="" then delete; if SFN_NUM=" " then delete; run; proc sort data=BAB2; by MDOB MOTHER_GNAME;run; proc sort data=allbirths; by MDOB MOTHER_GNAME; run; *Match based on DOB and First Name; data together1; merge BAB2 allbirths; by MDOB MOTHER_GNAME; if CaseID="" then delete; if SFN_NUM=" " then delete; run; proc sort data=BAB2; by MDOB MOTHER_LNAME;run; proc sort data=allbirths; by MDOB MOTHER_LNAME; run; *Match based on DOB and Last Name; data together2; merge BAB2 allbirths; by MDOB MOTHER_LNAME; if CaseID="" then delete; if SFN_NUM=" " then delete; run; proc sort data=BAB2; by MDOB;run; proc sort data=allbirths; by MDOB; run; *Match based on DOB and Mother's Prior Last Name; data DOB; merge BAB2 allbirths; by MDOB; if MOTHER_LNAME ne MOTHER_LNAME_PRIOR then delete; if CaseID="" then delete; if SFN_NUM=" " then delete; run; data final1; set Names together1 together2 DOB; run; *There are a few individuals who had multiple pregnancy outcomes*; data final2 (Drop= MOTHER_LNAME MOTHER_GNAME MDOB SFN_NUM); Retain CaseID Disease_Classification_Status Disease MavenFN VitalsFN MavenLN VitalsLN MOTHER_LNAME_PRIOR MavenDOB VitalsDOB MavenCity Vitalscity SYMPTOM_ONSET_DATE Event_Date IDOB YEAR CHILD_GNAME CHILD_LNAME; set final1; if MavenFN ne VitalsFN then delete; run; The pertinent variable from the linelist are maternal DOB, maternal name, DOB, and pregnancy status. Sample of what the vital records data looks like: SFN_NUM CHILD_GNAME CHILD_LNAME IDOB MOTHER_GNAME MOTHER_LNAME MDOB

ballardw · Posted 05-20-2020 02:55 PM

The first problem I see is in this block of code:

*Match based on First Name, Last name and DOB; 
data Names;
merge BAB2 allbirths;
by MOTHER_GNAME MOTHER_LNAME MDOB;
if CaseID="" then delete;
if SFN_NUM=" " then delete;
run;

proc sort data=BAB2; by MOTHER_GNAME MOTHER_LNAME MDOB;run;
proc sort data=allbirths; by MOTHER_GNAME MOTHER_LNAME MDOB; run;

*Match based on First AND Last name; 
data Names;
merge BAB2 allbirths;
by MOTHER_GNAME MOTHER_LNAME MDOB;
if CaseID="" then delete;
if SFN_NUM=" " then delete;
run;

The second date step overwrites the first version of Names which does not show anything that preserves or uses it in between. So you do not have a match on all three fields.

I have done similar processes and as soon as you identify a match then the matched records need to be removed from BOTH data sets to avoid re-matching already matched persons. Since all of your merges involve the exact same data sets then when

Mary Jones 1/1/1990 matches in the BAB2 and Allbirths sets, BOTH of those are available.

So they BOTH match

Mary Smith 1/1/1990 in your First name and Dob match.

And then you stack the result sets. Multiplying the matches.

Here is an example of removing the matched values and creating reduced sets in a merge.

*Match based on First Name, Last name and DOB; 
data matchNames1
     bab2only
     allbirthonly
    ;
   merge BAB2  (in=inbabs)
         allbirths (in=inall);
   by MOTHER_GNAME MOTHER_LNAME MDOB;
   if inbabs and inall then output MatchNames1;
   else if inbabs then output bab2only;
   else output allbirthonly;
run;

The IN= dataset option creates a temporary variable that is 1 (true) when the current record from that set contributes to the merge and 0 otherwise.

So when all of the IN= variables are true then all of the data sources contributed and you have a full match. You can create multiple data sets in a single pass. The explicit OUTPUT tells when to write to which set name.

You likely need to have a Keep or Drop statement associated with one or more of the Babs2only and Allbirthonly so their structure stays the same (only the variables they to begin with).

I would suggest a similar step at each Merge creating a different Match, Babsonly and Allbirthsonly at every step.

Something else to consider is how Proc Import works with Excel files. It will only examine 20 rows of data by default before assigning variable type (character or numeric) and Length. Which can lead to truncation of names if the first 20 last names are all less than 20 characters and then you have a long name like Cartwright-Chickering (21 characters) so the "g" at the end would get truncated. It would be better to open the file in your spreadsheet and the save as CSV to import. Then use the option GUESSINGROWS=MAX; So the entire file is read before the length of variables is set.

With your current example where you are setting names to 13 characters, I am also wondering if that length is appropriate. I have some pretty small files, only a few hundred people and 25 characters for each of first and last name currently provides me with 2 unused character (i.e I have names of 23 characters). My code was intended to allow for some so that when I read later data I wasn't likely to have issues and I picked 25 when the longest name I had in the data was 17 characters. I'm glad I picked something on the order of 25 as otherwise I would have a slew of sets with potential issues of mismatched name lengths.

But I am extremely leery of your First Name Dob match as a blind faith match unless both of your sets are VERY small.

And you may have to use a manual step or two at the end to find things where spelling errors have crept in "Marry" instead of "Mary" or nicknames "Bobby" instead of "Roberta".

Or you can look for other tools that do probabilistic matching. The CDC website has a tool name LINKPLUS that you can download that will take two text files and do the matches. One nice thing is that you don't have to change variable "names" as you can tell it to match "FirstName" in one set to "NameFirst" in the second set.

The result file will match through a hierarchy of variables you specify (including Address fields if you have them, even if just postal code or city name) and give a probability of matches between values.

CHILD_GNAME

CHILD_LNAME

MOTHER_GNAME

MOTHER_LNAME