The first problem I see is in this block of code:
*Match based on First Name, Last name and DOB;
data Names;
merge BAB2 allbirths;
by MOTHER_GNAME MOTHER_LNAME MDOB;
if CaseID="" then delete;
if SFN_NUM=" " then delete;
run;
proc sort data=BAB2; by MOTHER_GNAME MOTHER_LNAME MDOB;run;
proc sort data=allbirths; by MOTHER_GNAME MOTHER_LNAME MDOB; run;
*Match based on First AND Last name;
data Names;
merge BAB2 allbirths;
by MOTHER_GNAME MOTHER_LNAME MDOB;
if CaseID="" then delete;
if SFN_NUM=" " then delete;
run;
The second date step overwrites the first version of Names which does not show anything that preserves or uses it in between. So you do not have a match on all three fields.
I have done similar processes and as soon as you identify a match then the matched records need to be removed from BOTH data sets to avoid re-matching already matched persons. Since all of your merges involve the exact same data sets then when
Mary Jones 1/1/1990 matches in the BAB2 and Allbirths sets, BOTH of those are available.
So they BOTH match
Mary Smith 1/1/1990 in your First name and Dob match.
And then you stack the result sets. Multiplying the matches.
Here is an example of removing the matched values and creating reduced sets in a merge.
*Match based on First Name, Last name and DOB;
data matchNames1
bab2only
allbirthonly
;
merge BAB2 (in=inbabs)
allbirths (in=inall);
by MOTHER_GNAME MOTHER_LNAME MDOB;
if inbabs and inall then output MatchNames1;
else if inbabs then output bab2only;
else output allbirthonly;
run;
The IN= dataset option creates a temporary variable that is 1 (true) when the current record from that set contributes to the merge and 0 otherwise.
So when all of the IN= variables are true then all of the data sources contributed and you have a full match. You can create multiple data sets in a single pass. The explicit OUTPUT tells when to write to which set name.
You likely need to have a Keep or Drop statement associated with one or more of the Babs2only and Allbirthonly so their structure stays the same (only the variables they to begin with).
I would suggest a similar step at each Merge creating a different Match, Babsonly and Allbirthsonly at every step.
Something else to consider is how Proc Import works with Excel files. It will only examine 20 rows of data by default before assigning variable type (character or numeric) and Length. Which can lead to truncation of names if the first 20 last names are all less than 20 characters and then you have a long name like Cartwright-Chickering (21 characters) so the "g" at the end would get truncated. It would be better to open the file in your spreadsheet and the save as CSV to import. Then use the option GUESSINGROWS=MAX; So the entire file is read before the length of variables is set.
With your current example where you are setting names to 13 characters, I am also wondering if that length is appropriate. I have some pretty small files, only a few hundred people and 25 characters for each of first and last name currently provides me with 2 unused character (i.e I have names of 23 characters). My code was intended to allow for some so that when I read later data I wasn't likely to have issues and I picked 25 when the longest name I had in the data was 17 characters. I'm glad I picked something on the order of 25 as otherwise I would have a slew of sets with potential issues of mismatched name lengths.
But I am extremely leery of your First Name Dob match as a blind faith match unless both of your sets are VERY small.
And you may have to use a manual step or two at the end to find things where spelling errors have crept in "Marry" instead of "Mary" or nicknames "Bobby" instead of "Roberta".
Or you can look for other tools that do probabilistic matching. The CDC website has a tool name LINKPLUS that you can download that will take two text files and do the matches. One nice thing is that you don't have to change variable "names" as you can tell it to match "FirstName" in one set to "NameFirst" in the second set.
The result file will match through a hierarchy of variables you specify (including Address fields if you have them, even if just postal code or city name) and give a probability of matches between values.
... View more