Hi, I am linking a vaccine dataset with a COVID-19 case dataset using the first and last names. I made sure the format, and length of variables are the same. The linkage works only for a handful of records and many with the same exact name and last name do not link. If the linkage works for some cases, it should work for all the matching cases? Please let me know if you can think of any solution.
Thanks...
First thing would be to share the code for how you doing the linkage.
Second would be to provide some examples where the link fails and some where it works.
Provide the data in the form of data steps so we can test code.
Common issues with names, mostly boil down to making an exact comparison with spelling differences. "Alice" does not match " Alice" (leading space), "Le Blanc" does not match "LeBlanc" or "Le blanc", foreign language characters such as accents only appearing in one version or use of something else along with the names that does not match. There are things that can creep in if you are comparing one variable with first and last name versus two variables with only the first or last name in each.
Generally the approach is identify why some don't work, modify the data, remove or change characters in the data so the values are consistent, make sure the language coding and character sets are the same.
Can't provide specific choices without specific examples.
Thank you very much for the reply. I think the problem is probably the encoding while I import the data "utf-16le". The second file encoding seems to be WLatin1. I got this far by using your suggestions and SAS help. But it is taking way too long to fix. it will be great if you could help me.
Please see below for the log FOR IMPORT AND LINKAE:
NOTE: A byte-order mark in the file "C:\Users\-----------------------.txt" (for
fileref "#LN00084") indicates that the data is encoded in "utf-16le". This encoding will
be used to process the file.
NOTE: The infile 'C:\Users\------------.txt' is:
Filename=C:\Users\-------------------.txt,
RECFM=V,LRECL=65534,File Size (bytes)=96315034,
Last Modified=23Apr2021:11:05:04,
Create Time=23Apr2021:10:50:37
NOTE: Invalid data for Age in line 42553 233-241.
NOTE: Invalid data for LHJNumber in line 42553 327-330.
NOTE: Invalid data for DtInvestigator in line 42553 454-457.
RULE: ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+---
RULE: ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+---
42553 21734345.DI..Coronavirus Disease 2019 - Non-positive ELR.E-NCOV2019NP."All-NoHIV, All-No
ZONE 3333333304400467666767772467667623333222466276767676244504244453333450246624644522466246
NUMR 21734345949993F2FE16925304935135020190D0EFED0F39496505C295DE3F62019E0921CCDEF896C01CCDEF
89 TB, All-WithHIV, CD, IDB, IZB, nCoV2019, VRDL, zCoronavirus Disease 2019 - Non-positive
MORE OF THESE AND FINALLY:
NOTE: 95867 records were read from the infile 'C:\Users\-----------------------------.txt'.
The minimum record length was 415.
The maximum record length was 7259.
NOTE: The data set CAL.ALLNAPA has 95867 observations and 81 variables.
NOTE: DATA statement used (Total process time):
real time 1.27 seconds
cpu time 1.25 seconds
Errors detected in submitted DATA step. Examine log.
95867 rows created in CAL.ALLNAPA from C:\Users\----------------.txt.
ERROR: Import unsuccessful. See SAS Log for details.
HERE IS THE LINKAGE ERROR MESSAGE:
2119 PROC SORT DATA = IRIS;
2120 BY recip_first_name recip_last_name RECIP_DOB;
2121 RUN;
NOTE: There were 121741 observations read from the data set IRIS.IRIS412.
NOTE: The data set IRIS has 121741 observations and 69 variables.
NOTE: PROCEDURE SORT used (Total process time):
real time 0.22 seconds
cpu time 0.20 seconds
2122
2123 PROC SORT DATA = CAL;
2124 BY recip_last_name recip_first_name RECIP_DOB;
2125 RUN;
NOTE: There were 95867 observations read from the data set CAL.
NOTE: The data set CAL has 95867 observations and 82 variables.
NOTE: PROCEDURE SORT used (Total process time):
real time 1.69 seconds
cpu time 0.78 seconds
2126
2127 data CAL.CALVAX;
2128 merge IRIS (in=a) CAL (in=b);
2129 by recip_first_name recip_last_name RECIP_DOB;
2130 IRIS = a;
2131 CAL = b;
2132 run;
ERROR: BY variables are not properly sorted on data set CAL.
LAST.recip_first_name=1 FIRST.Recip_last_name=1 LAST.Recip_last_name=1 FIRST.recip_dob=1
LAST.recip_dob=1 IRIS=. CAL=. _ERROR_=1 _N_=72367
...
...
NOTE: The SAS System stopped processing this step because of errors.
NOTE: There were 72367 observations read from the data set IRIS.
NOTE: There were 2 observations read from the data set CAL.
WARNING: The data set CAL.CALVAX may be incomplete. When this step was stopped there were 72366
observations and 149 variables.
WARNING: Data set CAL.CALVAX was not replaced because this step was stopped.
NOTE: DATA statement used (Total process time):
real time 1.01 seconds
cpu time 0.25 seconds
2133 proc freq data = CAL.CALVAX;
2134 table IRIS*CAL;
2135 run;
NOTE: There were 131981 observations read from the data set CAL.CALVAX.
NOTE: PROCEDURE FREQ used (Total process time):
real time 1.41 seconds
cpu time 0.54 seconds
Another area to investigate ... sometimes values look like they match but they really don't. For example, some values might end with a tab character instead of a space. Take two values that look like they should match (but don't), and inspect them more closely. The easiest way:
proc print data=should_match;
var lastname firstname;
format lastname firstname $hex40.;
run;
You could easily find differences between values that you think should match.
It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.