BookmarkSubscribeRSS Feed
takpdpb7
Calcite | Level 5

Hello! Really looking for some help with my code. I have three data sets. One is an Excel data file that I have cleaned to only include the most pertinent data of maternal disease status (first name, last name, DOB, disease classification status, pregnancy status). I am merging this Excel file to two SAS datasets that are vital records datasets from 2018 and 2019 that contains the previous variables, in addition to their infant's information. 

 

I am trying to determine women that had the disease AND were pregnant. By merging the disease excel file with the vital records datasets, I should only receive a list of women who meet those two criteria. 

 

I decided to use fuzzy matching to identify the matches. 

libname vitals "path";
data births2018 (keep= SFN_NUM MOTHER_GNAME MOTHER_LNAME MOTHER_LNAME_PRIOR MDOB);
set vitals.maresbths18_02182020;
run;
data births2019 (keep= SFN_NUM MOTHER_GNAME MOTHER_LNAME MOTHER_LNAME_PRIOR MDOB);
set vitals.Maresbths19_02182020;
run;
data births2018_2019 (drop=MDOB);
length MOTHER_GNAME MOTHER_LNAME MOTHER_LNAME_PRIOR $128;
set births2018 births2019;
date = input(MDOB,mmddyy10.);
format date mmddyy10.;
run;
data allbirths;
length MOTHER_GNAME MOTHER_LNAME $128;
set births2018_2019; 
MDOB=date;
VitalsDOB=date;
VitalsFN=MOTHER_GNAME;
VitalsLN=MOTHER_LNAME;
if MOTHER_GNAME=" " or MOTHER_GNAME = "---" then delete;
if MOTHER_LNAME=" " or MOTHER_LNAME = "---" then delete;
if MDOB = "." then delete; 
if VitalsDOB = "." then delete; 
run;
proc import out=BAB 
datafile="path"
dbms=excel replace;
run;
data BAB1(keep= CaseID DISEASE_CLASSIFICATION_STATUS MOTHER_GNAME MOTHER_LNAME MDOB PREGNANCY_STATUS);
retain CaseID DISEASE_CLASSIFICATION_STATUS MOTHER_GNAME MOTHER_LNAME MDOB PREGNANCY_STATUS MDOB;
set BAB;
format MDOB mmddyy10.;
MOTHER_GNAME=COMPRESS(MOTHER_GNAME, " ", "p");
MOTHER_LNAME=COMPRESS(MOTHER_LNAME, " " , "p");
run;
proc sort data=BAB1; by MOTHER_GNAME MOTHER_LNAME MDOB;run;
proc sort data=allbirths; by MOTHER_GNAME MOTHER_LNAME date; run;
data matched; 
merge allbirths (in=a) BAB1 (in=b);
by MDOB;
if a; 
GNAME_COMPLEV=COMPLEV(MOTHER_GNAME, MOTHER_GNAME, ':iln');
LNAME_COMPLEV=COMPLEV(MOTHER_LNAME, MOTHER_LNAME);
TOTALCOMP=GNAME_COMPLEV+LNAME_COMPLEV;
if totalcomp>5 then delete;
run;
proc print data= matched;run;

LOG:

30 data BAB1(keep= CaseID DISEASE_CLASSIFICATION_STATUS MOTHER_GNAME MOTHER_LNAME MDOB
30 ! PREGNANCY_STATUS);
31 retain CaseID DISEASE_CLASSIFICATION_STATUS MOTHER_GNAME MOTHER_LNAME MDOB PREGNANCY_STATUS
31 ! MDOB;
32 set BAB;
33 format MDOB mmddyy10.;
34 MOTHER_GNAME=COMPRESS(MOTHER_GNAME, " ", "p");
35 MOTHER_LNAME=COMPRESS(MOTHER_LNAME, " " , "p");
36 run;

NOTE: There were 597 observations read from the data set WORK.BAB.
NOTE: The data set WORK.BAB1 has 597 observations and 6 variables.
NOTE: DATA statement used (Total process time):
real time 0.03 seconds
cpu time 0.03 seconds


37 proc sort data=BAB1; by MOTHER_GNAME MOTHER_LNAME MDOB;run;

NOTE: There were 597 observations read from the data set WORK.BAB1.
NOTE: The data set WORK.BAB1 has 597 observations and 6 variables.
NOTE: PROCEDURE SORT used (Total process time):
real time 0.03 seconds
cpu time 0.03 seconds


38 proc sort data=allbirths; by MOTHER_GNAME MOTHER_LNAME date; run;

NOTE: There were 137662 observations read from the data set WORK.ALLBIRTHS.
NOTE: The data set WORK.ALLBIRTHS has 137662 observations and 9 variables.
NOTE: PROCEDURE SORT used (Total process time):
real time 0.26 seconds
cpu time 0.45 seconds


39 data matched;
40 merge allbirths (in=a) BAB1 (in=b);
41 by MDOB;
42 if a;
43 GNAME_COMPLEV=COMPLEV(MOTHER_GNAME, MOTHER_GNAME, ':iln');
44 LNAME_COMPLEV=COMPLEV(MOTHER_LNAME, MOTHER_LNAME);
45 TOTALCOMP=GNAME_COMPLEV+LNAME_COMPLEV;
46 if totalcomp>5 then delete;
47 run;

ERROR: BY variables are not properly sorted on data set WORK.ALLBIRTHS.
a=1 b=0 MOTHER_GNAME=A'KAYLA MOTHER_LNAME=LIPSCOMB MOTHER_LNAME_PRIOR=LIPSCOMB SFN_NUM=017274
date=02/05/2002 MDOB=02/05/2002 VitalsDOB=15376 VitalsFN=A'KAYLA VitalsLN=LIPSCOMB CaseID=
DISEASE_CLASSIFICATION_STATUS= PREGNANCY_STATUS= FIRST.MDOB=1 LAST.MDOB=1 GNAME_COMPLEV=.
LNAME_COMPLEV=. TOTALCOMP=. _ERROR_=1 _N_=2
NOTE: The SAS System stopped processing this step because of errors.
NOTE: There were 3 observations read from the data set WORK.ALLBIRTHS.
NOTE: There were 1 observations read from the data set WORK.BAB1.
WARNING: The data set WORK.MATCHED may be incomplete. When this step was stopped there were 1
observations and 15 variables.
NOTE: DATA statement used (Total process time):
real time 0.05 seconds
cpu time 0.03 seconds


48 proc print data= matched;run;
NOTE: Writing HTML Body file: sashtml.htm

NOTE: There were 1 observations read from the data set WORK.MATCHED.
NOTE: PROCEDURE PRINT used (Total process time):
real time 0.58 seconds
cpu time 0.12 seconds

 

 

 

 

 

5 REPLIES 5
PaigeMiller
Diamond | Level 26
40 merge allbirths (in=a) BAB1 (in=b);
41 by MDOB;

To do this merge, both data sets have to be sorted by MDOB. They are not sorted by MDOB.

 

Please, in the future, when you show us the log, paste it into the window that appears when you click on the </> icon, as I have done. This preserves the formatting of the log and makes it much more readable.

--
Paige Miller
takpdpb7
Calcite | Level 5

Thank you--a similar error is occuring the line above that now. 

"ERROR: BY variables are not properly sorted on data set WORK.BAB1.
a=0 b=1 MOTHER_GNAME=Abbey MOTHER_LNAME=McGowan MOTHER_LNAME_PRIOR=  SFN_NUM=  date=.
MDOB=11/21/2013 VitalsDOB=. VitalsFN=  VitalsLN=  CaseID=103224572
DISEASE_CLASSIFICATION_STATUS=PROBABLE PREGNANCY_STATUS=NO FIRST.MDOB=1 LAST.MDOB=1
GNAME_COMPLEV=. LNAME_COMPLEV=. TOTALCOMP=. _ERROR_=1 _N_=137663
NOTE: The SAS System stopped processing this step because of errors.
NOTE: There were 137662 observations read from the data set WORK.ALLBIRTHS.
NOTE: There were 2 observations read from the data set WORK.BAB1.
WARNING: The data set WORK.MATCHED may be incomplete.  When this step was stopped there were
         137662 observations and 15 variables.
WARNING: Data set WORK.MATCHED was not replaced because this step was stopped.
NOTE: DATA statement used (Total process time):
      real time           0.32 seconds
      cpu time            0.28 seconds"

The output also seems to be wrong? 

The data set WORK.MATCHED has 137662 observations and 15 variables.

There should only be a match on the women who were pregnant and had the disease, and this is too many to be accurate. 

 

Thank you! 

I appreciate your help!

 

 

ballardw
Super User

Same error, sort first.

 

A critical part of that is

WARNING: Data set WORK.MATCHED was not replaced because this step was stopped.

The bit about the number of records tells you about where something went wrong but you can't see the results because your previous version of the data set was NOT REPLACED.

 

For what it might be worth, I tend to create a full name as something like "lastname firstname" before the compged function call. Then sorting the output on the matching varaible like MDOB and descending compged value tends to cluster full names better than two compged values.

 

 

takpdpb7
Calcite | Level 5

Great! I sorted both of the variables by the MDOB variable, but now the date and MDOB variables are repeating several times, instead of giving me the matched set?

ballardw
Super User

If a BY variable (or group) has more than one record in one data set then the matches from the other set will get duplicated.

 

If both data sets have more than on record with the same by variable values the result can be somewhat unpredictable and generally not desired. If this is the case then you may need to provide example data and what you expect for the result as the techniques are likely to require more than a simple merge.

 

SAS Innovate 2025: Call for Content

Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 16. Read more here about why you should contribute and what is in it for you!

Submit your idea!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 5 replies
  • 556 views
  • 0 likes
  • 3 in conversation