I have merged two dataset with common variables (id), but unfortunately after I proc print, the variables are not lining up with the two datasets
***************************************************************
NW NTM Biobank Dashboard
***************************************************************
***
Setting the library
***;
libname enroll "X:\SPH\Restricted\Winthrop\Current Projects\Biobank OCTRI\SAS\";
***
Load the data from SAS
***;
proc import file="X:\SPH\Restricted\Winthrop\Current Projects\Biobank OCTRI\SAS\OCTRI4259NTMDiseaseP_DATA_2023-09-14_1057.csv"
out=work.redcap
dbms=csv replace;
run;
proc import file="X:\SPH\Restricted\Winthrop\Current Projects\Biobank OCTRI\SAS\OCTRI_data.xlsx"
out=work.access
dbms=xlsx;
run;
options MSGLEVEL=I;
proc sort data=redcap; by id; run;
proc sort data=access; by ID; run;
**Convert sex to a characteristic**;
data want;
set access;
if sex='M' then sex_r=1;
if sex='F' then sex_r=0;
drop sex;
rename sex_r=sex;
run;
**Merge both datasets into one excel**;
data enrollments;
merge redcap want;
by id;
run;
proc print data=enrollments;
var id newntm sex oldntmculture micro1result micro1resultmulti micro2result micro2resultmulti source;
run;
The ID start at 1001, but the source (clinic/clinical trial, etc.) start at 1. Here is my log:
NOTE: The infile 'X:\SPH\Restricted\Winthrop\Current Projects\Biobank OCTRI\SAS\OCTRI4259NTMDiseaseP_DATA_2023-09-14_1057.csv' is: Filename=X:\SPH\Restricted\Winthrop\Current Projects\Biobank OCTRI\SAS\OCTRI4259NTMDiseaseP_DATA_2023-09-14_1057.csv, RECFM=V,LRECL=32767,File Size (bytes)=2318647, Last Modified=14Sep2023:10:58:14, Create Time=14Sep2023:10:58:14 NOTE: 614 records were read from the infile 'X:\SPH\Restricted\Winthrop\Current Projects\Biobank OCTRI\SAS\OCTRI4259NTMDiseaseP_DATA_2023-09-14_1057.csv'. The minimum record length was 3393. The maximum record length was 4400. NOTE: The data set WORK.REDCAP has 614 observations and 2582 variables. NOTE: DATA statement used (Total process time): real time 1.34 seconds cpu time 1.23 seconds 614 rows created in WORK.REDCAP from X:\SPH\Restricted\Winthrop\Current Projects\Biobank OCTRI\SAS\OCTRI4259NTMDiseaseP_DATA_2023-09-14_1057.csv. NOTE: WORK.REDCAP data set was successfully created. NOTE: The data set WORK.REDCAP has 614 observations and 2582 variables. NOTE: PROCEDURE IMPORT used (Total process time): real time 28.67 seconds cpu time 28.40 seconds 54699 54700 proc import file="X:\SPH\Restricted\Winthrop\Current Projects\Biobank 54700! OCTRI\SAS\OCTRI_data.xlsx" 54701 out=work.access 54702 dbms=xlsx; 54703 run; NOTE: Import cancelled. Output dataset WORK.ACCESS already exists. Specify REPLACE option to overwrite it. NOTE: The SAS System stopped processing this step because of errors. NOTE: PROCEDURE IMPORT used (Total process time): real time 0.00 seconds cpu time 0.00 seconds 54704 54705 options MSGLEVEL=I; 54706 54707 proc sort data=redcap; by id; run; NOTE: There were 614 observations read from the data set WORK.REDCAP. NOTE: SAS sort was used. NOTE: The data set WORK.REDCAP has 614 observations and 2582 variables. NOTE: PROCEDURE SORT used (Total process time): real time 0.02 seconds cpu time 0.03 seconds 54708 proc sort data=access; by ID; run; NOTE: Input data set is already sorted, no sorting done. NOTE: PROCEDURE SORT used (Total process time): real time 0.00 seconds cpu time 0.00 seconds 54709 **Convert sex to a characteristic**; 54710 54711 data want; 54712 set access; 54713 if sex='M' then sex_r=1; 54714 if sex='F' then sex_r=0; 54715 drop sex; 54716 rename sex_r=sex; 54717 run; NOTE: There were 848 observations read from the data set WORK.ACCESS. NOTE: The data set WORK.WANT has 848 observations and 73 variables. NOTE: DATA statement used (Total process time): real time 0.01 seconds cpu time 0.01 seconds 54718 54719 **Merge both datasets into one excel**; 54720 54721 data enrollments; 54722 merge redcap want; 54723 by id; 54724 run; INFO: The variable sex on data set WORK.REDCAP will be overwritten by data set WORK.WANT. NOTE: There were 614 observations read from the data set WORK.REDCAP. NOTE: There were 848 observations read from the data set WORK.WANT. NOTE: The data set WORK.ENROLLMENTS has 1462 observations and 2653 variables. NOTE: DATA statement used (Total process time): real time 0.06 seconds cpu time 0.04 seconds 54725 54726 54727 proc print data=enrollments; 54728 var id newntm sex oldntmculture micro1result micro1resultmulti micro2result 54728! micro2resultmulti source; 54729 run; NOTE: There were 1462 observations read from the data set WORK.ENROLLMENTS. NOTE: PROCEDURE PRINT used (Total process time): real time 0.51 seconds cpu time 0.45 seconds
How can I correct this? TIA.
NOTE: There were 614 observations read from the data set WORK.REDCAP. NOTE: There were 848 observations read from the data set WORK.WANT. NOTE: The data set WORK.ENROLLMENTS has 1462 observations and 2653 variables.
614 + 848 = 1462. This means that there is not a single match for ID in your datasets.
Hi Kurt,
Thanks for the response.
So, when I tried it prior using
merge redcap (in=a) want (in=b); by ID; if a; run;
the ID variable with the two datasets DID line up, but the problem there was the data from the "want" dataset showed up as "."
So I played around with it, and now this problem is occurring.
Thanks for your help.
I believe you're missing the point of these none matching id's. See if below examples help you clarify the issue.
All variables from your source tables get added to the target table (the pdv) but they only get populated if at least one of the tables contributes a value to it.
If there is a row from source table 1 with no match to source table 2 then all variables that only exist in source table 2 will be missing.
Here a merge of two tables where none of the ID's match.
This was your initial problem where you observed that the number of rows in the result table is the sum of the rows from both source tables.
Variable ina is 1 if table redcap contributes to the result, variable inb is 1 if table want contributes to the result. Else the value is 0.
data redcap;
input id var_r $;
datalines;
1001 A
1002 b
;
data want;
input id var_w $;
datalines;
1 X
2 Y
;
data merge_1;
merge redcap(in=a) want(in=b);
by id;
ina=a;
inb=b;
run;
proc print data=merge_1;
run;
If you subset the result to only select cases where table redcap contributes data then the result table will have the same number of rows as source table redcap. Because none of the id's match all variables that only exist in source table want will be missing.
data merge_2;
merge redcap(in=a) want(in=b);
by id;
if a;
ina=a;
inb=b;
run;
proc print data=merge_2;
run;
From what you describe the id's in the one source table start at one and in the other source table start at 1000. For the id's to match in the merge you need to transform the value in one of the tables. The correct transformation logic depends on your real data. For the example here it's just about adding 1000 to the set of id's that start at 1.
data want2;
set want(rename=(id=id_in));
id=id_in+1000;
drop id_in;
run;
data merge_3;
merge redcap(in=a) want2(in=b);
by id;
ina=a;
inb=b;
run;
proc print data=merge_3;
run;
Do above examples explain the observed behavior to you and give you some guidance how to resolve the challenge?
You mentioned that one dataset's ID start at 1 but the other starts at 1001. If you know IDs 1 and 1001 are the same individual then their observations should match. You'll have to change one of your ID variables by either adding or subtracting 1000.
Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9.
Early bird rate extended! Save $200 when you sign up by March 31.
SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.
Find more tutorials on the SAS Users YouTube channel.