Variables are not lined up correctly

grhyne · Posted 11-08-2023 01:34 PM

I have merged two dataset with common variables (id), but unfortunately after I proc print, the variables are not lining up with the two datasets

***************************************************************

		NW NTM Biobank Dashboard

***************************************************************



***
Setting the library
***;


libname enroll "X:\SPH\Restricted\Winthrop\Current Projects\Biobank OCTRI\SAS\";

***
Load the data from SAS
***;

proc import file="X:\SPH\Restricted\Winthrop\Current Projects\Biobank OCTRI\SAS\OCTRI4259NTMDiseaseP_DATA_2023-09-14_1057.csv"
out=work.redcap
dbms=csv replace;
run;

proc import file="X:\SPH\Restricted\Winthrop\Current Projects\Biobank OCTRI\SAS\OCTRI_data.xlsx"
out=work.access
dbms=xlsx;
run;

options MSGLEVEL=I;

proc sort data=redcap; by id; run;
proc sort data=access; by ID; run;
**Convert sex to a characteristic**;

data want;
set access;
if sex='M' then sex_r=1;
if sex='F' then sex_r=0;
drop sex;
rename sex_r=sex;
run;

**Merge both datasets into one excel**;

data enrollments;
merge redcap want;
by id;
run;


proc print data=enrollments;
var id newntm sex oldntmculture micro1result micro1resultmulti micro2result micro2resultmulti source;
run;

The ID start at 1001, but the source (clinic/clinical trial, etc.) start at 1. Here is my log:

NOTE: The infile 'X:\SPH\Restricted\Winthrop\Current Projects\Biobank
      OCTRI\SAS\OCTRI4259NTMDiseaseP_DATA_2023-09-14_1057.csv' is:

      Filename=X:\SPH\Restricted\Winthrop\Current Projects\Biobank
      OCTRI\SAS\OCTRI4259NTMDiseaseP_DATA_2023-09-14_1057.csv,
      RECFM=V,LRECL=32767,File Size (bytes)=2318647,
      Last Modified=14Sep2023:10:58:14,
      Create Time=14Sep2023:10:58:14

NOTE: 614 records were read from the infile 'X:\SPH\Restricted\Winthrop\Current
      Projects\Biobank OCTRI\SAS\OCTRI4259NTMDiseaseP_DATA_2023-09-14_1057.csv'.
      The minimum record length was 3393.
      The maximum record length was 4400.
NOTE: The data set WORK.REDCAP has 614 observations and 2582 variables.
NOTE: DATA statement used (Total process time):
      real time           1.34 seconds
      cpu time            1.23 seconds


614 rows created in WORK.REDCAP from X:\SPH\Restricted\Winthrop\Current Projects\Biobank
OCTRI\SAS\OCTRI4259NTMDiseaseP_DATA_2023-09-14_1057.csv.



NOTE: WORK.REDCAP data set was successfully created.
NOTE: The data set WORK.REDCAP has 614 observations and 2582 variables.
NOTE: PROCEDURE IMPORT used (Total process time):
      real time           28.67 seconds
      cpu time            28.40 seconds


54699
54700  proc import file="X:\SPH\Restricted\Winthrop\Current Projects\Biobank
54700! OCTRI\SAS\OCTRI_data.xlsx"
54701  out=work.access
54702  dbms=xlsx;
54703  run;

NOTE: Import cancelled.  Output dataset WORK.ACCESS already exists.  Specify REPLACE option to
      overwrite it.
NOTE: The SAS System stopped processing this step because of errors.
NOTE: PROCEDURE IMPORT used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds

54704
54705  options MSGLEVEL=I;
54706


54707  proc sort data=redcap; by id; run;

NOTE: There were 614 observations read from the data set WORK.REDCAP.
NOTE: SAS sort was used.
NOTE: The data set WORK.REDCAP has 614 observations and 2582 variables.
NOTE: PROCEDURE SORT used (Total process time):
      real time           0.02 seconds
      cpu time            0.03 seconds


54708  proc sort data=access; by ID; run;

NOTE: Input data set is already sorted, no sorting done.
NOTE: PROCEDURE SORT used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds


54709  **Convert sex to a characteristic**;
54710
54711  data want;
54712  set access;
54713  if sex='M' then sex_r=1;
54714  if sex='F' then sex_r=0;
54715  drop sex;
54716  rename sex_r=sex;
54717  run;

NOTE: There were 848 observations read from the data set WORK.ACCESS.
NOTE: The data set WORK.WANT has 848 observations and 73 variables.
NOTE: DATA statement used (Total process time):
      real time           0.01 seconds
      cpu time            0.01 seconds


54718
54719  **Merge both datasets into one excel**;
54720
54721  data enrollments;
54722  merge redcap want;
54723  by id;
54724  run;

INFO: The variable sex on data set WORK.REDCAP will be overwritten by data set WORK.WANT.
NOTE: There were 614 observations read from the data set WORK.REDCAP.
NOTE: There were 848 observations read from the data set WORK.WANT.
NOTE: The data set WORK.ENROLLMENTS has 1462 observations and 2653 variables.
NOTE: DATA statement used (Total process time):
      real time           0.06 seconds
      cpu time            0.04 seconds


54725
54726
54727  proc print data=enrollments;
54728  var id newntm sex oldntmculture micro1result micro1resultmulti micro2result
54728! micro2resultmulti source;
54729  run;

NOTE: There were 1462 observations read from the data set WORK.ENROLLMENTS.
NOTE: PROCEDURE PRINT used (Total process time):
      real time           0.51 seconds
      cpu time            0.45 seconds

How can I correct this? TIA.

Kurt_Bremser · Posted 11-08-2023 01:49 PM

NOTE: There were 614 observations read from the data set WORK.REDCAP.
NOTE: There were 848 observations read from the data set WORK.WANT.
NOTE: The data set WORK.ENROLLMENTS has 1462 observations and 2653 variables.

614 + 848 = 1462. This means that there is not a single match for ID in your datasets.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

grhyne · Posted 11-08-2023 06:37 PM

Hi Kurt,

Thanks for the response.

So, when I tried it prior using

merge redcap (in=a) want (in=b);
by ID;
if a;
run;

the ID variable with the two datasets DID line up, but the problem there was the data from the "want" dataset showed up as "."

So I played around with it, and now this problem is occurring.

Thanks for your help.

Patrick · Posted 11-08-2023 09:48 PM

I believe you're missing the point of these none matching id's. See if below examples help you clarify the issue.

All variables from your source tables get added to the target table (the pdv) but they only get populated if at least one of the tables contributes a value to it.

If there is a row from source table 1 with no match to source table 2 then all variables that only exist in source table 2 will be missing.

Here a merge of two tables where none of the ID's match.

This was your initial problem where you observed that the number of rows in the result table is the sum of the rows from both source tables.

Variable ina is 1 if table redcap contributes to the result, variable inb is 1 if table want contributes to the result. Else the value is 0.

data redcap;
  input id var_r $;
  datalines;
1001 A
1002 b
;

data want;
  input id var_w $;
  datalines;
1 X
2 Y
;


data merge_1;
  merge redcap(in=a) want(in=b);
  by id;
  ina=a;
  inb=b;
run;
proc print data=merge_1;
run;

If you subset the result to only select cases where table redcap contributes data then the result table will have the same number of rows as source table redcap. Because none of the id's match all variables that only exist in source table want will be missing.

data merge_2;
  merge redcap(in=a) want(in=b);
  by id;
  if a;
  ina=a;
  inb=b;
run;
proc print data=merge_2;
run;

From what you describe the id's in the one source table start at one and in the other source table start at 1000. For the id's to match in the merge you need to transform the value in one of the tables. The correct transformation logic depends on your real data. For the example here it's just about adding 1000 to the set of id's that start at 1.

data want2;
  set want(rename=(id=id_in));
  id=id_in+1000;
  drop id_in;
run;
data merge_3;
  merge redcap(in=a) want2(in=b);
  by id;
  ina=a;
  inb=b;
run;
proc print data=merge_3;
run;

Do above examples explain the observed behavior to you and give you some guidance how to resolve the challenge?

antonbcristina · Posted 11-08-2023 02:27 PM

You mentioned that one dataset's ID start at 1 but the other starts at 1001. If you know IDs 1 and 1001 are the same individual then their observations should match. You'll have to change one of your ID variables by either adding or subtracting 1000.

Variables are not lined up correctly

Re: Variables are not lined up correctly

Re: Variables are not lined up correctly

Re: Variables are not lined up correctly

Re: Variables are not lined up correctly

Variables are not lined up correctly

Re: Variables are not lined up correctly

Re: Variables are not lined up correctly

Re: Variables are not lined up correctly

Re: Variables are not lined up correctly

Ready to join fellow brilliant minds for the SAS Hackathon?