Solved: Re: Merging by two variables troubleshoot

bretthouston · Posted 03-02-2020 04:07 PM

Hello,

I am trying to merge two tables (example below) by two variables (PatientID and AdmissionID). When I do so, the data in one of the variables (patientWeight) is dropped. If I merge exclusively on PatientID however, this isn't the case.

Here is an example of table 1 (merged_txainformation):

PatientID AdmissionID txaAdministrationType

1 HSC1 Infusion

1 HSC2 Infusion

2 34253 Bolus

3 32223 Infusion

Here is an example of table 2 (weightinfo):

PatientID AdmissionID PatientWeight

1 HSC1 68

1 HSC2 78

1 HSC3 110

2 34253 96

3 32223 .

Ideally, I would like the table to look like the following:

PatientID AdmissionID txaAdministrationType PatientWeight

1 HSC1 Infusion 68

1 HSC2 Infusion 78

2 34253 Bolus 96

3 32223 Infusion .

I've tried the following code:

data mergedTXAweight;
merge merged_txaInformation (in=TXA)
weightinfo (in=Ottweight);
by PatientID AdmissionID;
if TXA then
output;
run;

If I merge exclusively by PatientID then the values in PatientWeight do result in the merged table, but so does the row with PatientID #1 and AdmissionID HSC3 (I only want the people in the merged_txaInformation table). If I merge by PatientID and AdmissionID, then the values in PatientWeight are all missing.

Of note, PatientID is a numeric variable (length = 😎 in both tables, and AdmissionID is a character variable (length = 10) in both tables. I'm not sure if this is the issue?

Thanks in advance!

Brett

bretthouston · Posted 03-02-2020 04:49 PM

Thanks for the reply - I really appreciate it! As best I can tell they match? Here is an excerpt from the proc contents I've run on both of the tables.

Is there a different way for me to figure out if there are non-printing characters? Could it be because of the informats or label?

View solution in original post

ballardw · Posted 03-02-2020 04:35 PM

@bretthouston wrote:

Hello,

I am trying to merge two tables (example below) by two variables (PatientID and AdmissionID). When I do so, the data in one of the variables (patientWeight) is dropped. If I merge exclusively on PatientID however, this isn't the case.

Here is an example of table 1 (merged_txainformation):

PatientID AdmissionID txaAdministrationType

1 HSC1 Infusion

1 HSC2 Infusion

2 34253 Bolus

3 32223 Infusion

Here is an example of table 2 (weightinfo):

PatientID AdmissionID PatientWeight

1 HSC1 68

1 HSC2 78

1 HSC3 110

2 34253 96

3 32223 .

Ideally, I would like the table to look like the following:

PatientID AdmissionID txaAdministrationType PatientWeight

1 HSC1 Infusion 68

1 HSC2 Infusion 78

2 34253 Bolus 96

3 32223 Infusion .

I've tried the following code:

data mergedTXAweight;
merge merged_txaInformation (in=TXA)
weightinfo (in=Ottweight);
by PatientID AdmissionID;
if TXA then
output;
run;

If I merge exclusively by PatientID then the values in PatientWeight do result in the merged table, but so does the row with PatientID #1 and AdmissionID HSC3 (I only want the people in the merged_txaInformation table). If I merge by PatientID and AdmissionID, then the values in PatientWeight are all missing.

Of note, PatientID is a numeric variable (length = 😎 in both tables, and AdmissionID is a character variable (length = 10) in both tables. I'm not sure if this is the issue?

Thanks in advance!

Brett

What does this have to do with what is going on? Is this a different data set?

If I merge exclusively by PatientID then the values in PatientWeight do result in the merged table, but so does the row with PatientID #1 and AdmissionID HSC3 (I only want the people in the merged_txaInformation table). If I merge by PatientID and AdmissionID, then the values in PatientWeight are all missing.

If you made a separate data set merged_txaInformation that is different than the set you used above then you may have done something such that the AdmissionId values do not align.

I suggest that you try running your code without the "if TXA then output; " and see if you have data where the PatientId and Admission values look a bit odd sequentially. Note that if one of your data sets somehow has leading spaces in the AdmissionId that "HSC3" will not be identical to " HSC3" and the merge doesn't happen as you expect.

Tom · Posted 03-02-2020 04:39 PM

Your data works find for me. Are you sure the ADMISSIONID values actually match? Does one have non-printing characters, such as TAB, or leading spaces, case differences?

data one ;
 input PatientID  AdmissionID $ txaAdministrationType :$20.;
cards;
1                    HSC1             Infusion
1                    HSC2             Infusion
2                    34253             Bolus
3                   32223              Infusion
;

data two;
  input PatientID  AdmissionID $   PatientWeight ;
cards;
1                    HSC1                  68
1                    HSC2                  78
1                    HSC3                  110
2                    34253                  96
3                    32223                  .
;

data want;
 merge one (in=in1) two (in=in2) ;
 by patientid admissionid ;
 if in1;
run;

proc print; run;

                                    txa
       Patient    Admission    Administration    Patient
Obs       ID         ID             Type          Weight

 1        1         HSC1          Infusion          68
 2        1         HSC2          Infusion          78
 3        2         34253         Bolus             96
 4        3         32223         Infusion           .

bretthouston · Posted 03-02-2020 04:49 PM

Thanks for the reply - I really appreciate it! As best I can tell they match? Here is an excerpt from the proc contents I've run on both of the tables.

Is there a different way for me to figure out if there are non-printing characters? Could it be because of the informats or label?

bretthouston · Posted 03-02-2020 05:37 PM

I compressed the values in one of the AdmissionID tables, and now it is merging properly. Presumably there we leading characters I couldn't see. Thanks for the advice!

ballardw · Posted 03-02-2020 06:19 PM

@bretthouston wrote:

I compressed the values in one of the AdmissionID tables, and now it is merging properly. Presumably there we leading characters I couldn't see. Thanks for the advice!

I'm glad you solved the problem. In that case I would suggest changing your marked solution from my hints to your compressed data solution.

I was going to mention that a character format shorter than the length of a character variable is also something to be careful with.

What will happen when you either print data or use a table viewer is that the last character(s) may not appear. Which visually may lead you to believe that the values are identical when they actually are not.

data example;
   input x $ 1-10;
   format x $9.;
datalines;
1234567890
;

Proc print data=example;
   title 'Default format $9.';
run;
/* compare with*/

proc print data=example;
   title 'Longer format $10.';
   format x $10.;
run;  title;

So if I were attempting to merge with another data set where X was actually only 9 characters then the merge would come out wonky in some form.

There are a number of functions you can use to find specific sorts of characters that might not be printable. ANYCTRL, ANYSPACE will return the first position of either a control character (a number of odd things) or space type character (blank, tab, carriage return, line feed or form feed) or zero if not found.

A brief example:

data example;
   input x $ 1-10;
   if anyspace(x)>0 then put 'Some space found in x: ' x=;
   else put 'No space found in x';
datalines;
12345 7890
abcdefghij
;

Tom · Posted 03-02-2020 06:16 PM

This is an example of why I always try to remove formats from character variables. You have define ADMISSIONID is length 10 but assigned a display format that only has a width of 9. So your printouts will look the same even when the 10tth character is different.

To remove a format from a variable use the FORMAT statement that lists the variable(s) without any format specification.

data want;
  set have;
  format _character_ ;
run;

Formats convert values to text. They are used to tell SAS how to display the variable's values. SAS already knows how to display character variables, so it does not need to have any special format attached.

SAS Innovate 2025: Save the Date