Solved: Re: Inner Join creating more rows than Master Table

Schtroumpfette · Posted 01-09-2023 11:06 AM

Hello,

Here is my issue. I am linking TableA to TableB using inner join to acquire additional two variables from TableB.

TableA: Master cohort table has uniqueID (unique identifier, each identifier representing 1 patient), encounterID (each uniqueID can have multiple encounters, which means the same patient had multiple hospital admissions), age, and 55 other variables such as DrugA (binary), diagnosisID, date of cohort entry, etc. It was prepared linking different other tables together.

TableB has uniqueID (unique identifier), gender and race, two additional variables I would like to add to TableA.

Common identifier for TableA and TableB is uniqueID-nothing else in common.

Two issues:

1. Due to database errors, in rare instances, some patients may have different race or gender overtime. This is not the norm though.

2. The inner join is generating more rows or observations that what is now in TableA. Based on my research, an inner join can generate more observations since it repeats matching row in Table B for each row in Table A. (a*b for each row).

The inner join works but I cannot open TableC because it is too big, but the uniqueID and EncounterID distinct count is the same as TableA.

Total in Table C would be 57 variables.

Note these are huge tables:

-Table A : 1.7 million uniqueID and 3.4 million encounters (distinct counts)

-TableB: Over 60 million unique patients, but only interested in the common 1.7 million in TableA.

TableA has duplicate IDs and encounters and at this point, the decision is not to delete these duplicates for different reasons, and to ensure we don’t loose any diagnosed cases.

I found I can use the Select Function to just limit to TableA and use left join. So here is my code but it is generating errors and I tried several changes based on what I found online and it is still the same issue.

Schtroumpfette · Posted 01-09-2023 01:21 PM

Thanks again Tom,

I will look at all of this and get back to you.

View solution in original post

Tom · Posted 01-09-2023 11:22 AM

I think you just over complicated things and that is the cause of the error messages. For example you cannot select the variable RACE from the table referenced as alias B if RACE is not one of the variables in that table.

create TableC as
  select a.* ,b.gender ,b.race 
  from TableA as a
  left join TableB as b
  on a.uniqueID = b.uniqueID
;

Another source of error will be if GENDER and RACE already exist in the TABLEA. Since the dataset TABLEC cannot have two variables named GENDER you will need to either rename one or remove one.

Also if TABLEB has repeated observations for the same value of UNIQUEID then you could end up with more observations out than exist in TABLEA. If that is the case you need to figure out how to select only one GENDER and RACE from TABLEB before trying to merge it onto TABLEA.

Note that for this type of many to one merge it will normally be much easier and faster to just use normal SAS syntax instead of PROC SQL.

data tablec;
  merge tablea(in=in1) tableb(keep=uniqueid gender race);
  by uniqueid;
  if in1;
run;

Schtroumpfette · Posted 01-09-2023 11:31 AM

Thanks Tom for your quick response.

Race and gender are in TableB only and not in TableA. My issue is one more observations than exist in TableA. The code works, just with more observations.

Is there a code to select only one GENDER and RACE from TABLEB before trying to merge it onto TABLEA?

I will also try the data step code below and update you (the merge one), thanks.

Nisrine

ballardw · Posted 01-09-2023 01:16 PM

@Schtroumpfette wrote:

Thanks Tom for your quick response.

Race and gender are in TableB only and not in TableA. My issue is one more observations than exist in TableA. The code works, just with more observations.

Is there a code to select only one GENDER and RACE from TABLEB before trying to merge it onto TABLEA?

I will also try the data step code below and update you (the merge one), thanks.

Nisrine

Make sure that your TAbleB only has one value of Gender and Race for a unique id first.

Without examples of your data the "best" way to do that is next to impossible. I have a data source that has changes of race, gender, ethnicity and when required to make a report will typically use either the first or last recorded(most recent) depending on the purpose of the report.

Since SQL doesn't have a nature concept of "order" I might do something like:

Proc sort data=Tableb;
   by uniqueid;
run;

data tomerge;
   set TableB;
   if first.uniqueid;
run;

and use the Tomerge set instead of TableB. If there were another variable such as date of information include that in the sort and use First for earliest or Last for lastest record in the Tomerge. Such a date variable could also use a HAVING clause with the max or min value of the date.

Schtroumpfette · Posted 01-27-2023 09:46 PM

This code worked. Thank you all for your help.

data tomerge;
set Tableb;
by uniqueID;
if first.uniqueID;
run;

Kurt_Bremser · Posted 01-09-2023 01:20 PM

Run this:

proc sort
  data=tableb (keep=uniqueid)
  out=test
  nodupkey
;
by uniqueid;
run;

and look at the log. If you find a message about deleted duplicates, you have your culprit.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Schtroumpfette · Posted 01-09-2023 01:21 PM

Thanks again Tom,

I will look at all of this and get back to you.

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away