BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
Schtroumpfette
Obsidian | Level 7

Hello,

Here is my issue.  I am linking TableA to TableB using inner join to acquire additional two variables from TableB.

 

TableA: Master cohort table has uniqueID (unique identifier, each identifier representing 1 patient), encounterID (each uniqueID can have multiple encounters, which means the same patient had multiple hospital admissions), age, and 55 other variables such as DrugA (binary), diagnosisID, date of cohort entry, etc. It was prepared linking different other tables together.

 

TableB has uniqueID (unique identifier), gender and race, two additional variables I would like to add to TableA. 

 

Common identifier for TableA and TableB is uniqueID-nothing else in common. 

 

Two issues: 

1. Due to database errors, in rare instances, some patients may have different race or gender overtime.  This is not the norm though.

2. The  inner join is generating more rows or observations that what is now in TableA.  Based on my research, an inner join can generate more observations since it repeats matching row in Table B for each row in Table A.  (a*b for each row).

The inner join works but I cannot open TableC because it is too big, but the uniqueID and EncounterID distinct count is the same as TableA. 

 

Total in Table C would be 57 variables.

Note these are huge tables:

-Table A : 1.7 million uniqueID and 3.4 million encounters (distinct counts)

-TableB:  Over 60 million unique patients, but only interested in the common 1.7 million in TableA. 

TableA has duplicate IDs and encounters and at this point, the decision is not to delete these duplicates for different reasons, and to ensure we don’t loose any diagnosed cases. 

I found I can use the Select Function to just limit to TableA and use left join. So here is my code but it is generating errors and I tried several changes based on what I found online and it is still the same issue. 

1 ACCEPTED SOLUTION

Accepted Solutions
Schtroumpfette
Obsidian | Level 7

Thanks again Tom, 

 

I will look at all of this and get back to you.  

View solution in original post

6 REPLIES 6
Tom
Super User Tom
Super User

I think you just over complicated things and that is the cause of the error messages.  For example you cannot select the variable RACE from the table referenced as alias B if RACE is not one of the variables in that table.

create TableC as
  select a.* ,b.gender ,b.race 
  from TableA as a
  left join TableB as b
  on a.uniqueID = b.uniqueID
;

Another source of error will be if GENDER and RACE already exist in the TABLEA.  Since the dataset TABLEC cannot have two variables named GENDER you will need to either rename one or remove one.

 

Also if TABLEB has repeated observations for the same value of UNIQUEID then you could end up with more observations out than exist in TABLEA.  If that is the case you need to figure out how to select only one GENDER and RACE from TABLEB before trying to merge it onto TABLEA.

 

Note that for this type of many to one merge it will normally be much easier and faster to just use normal SAS syntax instead of PROC SQL.

data tablec;
  merge tablea(in=in1) tableb(keep=uniqueid gender race);
  by uniqueid;
  if in1;
run;

 

Schtroumpfette
Obsidian | Level 7

Thanks Tom for your quick response.

 

Race and gender are in TableB only and not in TableA.   My issue is one more observations than exist in TableA.  The code works, just with more observations. 

 

Is there a code to select only one GENDER and RACE from TABLEB before trying to merge it onto TABLEA?

 

I will also try the data step code below and update you (the merge one), thanks.

 

Nisrine

 

ballardw
Super User

@Schtroumpfette wrote:

Thanks Tom for your quick response.

 

Race and gender are in TableB only and not in TableA.   My issue is one more observations than exist in TableA.  The code works, just with more observations. 

 

Is there a code to select only one GENDER and RACE from TABLEB before trying to merge it onto TABLEA?

 

I will also try the data step code below and update you (the merge one), thanks.

 

Nisrine

 


Make sure that your TAbleB only has one value of Gender and Race for a unique id first.

Without examples of your data the "best" way to do that is next to impossible. I have a data source that has changes of race, gender, ethnicity and when required to make a report will typically use either the first or last recorded(most recent) depending on the purpose of the report.

Since SQL doesn't have a nature concept of "order" I might do something like:

Proc sort data=Tableb;
   by uniqueid;
run;

data tomerge;
   set TableB;
   if first.uniqueid;
run;

and use the Tomerge set instead of TableB. If there were another variable such as date of information include that in the sort and use First for earliest or Last for lastest record in the Tomerge. Such a date variable could also use a HAVING clause with the max or min value of the date.

Schtroumpfette
Obsidian | Level 7

This code worked.  Thank you all for your help.

 


data tomerge;
set Tableb;
by uniqueID;
if first.uniqueID;
run;

Kurt_Bremser
Super User

Run this:

proc sort
  data=tableb (keep=uniqueid)
  out=test
  nodupkey
;
by uniqueid;
run;

and look at the log. If you find a message about deleted duplicates, you have your culprit.

Schtroumpfette
Obsidian | Level 7

Thanks again Tom, 

 

I will look at all of this and get back to you.  

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 6 replies
  • 7899 views
  • 2 likes
  • 4 in conversation