Hello,
Here is my issue. I am linking TableA to TableB using inner join to acquire additional two variables from TableB.
TableA: Master cohort table has uniqueID (unique identifier, each identifier representing 1 patient), encounterID (each uniqueID can have multiple encounters, which means the same patient had multiple hospital admissions), age, and 55 other variables such as DrugA (binary), diagnosisID, date of cohort entry, etc. It was prepared linking different other tables together.
TableB has uniqueID (unique identifier), gender and race, two additional variables I would like to add to TableA.
Common identifier for TableA and TableB is uniqueID-nothing else in common.
Two issues:
1. Due to database errors, in rare instances, some patients may have different race or gender overtime. This is not the norm though.
2. The inner join is generating more rows or observations that what is now in TableA. Based on my research, an inner join can generate more observations since it repeats matching row in Table B for each row in Table A. (a*b for each row).
The inner join works but I cannot open TableC because it is too big, but the uniqueID and EncounterID distinct count is the same as TableA.
Total in Table C would be 57 variables.
Note these are huge tables:
-Table A : 1.7 million uniqueID and 3.4 million encounters (distinct counts)
-TableB: Over 60 million unique patients, but only interested in the common 1.7 million in TableA.
TableA has duplicate IDs and encounters and at this point, the decision is not to delete these duplicates for different reasons, and to ensure we don’t loose any diagnosed cases.
I found I can use the Select Function to just limit to TableA and use left join. So here is my code but it is generating errors and I tried several changes based on what I found online and it is still the same issue.
proc sql;
create TableC as
select a.*, b.gender, b.race
from TableA as a left join (select distinct uniqueID from TableB) TableB as b
on a.uniqueID = b.uniqueID;
quit;
Your help will be much appreciated!
Thanks,
Shtroumpfette
I think you just over complicated things and that is the cause of the error messages. For example you cannot select the variable RACE from the table referenced as alias B if RACE is not one of the variables in that table.
create TableC as
select a.* ,b.gender ,b.race
from TableA as a
left join TableB as b
on a.uniqueID = b.uniqueID
;
Another source of error will be if GENDER and RACE already exist in the TABLEA. Since the dataset TABLEC cannot have two variables named GENDER you will need to either rename one or remove one.
Also if TABLEB has repeated observations for the same value of UNIQUEID then you could end up with more observations out than exist in TABLEA. If that is the case you need to figure out how to select only one GENDER and RACE from TABLEB before trying to merge it onto TABLEA.
Note that for this type of many to one merge it will normally be much easier and faster to just use normal SAS syntax instead of PROC SQL.
data tablec;
merge tablea(in=in1) tableb(keep=uniqueid gender race);
by uniqueid;
if in1;
run;
Thanks Tom for your quick response.
Race and gender are in TableB only and not in TableA. My issue is one more observations than exist in TableA. The code works, just with more observations.
Is there a code to select only one GENDER and RACE from TABLEB before trying to merge it onto TABLEA?
I will also try the data step code below and update you (the merge one), thanks.
Nisrine
@Schtroumpfette wrote:
Thanks Tom for your quick response.
Race and gender are in TableB only and not in TableA. My issue is one more observations than exist in TableA. The code works, just with more observations.
Is there a code to select only one GENDER and RACE from TABLEB before trying to merge it onto TABLEA?
I will also try the data step code below and update you (the merge one), thanks.
Nisrine
Make sure that your TAbleB only has one value of Gender and Race for a unique id first.
Without examples of your data the "best" way to do that is next to impossible. I have a data source that has changes of race, gender, ethnicity and when required to make a report will typically use either the first or last recorded(most recent) depending on the purpose of the report.
Since SQL doesn't have a nature concept of "order" I might do something like:
Proc sort data=Tableb; by uniqueid; run; data tomerge; set TableB; if first.uniqueid; run;
and use the Tomerge set instead of TableB. If there were another variable such as date of information include that in the sort and use First for earliest or Last for lastest record in the Tomerge. Such a date variable could also use a HAVING clause with the max or min value of the date.
This code worked. Thank you all for your help.
data tomerge;
set Tableb;
by uniqueID;
if first.uniqueID;
run;
Run this:
proc sort
data=tableb (keep=uniqueid)
out=test
nodupkey
;
by uniqueid;
run;
and look at the log. If you find a message about deleted duplicates, you have your culprit.
Thanks again Tom,
I will look at all of this and get back to you.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.