Hi all
I imported an excel sheet into SAS and wanted to add additional information from my database. The idea was to add the column Race to the imported data using a table labeled RACE. I created the code, however, I am getting more lines back then the original import. The RACE table unfortunately has more than one RACE for some individuals so that's why I get more lines back than there should be. The way around that is to have a column called LineNBR which gives a count number of the races assigned to an individual and we just want to take line 1 (which we accept as the definitive race for an individual). When I run the code I get back all LineNBR 1's like I am supposed to, but I am still getting multiple races for a lot of people. I was hoping someone could tell me the reason for that. I am attaching the code I wrote and hoping some one could tell me if they can see anything wrong with what I did. Thanks in advance. (I hope I explained this sufficiently).
LIBNAME pat odbc complete="driver=SQL Server; database=Epic; server=PHSSQL2195" schema=Patient;
PROC IMPORT DATAFILE= "\\Cifs2\pop_lab$\P4P\BCBSMA\FY21\Actively Managed\Working Files\HighRisk_01MAY2020_31MAR2021.xlsx"
OUT= work.cohort REPLACE DBMS= XLSX;
RUN;
PROC SQL;
CREATE TABLE work.cohort_race AS
SELECT DISTINCT
a.*
,R.LineNBR
,CASE r.PatientRaceDSC
WHEN 'Asian' THEN 'Asian'
WHEN 'Black or African American' THEN 'Black or African American'
WHEN 'Hispanic or Latino' THEN 'Hispanic or Latino'
WHEN 'White or Caucasian' THEN 'White or Caucasian'
WHEN 'White' THEN 'White or Caucasian'
WHEN 'Declined' THEN 'Unknown'
WHEN 'Unavailable' THEN 'Unknown'
WHEN 'NULL' THEN 'Unknown'
ELSE 'Other'
END as RACE
FROM work.cohort a
LEFT JOIN Pat.Identity I ON a.MRN=I.PatientIdentityID
LEFT JOIN Pat.Race R on I.PatientID = R.PatientID
WHERE
(R.LineNBR = 1 OR R.LineNBR IS NULL)
Order By a.MRN
;quit;
ANY join on a variable can create duplicates if the value is duplicated in one data set. If it is duplicated in both data sets you get multiples. Example if value A is on 3 records in set one and value A is on 2 records in set the join will generally create 6 output records.
data one; input x $ y; datalines; a 1 a 2 a 3 ; data two; input x $ z; datalines; a 10 a 20 ; proc sql; create table example as select a.x, a.y, b.z from one as a left join two as b on a.x = b.x ; quit;
Since you are joining on something that looks like PatientId from 3 different sets, and some patients tend to repeat anything: tests, visits, diagnosis, prescriptions, then I suggest you look closely at your PatientID variables for duplicates within each set and decide if what you are doing makes sense in that scenario.
Perhaps somewhere you want Distinct patients.
And depending on your data you may also find that "Race" changes from visit to visit for the same patient. Depending on your record database and how consistent your data entry people are you may find this happening more often than you expect.
I deal with some medical testing. I have people that change race 2 or 3 times in a year. Ethnicity, sex and date of birth as well.
I have had data with the same patient id that change sex and date of birth on the same day (different tests). So there can be very sloppy data entry compounding the multiple visit/ test/ diagnosis/ prescription paradigm.
Know your data.
You have duplicates somewhere. Look at each table individually to find them.
ChrisNZ, I do know my data. That response was less than helpful and in fact rather rude and condescending. I came here for help not not to be patronized.
Sorry you take it personally.
This is a data issue; We do not know your data..
Look at one patient for which there are duplicates in all tables and show us that data if you can't find what the issue is.
ANY join on a variable can create duplicates if the value is duplicated in one data set. If it is duplicated in both data sets you get multiples. Example if value A is on 3 records in set one and value A is on 2 records in set the join will generally create 6 output records.
data one; input x $ y; datalines; a 1 a 2 a 3 ; data two; input x $ z; datalines; a 10 a 20 ; proc sql; create table example as select a.x, a.y, b.z from one as a left join two as b on a.x = b.x ; quit;
Since you are joining on something that looks like PatientId from 3 different sets, and some patients tend to repeat anything: tests, visits, diagnosis, prescriptions, then I suggest you look closely at your PatientID variables for duplicates within each set and decide if what you are doing makes sense in that scenario.
Perhaps somewhere you want Distinct patients.
And depending on your data you may also find that "Race" changes from visit to visit for the same patient. Depending on your record database and how consistent your data entry people are you may find this happening more often than you expect.
I deal with some medical testing. I have people that change race 2 or 3 times in a year. Ethnicity, sex and date of birth as well.
I have had data with the same patient id that change sex and date of birth on the same day (different tests). So there can be very sloppy data entry compounding the multiple visit/ test/ diagnosis/ prescription paradigm.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.