- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hi all
I imported an excel sheet into SAS and wanted to add additional information from my database. The idea was to add the column Race to the imported data using a table labeled RACE. I created the code, however, I am getting more lines back then the original import. The RACE table unfortunately has more than one RACE for some individuals so that's why I get more lines back than there should be. The way around that is to have a column called LineNBR which gives a count number of the races assigned to an individual and we just want to take line 1 (which we accept as the definitive race for an individual). When I run the code I get back all LineNBR 1's like I am supposed to, but I am still getting multiple races for a lot of people. I was hoping someone could tell me the reason for that. I am attaching the code I wrote and hoping some one could tell me if they can see anything wrong with what I did. Thanks in advance. (I hope I explained this sufficiently).
LIBNAME pat odbc complete="driver=SQL Server; database=Epic; server=PHSSQL2195" schema=Patient;
PROC IMPORT DATAFILE= "\\Cifs2\pop_lab$\P4P\BCBSMA\FY21\Actively Managed\Working Files\HighRisk_01MAY2020_31MAR2021.xlsx"
OUT= work.cohort REPLACE DBMS= XLSX;
RUN;
PROC SQL;
CREATE TABLE work.cohort_race AS
SELECT DISTINCT
a.*
,R.LineNBR
,CASE r.PatientRaceDSC
WHEN 'Asian' THEN 'Asian'
WHEN 'Black or African American' THEN 'Black or African American'
WHEN 'Hispanic or Latino' THEN 'Hispanic or Latino'
WHEN 'White or Caucasian' THEN 'White or Caucasian'
WHEN 'White' THEN 'White or Caucasian'
WHEN 'Declined' THEN 'Unknown'
WHEN 'Unavailable' THEN 'Unknown'
WHEN 'NULL' THEN 'Unknown'
ELSE 'Other'
END as RACE
FROM work.cohort a
LEFT JOIN Pat.Identity I ON a.MRN=I.PatientIdentityID
LEFT JOIN Pat.Race R on I.PatientID = R.PatientID
WHERE
(R.LineNBR = 1 OR R.LineNBR IS NULL)
Order By a.MRN
;quit;
Accepted Solutions
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
ANY join on a variable can create duplicates if the value is duplicated in one data set. If it is duplicated in both data sets you get multiples. Example if value A is on 3 records in set one and value A is on 2 records in set the join will generally create 6 output records.
data one;
input x $ y;
datalines;
a 1
a 2
a 3
;
data two;
input x $ z;
datalines;
a 10
a 20
;
proc sql;
create table example as
select a.x, a.y, b.z
from one as a
left join
two as b
on a.x = b.x
;
quit;
Since you are joining on something that looks like PatientId from 3 different sets, and some patients tend to repeat anything: tests, visits, diagnosis, prescriptions, then I suggest you look closely at your PatientID variables for duplicates within each set and decide if what you are doing makes sense in that scenario.
Perhaps somewhere you want Distinct patients.
And depending on your data you may also find that "Race" changes from visit to visit for the same patient. Depending on your record database and how consistent your data entry people are you may find this happening more often than you expect.
I deal with some medical testing. I have people that change race 2 or 3 times in a year. Ethnicity, sex and date of birth as well.
I have had data with the same patient id that change sex and date of birth on the same day (different tests). So there can be very sloppy data entry compounding the multiple visit/ test/ diagnosis/ prescription paradigm.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Know your data.
You have duplicates somewhere. Look at each table individually to find them.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
ChrisNZ, I do know my data. That response was less than helpful and in fact rather rude and condescending. I came here for help not not to be patronized.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Sorry you take it personally.
This is a data issue; We do not know your data..
Look at one patient for which there are duplicates in all tables and show us that data if you can't find what the issue is.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
ANY join on a variable can create duplicates if the value is duplicated in one data set. If it is duplicated in both data sets you get multiples. Example if value A is on 3 records in set one and value A is on 2 records in set the join will generally create 6 output records.
data one;
input x $ y;
datalines;
a 1
a 2
a 3
;
data two;
input x $ z;
datalines;
a 10
a 20
;
proc sql;
create table example as
select a.x, a.y, b.z
from one as a
left join
two as b
on a.x = b.x
;
quit;
Since you are joining on something that looks like PatientId from 3 different sets, and some patients tend to repeat anything: tests, visits, diagnosis, prescriptions, then I suggest you look closely at your PatientID variables for duplicates within each set and decide if what you are doing makes sense in that scenario.
Perhaps somewhere you want Distinct patients.
And depending on your data you may also find that "Race" changes from visit to visit for the same patient. Depending on your record database and how consistent your data entry people are you may find this happening more often than you expect.
I deal with some medical testing. I have people that change race 2 or 3 times in a year. Ethnicity, sex and date of birth as well.
I have had data with the same patient id that change sex and date of birth on the same day (different tests). So there can be very sloppy data entry compounding the multiple visit/ test/ diagnosis/ prescription paradigm.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content