I am trying to flag duplicates ('D'), carriers ('C') and multiple serotypes ('M') in a disease-based dataset (2005-2014) that I am working with. A duplicate is defined as a record that has a specimen collection date within 30 days from the original record with that PatientID and is the same pathogen and serotype combination. A carrier is defined as a record that has a specimen collection date (DtSpec) 31-365 days from the original record with that PatientID and is the same pathogen and serotype combination. If the record has the same pathogen, but different serotype then it would be classified as having multiple serotypes. If there are multiple records that have the same PatientID, but the Pathogen is different, then it would not be flagged. I also need to account for those records that were not serotyped or are missing DtSpec. If serotype was listed as not serotyped, then it is assumed that it is the same as the on that has a serotype. If a record is missing a DtSpec, but falls in the same year, it would get flagged as unknown (U).
PatientID | InvestigationID | Pathogen | Serotype | Year | DtSpec |
---|---|---|---|---|---|
N11674036 | CAS10828288TN01 | E. COLI | O157 | 2005 | 06/03/2005 |
N11674036 | CAS10831498TN01 | SALMONELLA | ENTERITIDIS | 2010 | |
N11734242 | CAS10856256TN01 | SHIGELLA | SONNEI | 2007 | 01/05/2007 |
N11734242 | CAS10865012TN01 | SHIGELLA | SONNEI | 2007 | |
N11770074 | CAS10867154TN01 | SALMONELLA | NOT SEROTYPED | 2013 | 03/15/2013 |
N11770074 | CAS10867158TN01 | SALMONELLA | TYPHIMURIUM | 2013 | 04/02/2013 |
N11770136 | CAS10857544TN01 | SALMONELLA | JAVIANA | 2009 | 10/02/2009 |
N11770136 | CAS10857677TN01 | SALMONELLA | BRAENDERUP | 2009 | 10/02/2009 |
N11770144 | CAS10858088TN01 | SALMONELLA | TYPHIMURIUM | 2006 | 05/01/2006 |
N11770144 | CAS10856216TN01 | SALMONELLA | TYPHIMURIUM | 2008 | 11/01/2008 |
N11770144 | CAS10859003TN01 | SALMONELLA | TYPHIMURIUM | 2008 | 12/31/2008 |
Ultimately, I would like to have a new variable (Flag) created that indicates why it has been flagged.
PatientID | InvestigationID | Pathogen | Serotype | Year | DtSpec | Flag |
---|---|---|---|---|---|---|
N11674036 | CAS10828288TN01 | E. COLI | O157 | 2005 | 06/03/2005 | |
N11674036 | CAS10831498TN01 | SALMONELLA | ENTERITIDIS | 2010 | ||
N11734242 | CAS10856256TN01 | SHIGELLA | SONNEI | 2007 | 01/05/2007 | |
N11734242 | CAS10865012TN01 | SHIGELLA | SONNEI | 2007 | U | |
N11770074 | CAS10867154TN01 | SALMONELLA | NOT SEROTYPED | 2013 | 03/15/2013 | |
N11770074 | CAS10867158TN01 | SALMONELLA | TYPHIMURIUM | 2013 | 04/02/2013 | D |
N11770136 | CAS10857544TN01 | SALMONELLA | JAVIANA | 2009 | 10/02/2009 | |
N11770136 | CAS10857677TN01 | SALMONELLA | BRAENDERUP | 2009 | 10/02/2009 | M |
N11770144 | CAS10858088TN01 | SALMONELLA | TYPHIMURIUM | 2006 | 05/01/2006 | |
N11770144 | CAS10856216TN01 | SALMONELLA | TYPHIMURIUM | 2008 | 11/01/2008 | |
N11770144 | CAS10859003TN01 | SALMONELLA | TYPHIMURIUM | 2008 | 12/31/2008 | C |
I had some code I used in Crystal Reports using a function called previous that would flag some of this. Now that I have moved my extract entirely into SAS, I'm not sure where to begin. Any help would be greatly appreciated.
sort duplicates? Sql ? using point (dataset) ?
What was the code in Chrystal reports?
So the dataset is original sorted by Pathogen=>PatientID=>DtSpec.
Someone had mentioned using Retain and Lag. I've never used those before in SAS and am reading up on them. These days, I spend more of my time coding with Proc SQL
The code in Crystal Reports only flagged some of what I need to do...
IF PREVIOUS({Case_Lab_Data_New.Patient_Local_ID})={Case_Lab_Data_New.Patient_Local_ID}
AND PREVIOUS({Case_Lab_Data_New.Condition})={Case_Lab_Data_New.Condition}
AND PREVIOUS({Case_Lab_Data_New.SeroType})<>{Case_Lab_Data_New.SeroType}
THEN "M"
ELSE IF PREVIOUS({Case_Lab_Data_New.Patient_Local_ID})={Case_Lab_Data_New.Patient_Local_ID}
AND PREVIOUS({Case_Lab_Data_New.Condition})={Case_Lab_Data_New.Condition}
AND PREVIOUS({Case_Lab_Data_New.SeroType})={Case_Lab_Data_New.SeroType}
THEN "D"
ELSE ""
I'm not sure if I am giving you the answer you really want.
Well exposed problem. This would do it. I gave the flag to every duplicate observation (not just the last one) :
data have;
infile datalines missover;
length PatientID InvestigationID Pathogen Serotype $20;
input PatientID & InvestigationID & Pathogen & Serotype & Year DtSpec :mmddyy.;
format dtSpec yymmdd10.;
datalines;
N11674036 CAS10828288TN01 E. COLI O157 2005 06/03/2005
N11674036 CAS10831498TN01 SALMONELLA ENTERITIDIS 2010
N11734242 CAS10856256TN01 SHIGELLA SONNEI 2007 01/05/2007
N11734242 CAS10865012TN01 SHIGELLA SONNEI 2007
N11770074 CAS10867154TN01 SALMONELLA NOT SEROTYPED 2013 03/15/2013
N11770074 CAS10867158TN01 SALMONELLA TYPHIMURIUM 2013 04/02/2013
N11770136 CAS10857544TN01 SALMONELLA JAVIANA 2009 10/02/2009
N11770136 CAS10857677TN01 SALMONELLA BRAENDERUP 2009 10/02/2009
N11770144 CAS10858088TN01 SALMONELLA TYPHIMURIUM 2006 05/01/2006
N11770144 CAS10856216TN01 SALMONELLA TYPHIMURIUM 2008 11/01/2008
N11770144 CAS10859003TN01 SALMONELLA TYPHIMURIUM 2008 12/31/2008
;
proc format;
value flag
0 = .
1 = "U"
2 = "M"
3 = "C"
4 = "D";
run;
proc sql;
create table want as
select
a.*,
max(
case
when a.pathogen=b.pathogen and
(a.serotype=b.serotype or a.serotype="NOT SEROTYPED" or b.serotype="NOT SEROTYPED") and
abs(intck("DAY",a.dtspec, b.dtspec)) between 1 and 30 then 4
when a.pathogen=b.pathogen and
(a.serotype=b.serotype or a.serotype="NOT SEROTYPED" or b.serotype="NOT SEROTYPED") and
abs(intck("DAY",a.dtspec, b.dtspec)) between 31 and 365 then 3
when a.pathogen=b.pathogen and a.dtspec is not missing and b.dtspec is not missing then 2
when (a.dtspec is missing or b.dtspec is missing) and a.year=b.year then 1
else 0
end) as flag format=flag.
from
have as a inner join
have as b on a.patientid=b.patientid and a.investigationid ne b.investigationid
group by a.patientid, a.investigationid, a.year, a.dtspec, a.pathogen, a.serotype
;
quit;
PG
Will do, I see a Cartesian product selection, When the data is not too big and that is not to be expected.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.