BookmarkSubscribeRSS Feed
gdaymte
Obsidian | Level 7

I am trying to flag duplicates ('D'), carriers ('C') and multiple serotypes ('M') in a disease-based dataset (2005-2014) that I am working with. A duplicate is defined as a record that has a specimen collection date within 30 days from the original record with that PatientID and is the same pathogen and serotype combination. A carrier is defined as a record that has a specimen collection date (DtSpec) 31-365 days from the original record with that PatientID and is the same pathogen and serotype combination. If the record has the same pathogen, but different serotype then it would be classified as having multiple serotypes. If there are multiple records that have the same PatientID, but the Pathogen is different, then it would not be flagged. I also need to account for those records that were not serotyped or are missing DtSpec. If serotype was listed as not serotyped, then it is assumed that it is the same as the on that has a serotype. If a record is missing a DtSpec, but falls in the same year, it would get flagged as unknown (U).

PatientIDInvestigationIDPathogenSerotypeYearDtSpec
N11674036CAS10828288TN01E. COLIO157200506/03/2005
N11674036CAS10831498TN01SALMONELLAENTERITIDIS2010
N11734242CAS10856256TN01SHIGELLASONNEI200701/05/2007
N11734242CAS10865012TN01SHIGELLASONNEI2007
N11770074CAS10867154TN01SALMONELLANOT SEROTYPED201303/15/2013
N11770074CAS10867158TN01SALMONELLATYPHIMURIUM201304/02/2013
N11770136CAS10857544TN01SALMONELLAJAVIANA200910/02/2009
N11770136CAS10857677TN01SALMONELLABRAENDERUP200910/02/2009
N11770144CAS10858088TN01SALMONELLATYPHIMURIUM200605/01/2006
N11770144CAS10856216TN01SALMONELLATYPHIMURIUM200811/01/2008
N11770144CAS10859003TN01SALMONELLATYPHIMURIUM200812/31/2008

Ultimately, I would like to have a new variable (Flag) created that indicates why it has been flagged.

PatientIDInvestigationIDPathogenSerotypeYearDtSpecFlag
N11674036CAS10828288TN01E. COLIO157200506/03/2005
N11674036CAS10831498TN01SALMONELLAENTERITIDIS2010
N11734242CAS10856256TN01SHIGELLASONNEI200701/05/2007
N11734242CAS10865012TN01SHIGELLASONNEI2007U
N11770074CAS10867154TN01SALMONELLANOT SEROTYPED201303/15/2013
N11770074CAS10867158TN01SALMONELLATYPHIMURIUM201304/02/2013D
N11770136CAS10857544TN01SALMONELLAJAVIANA200910/02/2009
N11770136CAS10857677TN01SALMONELLABRAENDERUP200910/02/2009M
N11770144CAS10858088TN01SALMONELLATYPHIMURIUM200605/01/2006
N11770144CAS10856216TN01SALMONELLATYPHIMURIUM200811/01/2008
N11770144CAS10859003TN01SALMONELLATYPHIMURIUM200812/31/2008C

I had some code I used in Crystal Reports using a function called previous that would flag some of this. Now that I have moved my extract entirely into SAS, I'm not sure where to begin. Any help would be greatly appreciated.

4 REPLIES 4
jakarman
Barite | Level 11

sort duplicates? Sql ? using point (dataset) ?

What was the code in Chrystal reports?

---->-- ja karman --<-----
gdaymte
Obsidian | Level 7

So the dataset is original sorted by Pathogen=>PatientID=>DtSpec.

Someone had mentioned using Retain and Lag. I've never used those before in SAS and am reading up on them. These days, I spend more of my time coding with Proc SQL

The code in Crystal Reports only flagged some of what I need to do...

IF PREVIOUS({Case_Lab_Data_New.Patient_Local_ID})={Case_Lab_Data_New.Patient_Local_ID}

      AND PREVIOUS({Case_Lab_Data_New.Condition})={Case_Lab_Data_New.Condition}

      AND PREVIOUS({Case_Lab_Data_New.SeroType})<>{Case_Lab_Data_New.SeroType}

          THEN "M"

ELSE IF PREVIOUS({Case_Lab_Data_New.Patient_Local_ID})={Case_Lab_Data_New.Patient_Local_ID}

     AND PREVIOUS({Case_Lab_Data_New.Condition})={Case_Lab_Data_New.Condition}

     AND PREVIOUS({Case_Lab_Data_New.SeroType})={Case_Lab_Data_New.SeroType}

          THEN "D"

ELSE ""

I'm not sure if I am giving you the answer you really want.

PGStats
Opal | Level 21

Well exposed problem. This would do it. I gave the flag to every duplicate observation (not just the last one) :

data have;

infile datalines missover;

length PatientID InvestigationID Pathogen Serotype $20;

input PatientID & InvestigationID & Pathogen & Serotype & Year DtSpec :mmddyy.;

format dtSpec yymmdd10.;

datalines;

N11674036  CAS10828288TN01  E. COLI  O157  2005  06/03/2005

N11674036  CAS10831498TN01  SALMONELLA  ENTERITIDIS  2010 

N11734242  CAS10856256TN01  SHIGELLA  SONNEI  2007  01/05/2007

N11734242  CAS10865012TN01  SHIGELLA  SONNEI  2007 

N11770074  CAS10867154TN01  SALMONELLA  NOT SEROTYPED  2013  03/15/2013

N11770074  CAS10867158TN01  SALMONELLA  TYPHIMURIUM  2013  04/02/2013

N11770136  CAS10857544TN01  SALMONELLA  JAVIANA  2009  10/02/2009

N11770136  CAS10857677TN01  SALMONELLA  BRAENDERUP  2009  10/02/2009

N11770144  CAS10858088TN01  SALMONELLA  TYPHIMURIUM  2006  05/01/2006

N11770144  CAS10856216TN01  SALMONELLA  TYPHIMURIUM  2008  11/01/2008

N11770144  CAS10859003TN01  SALMONELLA  TYPHIMURIUM  2008  12/31/2008

;

proc format;

value flag

0 = .

1 = "U"

2 = "M"

3 = "C"

4 = "D";

run;

proc sql;

create table want as

select

    a.*,

    max(

    case

        when a.pathogen=b.pathogen and

            (a.serotype=b.serotype or a.serotype="NOT SEROTYPED" or b.serotype="NOT SEROTYPED") and

             abs(intck("DAY",a.dtspec, b.dtspec)) between 1 and 30 then 4

        when a.pathogen=b.pathogen and

            (a.serotype=b.serotype or a.serotype="NOT SEROTYPED" or b.serotype="NOT SEROTYPED") and

             abs(intck("DAY",a.dtspec, b.dtspec)) between 31 and 365 then 3

        when a.pathogen=b.pathogen and a.dtspec is not missing and b.dtspec is not missing then 2

        when (a.dtspec is missing or b.dtspec is missing) and a.year=b.year then 1

        else 0

    end) as flag format=flag.

from

    have as a inner join

    have as b on a.patientid=b.patientid and a.investigationid ne b.investigationid

group by a.patientid, a.investigationid, a.year, a.dtspec, a.pathogen, a.serotype

;

quit;

PG

PG
jakarman
Barite | Level 11

Will do, I see a Cartesian product selection,  When the data is not too big and that is not to be expected.

---->-- ja karman --<-----

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 4 replies
  • 862 views
  • 0 likes
  • 3 in conversation