Re: Assign value to all records based upon some criteria

gdaymte · Posted 04-17-2014 04:08 PM

We are working on a case-case comparison project.

What we are wanting to do is identify clusters of Salmonella cases that have the same serotype (variable=SeroSite) and PFGE pattern (variable=PfgePrimary) within a 45 day period (variable=DtSpec) and compare those to all other records in the dataset (excluding those that match by serotype). For those cases that do match by serotype and PFGE, we are assigning a value of 1 to the variable case. If a record matches by serotype but not pattern, we would like to assign the variable case to missing. If a record does not match by serotype or PFGE pattern, we would like to assign case a value of 0. Our problem is that the cluster will always be changing, so we can't code around specific values. We would like to automate it as much as possible.

Below is what the data looks like so far...

SeroSite	PfgePrimary	DtSpec	Case
Enteritidis	XBA.0004	01/01/2014	1
Enteritidis	XBA.0004	01/18/2014	1
Enteritidis	XBA.0004	02/28/2014	1
Typhimurium	XBA.1314	02/04/2014
Enteritidis	XBA.0005	03/01/2014
Cerro	XBA.2327	02/15/2014
Typhimurium	XBA.1212	02/01/2014

This is what we need it to look like...

SeroSite	PfgePrimary	DtSpec	Case
Enteritidis	XBA.0004	01/01/2014	1
Enteritidis	XBA.0004	01/18/2014	1
Enteritidis	XBA.0004	02/28/2014	1
Typhimurium	XBA.1314	02/04/2014	0
Enteritidis	XBA.0005	03/01/2014	.
Cerro	XBA.2327	02/15/2014	0
Typhimurium	XBA.1212	02/01/2014	0

We are unsure of how we can go about doing this...We could assign a value to case if we were able to somehow assign a dummy serotype variable (CaseSero) to all records in the dataset the value of the SeroSite variable where case=1. Any suggestions would be helpful.

ballardw · Posted 04-17-2014 06:33 PM

Do you need to manually assign the range on the DTSpec to determine the 45 day period or do you look at the date range recorded for the PfgePrimary and then determine the 45 day period?

Do you do this for each Serotype/PfgePrimary pair or only select ones?
What do you do if the PfgePrimary has a range of dates greater than 45 days? And likely a few more questions about order of rules and desired output.

I think a more detailed step-by-step description of a manual process will help us answer your question for automation.

Ksharp · Posted 04-18-2014 06:13 AM

"Our problem is that the cluster will always be changing,"

I don't understand. if you already have 1 in CASE , you can fix these cluster ? i.e. give it a initial value when case=1.

gdaymte · Posted 04-18-2014 10:11 AM

We interview all Salmonella cases about their exposure history in the 7 day period before their illness. We have over 3 years worth of exposure data to look at. If a cluster of 2 or more cases is identified within 45 day period, there is a greater chance that there is a common exposure of interest.

The 45-day period is based on "Today's Date"... whatever day you run the program. If within that 45-day range a cluster of 2 or more cases (serotype/PFGE match) are found, a new dataset is output with that cluster sorted at the top and assigned case=1. If a record has the same serotype/PFGE match, but falls outside of the 45 day window, they will not be considered a part of the cluster. This is a standard way of looking at things as we are trying to identify common exposures during outbreaks.

A new dataset is generated for each and every cluster identified. So one week, there may be one cluster identified with one dataset output. In another week, there may be 10 clusters identified...so 10 datasets will be output with one cluster sorted at the top in each dataset. Whichever cluster is at the top will be assigned a value of 1 for the variable case. All other records in the dataset will need to be assigned a value of case=. or case=0. Missing will need to be those records that have the same serotype as the case but different PFGE. Zeros will be those with a different serotype than the case.

Each cluster will be evaluated separately. We will compare exposures for those cases to all of the other records in the dataset (excluding those records that have the same serotype as the case but different PFGE). Everything else in the dataset, which is a moving target of records over the last 3 years, will be considered the "control" group. I hope this explains a bit better.

The part we are need help with is the assignment of missing and zeros to the variable case. Everything else has been completed. We need to figure out a way to do this based upon what the serotype and PFGE is for the case.

Ksharp · Posted 04-18-2014 10:53 AM

Still not get you, since I am not in pharmaceutical field. You said there is a 45 days window, but in your original data ,date is over that window for the category marked 1.

data have;
infile cards truncover expandtabs;
input serosite : $20. pfgeprimary : $20. dtspec : $20. case;
cards;
Enteritidis     XBA.0004     01/01/2014     1
Enteritidis     XBA.0004     01/18/2014     1
Enteritidis     XBA.0004     02/28/2014     1
Typhimurium     XBA.1314     02/04/2014     
Enteritidis     XBA.0005     03/01/2014     
Cerro     XBA.2327     02/15/2014     
Typhimurium     XBA.1212     02/01/2014     
;
run;
proc sql;
create table want as
 select a.serosite,a.pfgeprimary,a.dtspec, case 
                                       when a.serosite eq b.serosite and a.pfgeprimary eq b.pfgeprimary then 1
                                                when a.serosite ne b.serosite and a.pfgeprimary ne b.pfgeprimary then 0
                                       when a.serosite eq b.serosite and a.pfgeprimary ne b.pfgeprimary then .
                                       else 99999
                                              end as case
  from have as a,(select distinct serosite,pfgeprimary from have as h where h.case eq 1) as b ;
  quit;

Xia Keshan

Message was edited by: xia keshan

gdaymte · Posted 04-18-2014 12:37 PM

Sorry I just made a typo...The 3rd Enteritidis should say 01/28/2014. The year on the 4th Enteritidis should be 2013.

Assign value to all records based upon some criteria