We are working on a case-case comparison project.
We want to identify clusters of Salmonella cases that share the same serotype (variable=SeroSite) and PFGE pattern (variable=PfgePrimary) within a 45-day period (variable=DtSpec) and compare them with all other records in the dataset, excluding records that match the cluster by serotype but not by PFGE pattern. Cases that match on both serotype and PFGE pattern are assigned a value of 1 for the variable case. If a record matches by serotype but not by pattern, we would like to set case to missing. If a record does not match by serotype, we would like to set case to 0. Our problem is that the cluster will always be changing, so we can't hard-code specific values. We would like to automate this as much as possible.
Below is what the data looks like so far...
SeroSite | PfgePrimary | DtSpec | Case |
---|---|---|---|
Enteritidis | XBA.0004 | 01/01/2014 | 1 |
Enteritidis | XBA.0004 | 01/18/2014 | 1 |
Enteritidis | XBA.0004 | 02/28/2014 | 1 |
Typhimurium | XBA.1314 | 02/04/2014 | |
Enteritidis | XBA.0005 | 03/01/2014 | |
Cerro | XBA.2327 | 02/15/2014 | |
Typhimurium | XBA.1212 | 02/01/2014 | |
This is what we need it to look like...
SeroSite | PfgePrimary | DtSpec | Case |
---|---|---|---|
Enteritidis | XBA.0004 | 01/01/2014 | 1 |
Enteritidis | XBA.0004 | 01/18/2014 | 1 |
Enteritidis | XBA.0004 | 02/28/2014 | 1 |
Typhimurium | XBA.1314 | 02/04/2014 | 0 |
Enteritidis | XBA.0005 | 03/01/2014 | . |
Cerro | XBA.2327 | 02/15/2014 | 0 |
Typhimurium | XBA.1212 | 02/01/2014 | 0 |
We are unsure how to go about this. We could assign a value to case if we could somehow add a dummy serotype variable (CaseSero) to every record in the dataset, holding the value of SeroSite from the records where case=1. Any suggestions would be helpful.
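One way to carry the cluster's serotype to every record, as described above, is to store it in macro variables and compare each row against them. This is only a sketch: the data set names `have` and `want` and the macro variable `CasePfge` are hypothetical, and it assumes exactly one serotype/PFGE pair is currently flagged with case=1.

```sas
/* Sketch only: HAVE, WANT and CasePfge are hypothetical names.             */
/* CASE is documented as a reserved word in PROC SQL, so it is renamed here. */
proc sql noprint;
  select distinct SeroSite, PfgePrimary
    into :CaseSero trimmed, :CasePfge trimmed
    from have(rename=(case=CaseFlag))
    where CaseFlag = 1;
quit;

data want;
  set have;
  /* same serotype and same pattern: cluster member */
  if SeroSite = "&CaseSero" and PfgePrimary = "&CasePfge" then case = 1;
  /* same serotype, different pattern: excluded from the comparison */
  else if SeroSite = "&CaseSero" then case = .;
  /* different serotype: control */
  else case = 0;
run;
```

If no record has case=1, the macro variables are never created and the DATA step will not resolve them, so a check on that condition would be needed in production code.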
Do you need to manually assign the range on DtSpec to determine the 45-day period, or do you look at the date range recorded for the PfgePrimary and then determine the 45-day period?
Do you do this for each Serotype/PfgePrimary pair or only select ones?
What do you do if the PfgePrimary has a range of dates greater than 45 days? And likely a few more questions about order of rules and desired output.
I think a more detailed step-by-step description of a manual process will help us answer your question for automation.
"Our problem is that the cluster will always be changing,"
I don't understand. If you already have 1 in CASE, can't you fix these clusters? That is, give it an initial value when case=1.
We interview all Salmonella cases about their exposure history in the 7-day period before their illness. We have over 3 years' worth of exposure data to look at. If a cluster of 2 or more cases is identified within a 45-day period, there is a greater chance that there is a common exposure of interest.
The 45-day period is based on "Today's Date", that is, whatever day you run the program. If a cluster of 2 or more cases (serotype/PFGE match) is found within that 45-day range, a new dataset is output with that cluster sorted at the top and assigned case=1. If a record has the same serotype/PFGE match but falls outside of the 45-day window, it will not be considered a part of the cluster. This is a standard way of looking at things, as we are trying to identify common exposures during outbreaks.
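The window rule above can be sketched in PROC SQL. This is an illustration under assumptions, not the poster's actual program: it assumes DtSpec is stored as a numeric SAS date (not text), and the data set names `have` and `clusters` and the column `n_cases` are hypothetical.

```sas
/* Sketch only: HAVE, CLUSTERS and n_cases are hypothetical names. */
/* Assumes DtSpec is a numeric SAS date value.                     */
proc sql;
  create table clusters as
  select SeroSite, PfgePrimary, count(*) as n_cases
  from have
  /* keep only specimens from the 45 days up to the run date */
  where DtSpec between today() - 45 and today()
  group by SeroSite, PfgePrimary
  /* a cluster needs at least 2 matching cases */
  having count(*) >= 2;
quit;
```

Each row of the result is one serotype/PFGE pair that qualifies as a cluster on the day the program runs.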
A new dataset is generated for each and every cluster identified. So one week, there may be one cluster identified with one dataset output. In another week, there may be 10 clusters identified...so 10 datasets will be output with one cluster sorted at the top in each dataset. Whichever cluster is at the top will be assigned a value of 1 for the variable case. All other records in the dataset will need to be assigned a value of case=. or case=0. Missing will need to be those records that have the same serotype as the case but different PFGE. Zeros will be those with a different serotype than the case.
Each cluster will be evaluated separately. We will compare exposures for those cases to all of the other records in the dataset (excluding those records that have the same serotype as the case but different PFGE). Everything else in the dataset, which is a moving target of records over the last 3 years, will be considered the "control" group. I hope this explains a bit better.
The part we need help with is the assignment of missing and zero values to the variable case. Everything else has been completed. We need to figure out a way to do this based on the serotype and PFGE pattern of the case.
I still don't follow you, since I am not in the pharmaceutical field. You said there is a 45-day window, but in your original data the dates for the category marked 1 span more than that window.

    data have;
      infile cards truncover expandtabs;
      /* read DtSpec as a SAS date rather than as text */
      input serosite : $20. pfgeprimary : $20. dtspec : mmddyy10. case;
      format dtspec mmddyy10.;
    cards;
    Enteritidis XBA.0004 01/01/2014 1
    Enteritidis XBA.0004 01/18/2014 1
    Enteritidis XBA.0004 02/28/2014 1
    Typhimurium XBA.1314 02/04/2014
    Enteritidis XBA.0005 03/01/2014
    Cerro XBA.2327 02/15/2014
    Typhimurium XBA.1212 02/01/2014
    ;
    run;

    /* Assumes exactly one serotype/PFGE pair has case=1; with several  */
    /* pairs the join below would repeat every record once per pair.    */
    /* CASE is documented as a reserved word in PROC SQL, so the flag   */
    /* is renamed on the way in and on the way out.                     */
    proc sql;
      create table want(rename=(case_new=case)) as
      select a.serosite, a.pfgeprimary, a.dtspec,
             case
               /* same serotype and same pattern: cluster member */
               when a.serosite eq b.serosite and a.pfgeprimary eq b.pfgeprimary then 1
               /* different serotype: control, whatever the pattern */
               when a.serosite ne b.serosite then 0
               /* same serotype, different pattern: excluded */
               else .
             end as case_new
      from have as a,
           (select distinct serosite, pfgeprimary
              from have(rename=(case=case_flag)) as h
              where h.case_flag eq 1) as b;
    quit;

Xia Keshan
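Since the thread describes one output dataset per cluster, the same assignment could be wrapped in a macro loop. The following is a hedged sketch, not a tested solution: the macro name `flag_clusters`, its parameters, and the data set names `have`, `clusters` and `cluster_1`, `cluster_2`, ... are all hypothetical. It assumes DtSpec is a numeric SAS date and that a data set with one row per cluster (SeroSite and PfgePrimary) already exists. It also assumes that a record with the cluster's serotype and pattern but a date outside the window is set to missing, which the thread does not spell out.

```sas
/* Sketch only. Assumptions (not from the thread):                      */
/*   &data     holds all records, with DtSpec as a numeric SAS date     */
/*   &clusters has one row per cluster, with SeroSite and PfgePrimary   */
%macro flag_clusters(data=have, clusters=clusters, window=45);
  %local i n_clusters;

  /* count the clusters to loop over */
  proc sql noprint;
    select count(*) into :n_clusters trimmed from &clusters;
  quit;

  %do i = 1 %to &n_clusters;
    data cluster_&i;
      /* read the i-th cluster once; variables read by SET are retained */
      if _n_ = 1 then
        set &clusters(keep=SeroSite PfgePrimary firstobs=&i obs=&i
                      rename=(SeroSite=CaseSero PfgePrimary=CasePfge));
      set &data;
      /* cluster member: same serotype and pattern, inside the window */
      if SeroSite = CaseSero and PfgePrimary = CasePfge
         and today() - &window <= DtSpec <= today() then case = 1;
      /* same serotype but not a cluster member: excluded */
      else if SeroSite = CaseSero then case = .;
      /* different serotype: control */
      else case = 0;
      drop CaseSero CasePfge;
    run;
  %end;
%mend flag_clusters;

%flag_clusters(data=have, clusters=clusters, window=45)
```

Reading the cluster row with a conditional SET keeps the comparison values on every record without a join, so each output dataset has exactly one row per input record.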
Sorry I just made a typo...The 3rd Enteritidis should say 01/28/2014. The year on the 4th Enteritidis should be 2013.