We are working on a case-case comparison project.
We want to identify clusters of Salmonella cases that share the same serotype (variable=SeroSite) and PFGE pattern (variable=PfgePrimary) within a 45-day period (variable=DtSpec) and compare them to all other records in the dataset (excluding records that match the cluster by serotype only). Cases that match by both serotype and PFGE are assigned a value of 1 for the variable case. If a record matches by serotype but not by pattern, we would like to set case to missing. If a record matches by neither serotype nor PFGE pattern, we would like to assign case a value of 0. Our problem is that the cluster will always be changing, so we can't code around specific values. We would like to automate it as much as possible.
Below is what the data looks like so far...
SeroSite | PfgePrimary | DtSpec | Case |
---|---|---|---|
Enteritidis | XBA.0004 | 01/01/2014 | 1 |
Enteritidis | XBA.0004 | 01/18/2014 | 1 |
Enteritidis | XBA.0004 | 02/28/2014 | 1 |
Typhimurium | XBA.1314 | 02/04/2014 | |
Enteritidis | XBA.0005 | 03/01/2014 | |
Cerro | XBA.2327 | 02/15/2014 | |
Typhimurium | XBA.1212 | 02/01/2014 | |
This is what we need it to look like...
SeroSite | PfgePrimary | DtSpec | Case |
---|---|---|---|
Enteritidis | XBA.0004 | 01/01/2014 | 1 |
Enteritidis | XBA.0004 | 01/18/2014 | 1 |
Enteritidis | XBA.0004 | 02/28/2014 | 1 |
Typhimurium | XBA.1314 | 02/04/2014 | 0 |
Enteritidis | XBA.0005 | 03/01/2014 | . |
Cerro | XBA.2327 | 02/15/2014 | 0 |
Typhimurium | XBA.1212 | 02/01/2014 | 0 |
We are unsure how to go about this. We could assign a value to case if we could somehow create a dummy serotype variable (CaseSero) on every record in the dataset, holding the value of SeroSite from the records where case=1. Any suggestions would be helpful.
Do you need to manually assign the range on DtSpec to determine the 45-day period, or do you look at the date range recorded for the PfgePrimary and then determine the 45-day period?
Do you do this for each serotype/PfgePrimary pair or only selected ones?
What do you do if the PfgePrimary has a range of dates greater than 45 days? There are likely a few more questions about the order of the rules and the desired output.
I think a more detailed step-by-step description of a manual process will help us answer your question for automation.
"Our problem is that the cluster will always be changing,"
I don't understand. If you already have 1 in CASE, can't you fix these clusters from that, i.e. give them an initial value when case=1?
We interview all Salmonella cases about their exposure history in the 7-day period before their illness. We have over 3 years' worth of exposure data to look at. If a cluster of 2 or more cases is identified within a 45-day period, there is a greater chance that there is a common exposure of interest.
The 45-day period is based on "Today's Date", that is, whatever day you run the program. If a cluster of 2 or more cases (serotype/PFGE match) is found within that 45-day range, a new dataset is output with that cluster sorted at the top and assigned case=1. If a record has the same serotype/PFGE match but falls outside of the 45-day window, it is not considered part of the cluster. This is a standard way of looking at things, as we are trying to identify common exposures during outbreaks.
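The window rule described above could be sketched roughly as follows. This is only an illustration under several assumptions: the dataset names `all_cases` and `clusters` and the macro variable `run_date` are made up, DtSpec is assumed to be stored as a SAS date, the window is taken to include both end dates, and the run date is fixed so the rows posted earlier fall inside it (in production it would presumably be `today()`).

```sas
/* Rows from the table posted above; DtSpec read as a SAS date (assumption). */
data all_cases;
  infile datalines truncover;
  input SeroSite :$20. PfgePrimary :$20. DtSpec :mmddyy10.;
  format DtSpec mmddyy10.;
  datalines;
Enteritidis XBA.0004 01/01/2014
Enteritidis XBA.0004 01/18/2014
Enteritidis XBA.0004 02/28/2014
Typhimurium XBA.1314 02/04/2014
Enteritidis XBA.0005 03/01/2014
Cerro XBA.2327 02/15/2014
Typhimurium XBA.1212 02/01/2014
;
run;

/* Hypothetical fixed run date for the example; replace with %sysfunc(today()). */
%let run_date = '01MAR2014'd;

/* Keep serotype/PFGE pairs with 2 or more cases inside the 45-day window. */
proc sql;
  create table clusters as
  select SeroSite, PfgePrimary, count(*) as n_cases
  from all_cases
  where DtSpec between &run_date - 45 and &run_date
  group by SeroSite, PfgePrimary
  having count(*) >= 2;
quit;
```

With these rows and this run date, the 01/01/2014 record falls outside the window, so the Enteritidis / XBA.0004 pair qualifies on the strength of the two remaining cases.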
A new dataset is generated for each and every cluster identified. So one week, there may be one cluster identified with one dataset output. In another week, there may be 10 clusters identified...so 10 datasets will be output with one cluster sorted at the top in each dataset. Whichever cluster is at the top will be assigned a value of 1 for the variable case. All other records in the dataset will need to be assigned a value of case=. or case=0. Missing will need to be those records that have the same serotype as the case but different PFGE. Zeros will be those with a different serotype than the case.
Each cluster will be evaluated separately. We will compare exposures for those cases to all of the other records in the dataset (excluding those records that have the same serotype as the case but different PFGE). Everything else in the dataset, which is a moving target of records over the last 3 years, will be considered the "control" group. I hope this explains a bit better.
The part we need help with is the assignment of missing and zeros to the variable case. Everything else has been completed. We need to figure out a way to do this based on the serotype and PFGE of the case.
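One way this assignment might be approached is a DATA step that carries the cluster's serotype and PFGE onto every record, much like the CaseSero idea mentioned earlier. The sketch below is hedged: the dataset names `all_cases`, `cluster`, and `want` and the variable CasePfge are hypothetical, the `cluster` dataset is assumed to hold exactly one row (the cluster being evaluated), and the 45-day window is not re-checked here, so a record matching both serotype and PFGE outside the window would still get case=1 unless a date condition is added.

```sas
/* Rows from the table posted above (DtSpec read as a SAS date, an assumption). */
data all_cases;
  infile datalines truncover;
  input SeroSite :$20. PfgePrimary :$20. DtSpec :mmddyy10.;
  format DtSpec mmddyy10.;
  datalines;
Enteritidis XBA.0004 01/01/2014
Enteritidis XBA.0004 01/18/2014
Enteritidis XBA.0004 02/28/2014
Typhimurium XBA.1314 02/04/2014
Enteritidis XBA.0005 03/01/2014
Cerro XBA.2327 02/15/2014
Typhimurium XBA.1212 02/01/2014
;
run;

/* Hypothetical one-row dataset describing the cluster under evaluation. */
data cluster;
  length CaseSero CasePfge $20;
  CaseSero = 'Enteritidis';
  CasePfge = 'XBA.0004';
run;

data want;
  /* Read the single cluster row once; variables read with SET are retained,
     so CaseSero and CasePfge stay available on every record. */
  if _n_ = 1 then set cluster;
  set all_cases;
  if SeroSite = CaseSero and PfgePrimary = CasePfge then Case = 1;
  else if SeroSite = CaseSero then Case = .;  /* same serotype, different PFGE */
  else Case = 0;                              /* different serotype */
run;
```

Because nothing is hard-coded in the final step, swapping in a different one-row `cluster` dataset would re-score the whole file for that cluster.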
I still don't follow, since I am not in the pharmaceutical field. You said there is a 45-day window, but in your original data the dates for the category marked 1 span more than that window.
    data have;
      infile cards truncover expandtabs;
      input serosite : $20. pfgeprimary : $20. dtspec : $20. case;
      cards;
    Enteritidis XBA.0004 01/01/2014 1
    Enteritidis XBA.0004 01/18/2014 1
    Enteritidis XBA.0004 02/28/2014 1
    Typhimurium XBA.1314 02/04/2014
    Enteritidis XBA.0005 03/01/2014
    Cerro XBA.2327 02/15/2014
    Typhimurium XBA.1212 02/01/2014
    ;
    run;

    /* b holds the distinct serotype/PFGE pair already flagged case=1.
       The join has no ON clause, so this relies on b having a single row. */
    proc sql;
      create table want as
      select a.serosite, a.pfgeprimary, a.dtspec,
             case
               when a.serosite eq b.serosite and a.pfgeprimary eq b.pfgeprimary then 1
               when a.serosite ne b.serosite and a.pfgeprimary ne b.pfgeprimary then 0
               when a.serosite eq b.serosite and a.pfgeprimary ne b.pfgeprimary then .
               else 99999  /* different serotype but same PFGE: flagged for review */
             end as case
      from have as a,
           (select distinct serosite, pfgeprimary
              from have as h
             where h.case eq 1) as b;
    quit;
Xia Keshan
Sorry I just made a typo...The 3rd Enteritidis should say 01/28/2014. The year on the 4th Enteritidis should be 2013.