Hi Folks:
I have separate data sets that all share 'date' variable and each one of them are aggregated by date of test, age, gender and province separately. Example data below is taken for a specific date: 21976. Actual data sets have many dates. Merging 4 data sets on the 'date' allows me to test the effect of a categorical variable at a time, i.e., by age and gender et.c. But I'd like to analyze this data by multi-level strata: age*gender*province instead.
The frequency tables by confirmed_date*age*gender*province is available for all time points of confirmed_date.
Is there any way to interleave these 4 separate data sets to a single data to date*sex*age*province structure given the data based on the frequency table above?
data age;
input date age confirmed deceased;
cards;
21976 1 32 0
21976 2 169 0
21976 3 1235 0
21976 4 506 1
21976 5 633 1
21976 6 834 5
21976 7 530 6
21976 8 192 6
21976 9 81 3
;
data gender;
input date sex confirmed deceased;
cards;
21976 female 2621 9
21976 male 1591 13
;
data test;
input date test negative confirmed released deceased;
cards;
21976 109591 71580 4212 31 22
;
data province;
input date province confirmed released deceased;
cards;
21976 1 88 1 0
21976 2 11 0 0
21976 3 78 0 0
21976 4 3081 7 15
21976 5 14 0 0
21976 6 19 0 0
21976 7 9 2 0
21976 8 92 8 1
21976 9 624 1 6
21976 10 64 1 0
21976 11 7 1 0
21976 12 2 0 0
21976 13 6 1 0
21976 14 5 0 0
21976 15 1 0 0
21976 16 91 12 0
21976 17 20 0 0
;
If 2,770 were the "total N disease," we'd know that 372,002>=D>=372,002-2,770=369,232 in the numerator, while 395,194-2,770=392,424 would be the "total no disease" in the denominator. Hence, 0.9409<=specificity<=0.9480.
However, in the text you wrote "true positive N=2,770," i.e., A=2,770. In this case we'd know B=9,661-2,770=6,891, but we could only compute a non-trivial upper bound: specificity<=372,002/(372,002+6,891)=0.9818.
If there was a possibility that people could be tested more than once, the "total N tests" might possibly exceed the total number of subjects found in demographic tables (with age, sex and province).
@Cruise wrote:
Is there any way to interleave these 4 separate data sets to a single data to date*sex*age*province structure?
Hi Cruise,
This is impossible, unless we make strong assumptions about the joint (empirical) distribution of these categorical variables (and the dependent variable), for example, we assume that these variables are independent so that, e.g., the male/female ratio is the same in all age groups and all provinces, similarly the age distribution is homogeneous among provinces, etc. These assumptions (as they are most likely incorrect) would invalidate the results of all statistical analyses that go beyond the univariate analyses which you can do with the existing four datasets (age, gender, test, province).
However, if you had the complete set of frequency tables of date*age*gender*province*(dependent variable) combinations, then the required information would be available. But I'm not sure if this is the case because the screenshot showing 5 (or at best 19) individuals from prov (=province?) 16, whereas the province dataset contains 91, is not convincing. Also, why would you bother reconstructing the information from aggregated tables if the detail table was available?
Great questions! My goal is to calculate the test specificity based on the confirmed, negative and disease (yes, no). The patient-level data contains only confirmed cases. However, these separate aggregated data sets include the cases whose tests turned out negative. That's where the inconsistency of 5 prov (province) vs 91 provinces come from that you pointed out. What a catch there. Your questions made me realize that I don't have a patient-level 'proc freq' data for the 'negative' test cases which can be systematically different from the 'confirmed' cases. Is there thinkable estimate work that can be done if perfect precision was not sought after?
@Kurt_Bremser @PeterClemmensen @Reeza @mkeintz @ChrisNZ @JerryV may interest you as well. I appreciate your help.
The estimate of the test specificity is the proportion of subjects with "negative" test result among the subjects with "disease: no". I don't see the disease variable in the data you've shown. Moreover, variable negative is not contained in datasets age, gender and province. (Why are the 91 cases in province 16 counted as "confirmed" if they "include the cases whose tests turned out negative," as you say?) If you had all this information, you could estimate the specificity by age, by sex and by province -- a fairly detailed analysis already. Would you really want to go beyond that and differentiate between estimated specificities in subgroups like "male subjects in age group 4" or "female subjects in age group 7"? The cross-table frequencies for age*sex*province in the relatively small set of confirmed cases would hardly be a reliable basis for estimates about negative cases (considering not only variability, but also potential dependencies).
Disease variable is implicit in the patient-level data that I have consisted of true disease cases (N=2,770). The variable negative is only available in the 'test' data and test data can be linked to the other datasets by the 'date' variable that they all share. the reason why province data has "confirmed" is that these separate and linkable datasets are inclusive of confirmed cases that can be found in the patient-level data, I realize. Yes, I can see now that I don't have to do three way or four way strata.
As you pointed out: if you know all these information...
I'm staring at the data I have vs data I may not have required to calculate test specificity.
I calculated 1. Total N of test (N=395,194); 2. Total N of negative (N=372,002); 3. Total N of positive results (N=9,661, labelled as confirmed) compared to Total N of patients (true positive N=2,770) in the patient-level data. I just noted that the variable 'confirmed' in aggregated data sets are described as: Total N of positive results in the data dictionary. Which makes sense since it's few times higher than true positive cases N=2,770 which is the endpoints of the patient-level data. There is an inconsistency between these numbers. For example, the sum of 9,661 and 372,002 don't add up to 395,194 as shown in the screen-shot. Probably, because of missing data which I have no way of tracing down, I think.
Thanks for brainstorming with me and your questions. They helped.
Please let me know if any thoughts on what I shared here in the text and the screenshot below.
If 2,770 were the "total N disease," we'd know that 372,002>=D>=372,002-2,770=369,232 in the numerator, while 395,194-2,770=392,424 would be the "total no disease" in the denominator. Hence, 0.9409<=specificity<=0.9480.
However, in the text you wrote "true positive N=2,770," i.e., A=2,770. In this case we'd know B=9,661-2,770=6,891, but we could only compute a non-trivial upper bound: specificity<=372,002/(372,002+6,891)=0.9818.
If there was a possibility that people could be tested more than once, the "total N tests" might possibly exceed the total number of subjects found in demographic tables (with age, sex and province).
@FreelanceReinh Thanks a lot. the solution you provided totally makes sense to me. More I think of and read on the subject 'confirmed' is highly likely means that both possibilities of True Positive and False Negative have been verified, hence, hospitalized. So, I'm strongly inclined to take 2,770 for a Total Disease (yes). I'll work on to calculate age, gender and province specific specificity tomorrow hoping that the multiple-testing per person scenario would not cause too much of a problem. Thanks again.
Would you agree with me that calculating gender, age and province specific specificity is not possible. Let me show you for a gender and the rest is the same. I have number of 'total positive' and ' total disease' by gender as shown below.
From 'gender' dataset, I can pull out:
Total Positive (A+B)=9,661. Male (N=3,834) and Female (N=5,827);
Total Disease (yes) A+C=2,770. Male (N=1,218), Female (N=1,546) and Missing (N=6);
But I don't know how many of Total Negative (C+D) N=372,002 and Total N of tests done N=395,194 are male vs female because the source data test is only at the date level. All demographic can be linked to the 'test' dataset only by 'date' variable.
Thank you.
@Cruise wrote:
Would you agree with me that calculating gender, age and province specific specificity is not possible.
Yes, I agree. There are too many unknowns in the formula.
@Cruise wrote:
Off the topic. Noticed that you have a freelance in your title. I wonder if you consult doctoral students on their dissertations?
I did so in an earlier job (at a German university), more than ten years ago, but since I switched to freelancing in 2010, I've only had commercial clients (clinical research). Of course, I chose this user name deliberately. :-)
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.