BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Cruise
Ammonite | Level 13

Hi Folks: 

I have separate data sets that all share 'date' variable and each one of them are aggregated by date of test, age, gender and province separately. Example data below is taken for a specific date: 21976. Actual data sets have many dates. Merging 4 data sets on the 'date' allows me to test the effect of a categorical variable at a time, i.e., by age and gender et.c. But I'd like to analyze this data by multi-level strata: age*gender*province instead.

The frequency tables by confirmed_date*age*gender*province is available for all time points of confirmed_date. 

proc freq.png

Is there any way to interleave these 4 separate data sets to a single data to date*sex*age*province structure given the data based on the frequency table above? 

 

data age; 
input date age confirmed deceased;
cards;
21976	1	32	0
21976	2	169	0
21976	3	1235 0
21976	4	506	1
21976	5	633	1
21976	6	834	5
21976	7	530	6
21976	8	192	6
21976	9	81	3
;

data gender;
input date sex confirmed deceased;
cards;
21976	female	2621	9
21976	male	1591	13
;

data test;
input date test	negative confirmed released deceased;
cards;
21976	109591	71580	4212	31	22
;

data province;
input date	province confirmed released deceased;
cards;
21976	1	88	1	0
21976	2	11	0	0
21976	3	78	0	0
21976	4	3081 7	15
21976	5	14	0	0
21976	6	19	0	0
21976	7	9	2	0
21976	8	92	8	1
21976	9	624	1	6
21976	10	64	1	0
21976	11	7	1	0
21976	12	2	0	0
21976	13	6	1	0
21976	14	5	0	0
21976	15	1	0	0
21976	16	91	12	0
21976	17	20	0	0
; 


 

1 ACCEPTED SOLUTION

Accepted Solutions
FreelanceReinh
Jade | Level 19

If 2,770 were the "total N disease," we'd know that 372,002>=D>=372,002-2,770=369,232 in the numerator, while 395,194-2,770=392,424 would be the "total no disease" in the denominator. Hence, 0.9409<=specificity<=0.9480.

 

However, in the text you wrote "true positive N=2,770," i.e., A=2,770. In this case we'd know B=9,661-2,770=6,891, but we could only compute a non-trivial upper bound: specificity<=372,002/(372,002+6,891)=0.9818.

 

If there was a possibility that people could be tested more than once, the "total N tests" might possibly exceed the total number of subjects found in demographic tables (with age, sex and province).

View solution in original post

12 REPLIES 12
FreelanceReinh
Jade | Level 19

@Cruise wrote:

Is there any way to interleave these 4 separate data sets to a single data to date*sex*age*province structure?   


Hi Cruise,

 

This is impossible, unless we make strong assumptions about the joint (empirical) distribution of these categorical variables (and the dependent variable), for example, we assume that these variables are independent so that, e.g., the male/female ratio is the same in all age groups and all provinces, similarly the age distribution is homogeneous among provinces, etc. These assumptions (as they are most likely incorrect) would invalidate the results of all statistical analyses that go beyond the univariate analyses which you can do with the existing four datasets (age, gender, test, province).

 

However, if you had the complete set of frequency tables of date*age*gender*province*(dependent variable) combinations, then the required information would be available. But I'm not sure if this is the case because the screenshot showing 5 (or at best 19) individuals from prov (=province?) 16, whereas the province dataset contains 91, is not convincing. Also, why would you bother reconstructing the information from aggregated tables if the detail table was available?

 

 

Cruise
Ammonite | Level 13

@FreelanceReinh 

Great questions! My goal is to calculate the test specificity based on the confirmed, negative and disease (yes, no). The patient-level data contains only confirmed cases. However, these separate aggregated data sets include the cases whose tests turned out negative. That's where the inconsistency of 5 prov (province) vs 91 provinces come from that you pointed out. What a catch there. Your questions made me realize that I don't have a patient-level 'proc freq' data for the 'negative' test cases which can be systematically different from the 'confirmed' cases. Is there thinkable estimate work that can be done if perfect precision was not sought after?   

FreelanceReinh
Jade | Level 19

The estimate of the test specificity is the proportion of subjects with "negative" test result among the subjects with "disease: no". I don't see the disease variable in the data you've shown. Moreover, variable negative is not contained in datasets age, gender and province. (Why are the 91 cases in province 16 counted as "confirmed" if they "include the cases whose tests turned out negative," as you say?) If you had all this information, you could estimate the specificity by age, by sex and by province -- a fairly detailed analysis already. Would you really want to go beyond that and differentiate between estimated specificities in subgroups like "male subjects in age group 4" or "female subjects in age group 7"? The cross-table frequencies for age*sex*province in the relatively small set of confirmed cases would hardly be a reliable basis for estimates about negative cases (considering not only variability, but also potential dependencies).

Cruise
Ammonite | Level 13

Disease variable is implicit in the patient-level data that I have consisted of true disease cases (N=2,770). The variable negative is only available in the 'test' data and test data can be linked to the other datasets by the 'date' variable that they all share. the reason why province data has "confirmed" is that these separate and linkable datasets are inclusive of confirmed cases that can be found in the patient-level data, I realize. Yes, I can see now that I don't have to do three way or four way strata. 

As you pointed out: if you know all these information...

I'm staring at the data I have vs data I may not have required to calculate test specificity. 

I calculated 1. Total N of test (N=395,194); 2. Total N of negative (N=372,002); 3. Total N of positive results (N=9,661, labelled as confirmed) compared to Total N of patients (true positive N=2,770) in the patient-level data. I just noted that the variable 'confirmed' in aggregated data sets are described as: Total N of positive results in the data dictionary. Which makes sense since it's few times higher than true positive cases N=2,770 which is the endpoints of the patient-level data. There is an inconsistency between these numbers. For example, the sum of 9,661 and 372,002 don't add up to 395,194 as shown in the screen-shot. Probably, because of missing data which I have no way of tracing down, I think. 

Thanks for brainstorming with me and your questions. They helped. 

Please let me know if any thoughts on what I shared here in the text and the screenshot below. 

 

test specRain.png

FreelanceReinh
Jade | Level 19

If 2,770 were the "total N disease," we'd know that 372,002>=D>=372,002-2,770=369,232 in the numerator, while 395,194-2,770=392,424 would be the "total no disease" in the denominator. Hence, 0.9409<=specificity<=0.9480.

 

However, in the text you wrote "true positive N=2,770," i.e., A=2,770. In this case we'd know B=9,661-2,770=6,891, but we could only compute a non-trivial upper bound: specificity<=372,002/(372,002+6,891)=0.9818.

 

If there was a possibility that people could be tested more than once, the "total N tests" might possibly exceed the total number of subjects found in demographic tables (with age, sex and province).

Cruise
Ammonite | Level 13

@FreelanceReinh Thanks a lot. the solution you provided totally makes sense to me. More I think of and read on the subject 'confirmed' is highly likely means that both possibilities of True Positive and False Negative have been verified, hence, hospitalized. So, I'm strongly inclined to take 2,770 for a Total Disease (yes). I'll work on to calculate age, gender and province specific specificity tomorrow hoping that the multiple-testing per person scenario would not cause too much of a problem. Thanks again. 

Cruise
Ammonite | Level 13

@FreelanceReinh 

Would you agree with me that calculating gender, age and province specific specificity is not possible. Let me show you for a gender and the rest is the same. I have number of 'total positive' and ' total disease' by gender as shown below. 

From 'gender' dataset, I can pull out: 

Total Positive (A+B)=9,661. Male (N=3,834) and Female (N=5,827);

Total Disease (yes) A+C=2,770. Male (N=1,218), Female (N=1,546) and Missing (N=6);

 

But I don't know how many of Total Negative (C+D) N=372,002 and Total N of tests done N=395,194 are male vs female because the source data test is only at the date level. All demographic can be linked to the 'test' dataset only by 'date' variable. 

 

Thank you. 

 

ask Rein again.png

 

FreelanceReinh
Jade | Level 19

@Cruise wrote:

@FreelanceReinh 

Would you agree with me that calculating gender, age and province specific specificity is not possible.


Yes, I agree. There are too many unknowns in the formula.

Cruise
Ammonite | Level 13
Off the topic. Noticed that you have a freelance in your title. I wonder if you consult doctoral students on their dissertations?
FreelanceReinh
Jade | Level 19

@Cruise wrote:
Off the topic. Noticed that you have a freelance in your title. I wonder if you consult doctoral students on their dissertations?

I did so in an earlier job (at a German university), more than ten years ago, but since I switched to freelancing in 2010, I've only had commercial clients (clinical research). Of course, I chose this user name deliberately. :-)

Cruise
Ammonite | Level 13
Nice spot that I'm aiming for. But I gotta graduate first at least 🙂 I just completed a 6-month Co-Op at a large pharma and loved it.

SAS Innovate 2025: Call for Content

Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!

Submit your idea!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 12 replies
  • 1213 views
  • 7 likes
  • 2 in conversation