Solved: Re: How to interleave multiple data sets one-way aggregated by categor...

Cruise · Posted 04-04-2020 09:08 AM

Hi Folks:

I have separate data sets that all share 'date' variable and each one of them are aggregated by date of test, age, gender and province separately. Example data below is taken for a specific date: 21976. Actual data sets have many dates. Merging 4 data sets on the 'date' allows me to test the effect of a categorical variable at a time, i.e., by age and gender et.c. But I'd like to analyze this data by multi-level strata: age*gender*province instead.

The frequency tables by confirmed_date*age*gender*province is available for all time points of confirmed_date.

Is there any way to interleave these 4 separate data sets to a single data to date*sex*age*province structure given the data based on the frequency table above?

data age; 
input date age confirmed deceased;
cards;
21976	1	32	0
21976	2	169	0
21976	3	1235 0
21976	4	506	1
21976	5	633	1
21976	6	834	5
21976	7	530	6
21976	8	192	6
21976	9	81	3
;

data gender;
input date sex confirmed deceased;
cards;
21976	female	2621	9
21976	male	1591	13
;

data test;
input date test	negative confirmed released deceased;
cards;
21976	109591	71580	4212	31	22
;

data province;
input date	province confirmed released deceased;
cards;
21976	1	88	1	0
21976	2	11	0	0
21976	3	78	0	0
21976	4	3081 7	15
21976	5	14	0	0
21976	6	19	0	0
21976	7	9	2	0
21976	8	92	8	1
21976	9	624	1	6
21976	10	64	1	0
21976	11	7	1	0
21976	12	2	0	0
21976	13	6	1	0
21976	14	5	0	0
21976	15	1	0	0
21976	16	91	12	0
21976	17	20	0	0
;

FreelanceReinh · Posted 04-04-2020 04:35 PM

If 2,770 were the "total N disease," we'd know that 372,002>=D>=372,002-2,770=369,232 in the numerator, while 395,194-2,770=392,424 would be the "total no disease" in the denominator. Hence, 0.9409<=specificity<=0.9480.

However, in the text you wrote "true positive N=2,770," i.e., A=2,770. In this case we'd know B=9,661-2,770=6,891, but we could only compute a non-trivial upper bound: specificity<=372,002/(372,002+6,891)=0.9818.

If there was a possibility that people could be tested more than once, the "total N tests" might possibly exceed the total number of subjects found in demographic tables (with age, sex and province).

View solution in original post

FreelanceReinh · Posted 04-04-2020 09:56 AM

@Cruise wrote:

Is there any way to interleave these 4 separate data sets to a single data to date*sex*age*province structure?

Hi Cruise,

This is impossible, unless we make strong assumptions about the joint (empirical) distribution of these categorical variables (and the dependent variable), for example, we assume that these variables are independent so that, e.g., the male/female ratio is the same in all age groups and all provinces, similarly the age distribution is homogeneous among provinces, etc. These assumptions (as they are most likely incorrect) would invalidate the results of all statistical analyses that go beyond the univariate analyses which you can do with the existing four datasets (age, gender, test, province).

However, if you had the complete set of frequency tables of date*age*gender*province*(dependent variable) combinations, then the required information would be available. But I'm not sure if this is the case because the screenshot showing 5 (or at best 19) individuals from prov (=province?) 16, whereas the province dataset contains 91, is not convincing. Also, why would you bother reconstructing the information from aggregated tables if the detail table was available?

Cruise · Posted 04-04-2020 11:30 AM

@FreelanceReinh

Great questions! My goal is to calculate the test specificity based on the confirmed, negative and disease (yes, no). The patient-level data contains only confirmed cases. However, these separate aggregated data sets include the cases whose tests turned out negative. That's where the inconsistency of 5 prov (province) vs 91 provinces come from that you pointed out. What a catch there. Your questions made me realize that I don't have a patient-level 'proc freq' data for the 'negative' test cases which can be systematically different from the 'confirmed' cases. Is there thinkable estimate work that can be done if perfect precision was not sought after?

Cruise · Posted 04-04-2020 11:55 AM

@Kurt_Bremser @PeterClemmensen @Reeza @mkeintz @ChrisNZ @JerryV may interest you as well. I appreciate your help.

FreelanceReinh · Posted 04-04-2020 12:25 PM

The estimate of the test specificity is the proportion of subjects with "negative" test result among the subjects with "disease: no". I don't see the disease variable in the data you've shown. Moreover, variable negative is not contained in datasets age, gender and province. (Why are the 91 cases in province 16 counted as "confirmed" if they "include the cases whose tests turned out negative," as you say?) If you had all this information, you could estimate the specificity by age, by sex and by province -- a fairly detailed analysis already. Would you really want to go beyond that and differentiate between estimated specificities in subgroups like "male subjects in age group 4" or "female subjects in age group 7"? The cross-table frequencies for age*sex*province in the relatively small set of confirmed cases would hardly be a reliable basis for estimates about negative cases (considering not only variability, but also potential dependencies).

Cruise · Posted 04-04-2020 01:56 PM

Disease variable is implicit in the patient-level data that I have consisted of true disease cases (N=2,770). The variable negative is only available in the 'test' data and test data can be linked to the other datasets by the 'date' variable that they all share. the reason why province data has "confirmed" is that these separate and linkable datasets are inclusive of confirmed cases that can be found in the patient-level data, I realize. Yes, I can see now that I don't have to do three way or four way strata.

As you pointed out: if you know all these information...

I'm staring at the data I have vs data I may not have required to calculate test specificity.

I calculated 1. Total N of test (N=395,194); 2. Total N of negative (N=372,002); 3. Total N of positive results (N=9,661, labelled as confirmed) compared to Total N of patients (true positive N=2,770) in the patient-level data. I just noted that the variable 'confirmed' in aggregated data sets are described as: Total N of positive results in the data dictionary. Which makes sense since it's few times higher than true positive cases N=2,770 which is the endpoints of the patient-level data. There is an inconsistency between these numbers. For example, the sum of 9,661 and 372,002 don't add up to 395,194 as shown in the screen-shot. Probably, because of missing data which I have no way of tracing down, I think.

Thanks for brainstorming with me and your questions. They helped.

Please let me know if any thoughts on what I shared here in the text and the screenshot below.

test specRain.png

FreelanceReinh · Posted 04-04-2020 04:35 PM

If 2,770 were the "total N disease," we'd know that 372,002>=D>=372,002-2,770=369,232 in the numerator, while 395,194-2,770=392,424 would be the "total no disease" in the denominator. Hence, 0.9409<=specificity<=0.9480.

However, in the text you wrote "true positive N=2,770," i.e., A=2,770. In this case we'd know B=9,661-2,770=6,891, but we could only compute a non-trivial upper bound: specificity<=372,002/(372,002+6,891)=0.9818.

If there was a possibility that people could be tested more than once, the "total N tests" might possibly exceed the total number of subjects found in demographic tables (with age, sex and province).

Cruise · Posted 04-04-2020 09:13 PM

@FreelanceReinh Thanks a lot. the solution you provided totally makes sense to me. More I think of and read on the subject 'confirmed' is highly likely means that both possibilities of True Positive and False Negative have been verified, hence, hospitalized. So, I'm strongly inclined to take 2,770 for a Total Disease (yes). I'll work on to calculate age, gender and province specific specificity tomorrow hoping that the multiple-testing per person scenario would not cause too much of a problem. Thanks again.

Cruise · Posted 04-05-2020 11:30 AM

@FreelanceReinh

Would you agree with me that calculating gender, age and province specific specificity is not possible. Let me show you for a gender and the rest is the same. I have number of 'total positive' and ' total disease' by gender as shown below.

From 'gender' dataset, I can pull out:

Total Positive (A+B)=9,661. Male (N=3,834) and Female (N=5,827);

Total Disease (yes) A+C=2,770. Male (N=1,218), Female (N=1,546) and Missing (N=6);

But I don't know how many of Total Negative (C+D) N=372,002 and Total N of tests done N=395,194 are male vs female because the source data test is only at the date level. All demographic can be linked to the 'test' dataset only by 'date' variable.

Thank you.

ask Rein again.png

FreelanceReinh · Posted 04-05-2020 12:01 PM

@Cruise wrote:

@FreelanceReinh

Would you agree with me that calculating gender, age and province specific specificity is not possible.

Yes, I agree. There are too many unknowns in the formula.

Cruise · Posted 04-05-2020 12:23 PM

Off the topic. Noticed that you have a freelance in your title. I wonder if you consult doctoral students on their dissertations?

FreelanceReinh · Posted 04-05-2020 02:43 PM

@Cruise wrote:
Off the topic. Noticed that you have a freelance in your title. I wonder if you consult doctoral students on their dissertations?

I did so in an earlier job (at a German university), more than ten years ago, but since I switched to freelancing in 2010, I've only had commercial clients (clinical research). Of course, I chose this user name deliberately. :-)

Cruise · Posted 04-05-2020 03:09 PM

Nice spot that I'm aiming for. But I gotta graduate first at least 🙂 I just completed a 6-month Co-Op at a large pharma and loved it.

How to interleave multiple data sets one-way aggregated by categorical variables

Re: How to interleave multiple data sets one-way aggregated by categorical variables

Re: How to interleave multiple data sets one-way aggregated by categorical variables

Re: How to interleave multiple data sets one-way aggregated by categorical variables

Re: How to interleave multiple data sets one-way aggregated by categorical variables

Re: How to interleave multiple data sets one-way aggregated by categorical variables

Re: How to interleave multiple data sets one-way aggregated by categorical variables

Re: How to interleave multiple data sets one-way aggregated by categorical variables

Re: How to interleave multiple data sets one-way aggregated by categorical variables

Re: How to interleave multiple data sets one-way aggregated by categorical variables

Re: How to interleave multiple data sets one-way aggregated by categorical variables

Re: How to interleave multiple data sets one-way aggregated by categorical variables

Re: How to interleave multiple data sets one-way aggregated by categorical variables

Re: How to interleave multiple data sets one-way aggregated by categorical variables

Catch up on SAS Innovate 2026

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away