Re: Data Difference Proportion

mayasak · Posted 07-19-2022 12:52 AM

I have 2 data sets (below are partial datasets) for persons that are exposed to an "xyz" disease and want to find the proportion of those who were tested within 5-10 days from there exposure:

data Exp;
input Subject_ID Exp_Date : mmddyy10. Name $14-22 Case_ID Exposure $;
format exp_date date9.;
datalines;
1 12/20/2019 Joe Smith 122 Home
1 09/30/2019 Joe Smith 435 Home
2 01/05/2020 Mary Mill 432 Work
3 11/04/2019 John Doe 129 Work
3 11/04/2019 John Doe 655 School
3 01/25/2020 John Doe 721 Work
;;;
run;

data Test;
input Subject_ID Test_Date : mmddyy10. Name $14-23;
format test_date date9.;
datalines;
1 12/29/2019 Joe Smith
2 09/30/2019 Mary Mill
2 01/07/2020 Mary Mill
8 06/30/2019 Mark Roger
8 07/15/2019 Mark Roger
8 08/29/2019 Mark Roger
8 12/03/2019 Mark Roger
;;;;
run;

This is what I've done:

I merged the 2 datasets using the following code:

Proc SQL;
Create Table Merge AS
Select
E.Exp_Date,E.Case_ID,E.Exposure,
coalesce(E.Subject_ID, T.Subject_ID) as Subject_ID,
T.Test_Date
from exp E FULL JOIN test T on (E.Subject_ID = T.Subject_ID);
quit;

Then I found the date difference between exposure and test dates:

data HW;
set merge;
days_diff = intck('day', exp_date, test_date);
run;

The result was as below. It shows an issue that the dates were filled in for matched persons such as Joe Smith and Mary Mill:

My questions are:

How to get rid of the dates filled for the matched data.
How to get the proportion of those tested in the 5-10 days difference window. -Specifically the numerator and denominator.
Is there a better approach for getting the proportion?

Thank you

Renoux · Posted 07-19-2022 08:03 AM

Hi!

I am not sure to understand all the words you use, but I will try to answer.

[Question 1]

If you want to get rid of people who are in both work.Exp and work.Test, you add to your proc sql the condition [where E.Subject_ID=. or T.Subject_ID=.] :

Proc SQL;
Create Table Merge2 AS
Select
E.Exp_Date,E.Case_ID,E.Exposure,
coalesce(E.Subject_ID, T.Subject_ID) as Subject_ID,
T.Test_Date
from exp E FULL JOIN test T on (E.Subject_ID = T.Subject_ID)
where E.Subject_ID=. or T.Subject_ID=.;
quit;

if you want the opposite, to keep only subject who are in both table, you shall do an inner join instead of a full join :

Proc SQL;
Create Table Merge3 AS
Select
E.Exp_Date,E.Case_ID,E.Exposure,
coalesce(E.Subject_ID, T.Subject_ID) as Subject_ID,
T.Test_Date
from exp E INNER JOIN test T on (E.Subject_ID = T.Subject_ID);
quit;

Or left join for only subjects in table work.Exp or rigth join for only subjects in table work.Test.

[Question 2]

Is this the proportion you want ?

Proc SQL;
Create Table proportion AS
Select
mean(((days_diff>=5)*(days_diff<=10))) as proportion_in_5_to_10_days
from HW;
quit;

[Question 3]

There are different approachs to do this. "Better" approach depends on your point of view.

Regards.

Axel Renoux

mayasak · Posted 07-19-2022 10:58 AM

Thank you Axel,

For question 1, sorry for not being clear. What happened is:

Joe smith had exp_date from table one as 20DEC2019 and 30SEP2019 and only one test_date from table two as 29DEC2019. When I did the merge, there were two test_date for this person, both 29DEC2019 so he had two date_diff 9 and 90, and 90 is not a true date_diff.

Mary Mill, on the other hand, had only one exp_date and two tests, but after the merge, she had 2 exp_date and diff_date 2 and -97.

What I needed is to have only the 9 and 2 diff_date.

One thought I have is to deduplicate the data only when there's no corresponding test_date for the exp_date and vice versa but I have no clue ob how to do it.

Thank you

Renoux · Posted 07-19-2022 11:25 AM

To answer, I would need to know what information you would like in the following fictional case :

data Exp;
input Subject_ID Exp_Date : mmddyy10. Name $14-22 Case_ID Exposure $;
format exp_date date9.;
datalines;
9 12/20/2019 Mike Avery 122 Home
9 09/30/2019 Mike Avery 435 Home
;;;
run;

data Test;
input Subject_ID Test_Date : mmddyy10. Name $14-23;
format test_date date9.;
datalines;
9 10/29/2019 Mike Avery
9 01/07/2020 Mike Avery
;;;;
run;

Matching only (exp 12/20/2019) with (test 01/07/2020)? Or matching also (exp 09/30/2019) with (test 10/29/2019)? Or matching none of those two?

mayasak · Posted 07-19-2022 12:12 PM

In your factual example, we should match both because each time the person was exposed, he/she had a corresponding test after this exposure. But if a person had exposure with no corresponding test such as below, only 09/30/2019 and 10/29/2019 should be matched:

data Exp;
input Subject_ID Exp_Date : mmddyy10. Name $14-22 Case_ID Exposure $;
format exp_date date9.;
datalines;
9 12/20/2019 Mike Avery 122 Home
9 09/30/2019 Mike Avery 435 Home
;;;
run;

data Test;
input Subject_ID Test_Date : mmddyy10. Name $14-23;
format test_date date9.;
datalines;
9 10/29/2019 Mike Avery
;;;;
run;

The same thing for those who had one exposure and two tests such as below, only 09/30/2019 and 10/29/2019 should be matched:

data Exp;
input Subject_ID Exp_Date : mmddyy10. Name $14-22 Case_ID Exposure $;
format exp_date date9.;
datalines;
9 09/30/2019 Mike Avery 435 Home
;;;
run;

data Test;
input Subject_ID Test_Date : mmddyy10. Name $14-23;
format test_date date9.;
datalines;
9 10/29/2019 Mike Avery
9 01/07/2020 Mike Avery
;;;;
run;

Thank you

Data Difference Proportion