Solved: Compare three tables based on two variables or all variables

ModeratelyWise · Posted 07-25-2024 06:37 PM

Hello all,

I have three tables, all with the same 11 variables (see below), that I need to compare. I need to make a list of all observations that are identical across all three tables (Row 1 in all three tables for example), observations that have a unique combination of Person_ID and Incident_ID that are not found in the other two tables(Row 5 in table B for example), and observations that unique combination of all variables for each table (the fourth and fifth observation of table C for example). Hopefully this makes sense.

Thanks,

Marc

Table A

Person_ID	Incident_ID	Family_ID	Incident_Date	Invest_Comp_date	Incident_Type	Incident_Type_Collapsed	Invest_Finding	Finding_Collapsed	Internal_Incident	Internal_Investigation
1	1	1	17-Jan-2024	29-Jan-2024	1	3	5	2	0	0
2	1	1	17-Jan-2024	29-Jan-2024	1	3	5	2	0	0
3	1	1	17-Jan-2024	29-Jan-2024	1	3	5	2	0	0
4	2	2	24-Jan-2024	3-Mar-2024	12	3	4	0	0	0
1	3	1	4-Feb-2024	16-Feb-2024	1	3	10	0	1	1
3	3	1	4-Feb-2024	16-Feb-2024	1	3	10	0	1	1
5	4	3	15-Feb-2024	8-Apr-2024	13	3	7	1	1	1
6	4	3	15-Feb-2024	8-Apr-2024	13	3	7	1	1	1
7	4	3	15-Feb-2024	8-Apr-2024	13	3	7	1	1	1
8	4	3	15-Feb-2024	8-Apr-2024	13	3	7	1	1	1
1	5	1	6-Mar-2024	12-Apr-2024	3	3	5	2	1	1
2	5	1	6-Mar-2024	12-Apr-2024	3	3	5	2	1	1
3	5	1	6-Mar-2024	12-Apr-2024	3	3	5	2	1	1

Table B

Person_ID	Incident_ID	Family_ID	Incident_Date	Invest_Comp_date	Incident_Type	Incident_Type_Collapsed	Invest_Finding	Finding_Collapsed	Internal_Incident	Internal_Investigation
1	1	1	17-Jan-2024	29-Jan-2024	1	3	5	2	0	0
2	1	1	17-Jan-2024	29-Jan-2024	1	3	5	2	0	0
3	1	1	17-Jan-2024	29-Jan-2024	13	3	4	0	0	0
1	3	1	4-Feb-2024	16-Feb-2024	1	3	10	0	1	1
2	3	1	4-Feb-2024	16-Feb-2024	1	3	10	0	1	1
3	3	1	4-Feb-2024	16-Feb-2024	1	3	10	0	1	1
5	4	3	15-Feb-2024	8-Apr-2024	13	3	7	1	1	1
6	4	3	15-Feb-2024	8-Apr-2024	13	3	7	1	1	1
8	4	3	15-Feb-2024	8-Apr-2024	13	3	7	1	1	1
1	5	1	6-Mar-2024	12-Apr-2024	3	3	5	2	0	0
2	5	1	6-Mar-2024	12-Apr-2024	3	3	5	2	1	1
3	5	1	6-Mar-2024	12-Apr-2024	3	3	5	2	1	1
9	5	1	6-Mar-2024	12-Apr-2024	3	3	5	2	1	1
10	5	1	6-Mar-2024	12-Apr-2024	3	3	5	2	1	1

Table C

Person_ID	Incident_ID	Family_ID	Incident_Date	Invest_Comp_date	Incident_Type	Incident_Type_Collapsed	Invest_Finding	Finding_Collapsed	Internal_Incident	Internal_Investigation
1	1	1	17-Jan-2024	29-Jan-2024	1	3	5	2	0	0
2	1	1	17-Jan-2024	29-Jan-2024	1	3	5	2	0	0
3	1	1	17-Jan-2024	29-Jan-2024	1	3	5	2	0	0
4	2	2	24-Jan-2024	3-Jun-2024	12	3	4	0	0	0
11	2	2	24-Jan-2024	3-Jun-2024	12	3	4	0	0	0
1	3	1	4-Feb-2024	16-Feb-2024	1	3	10	0	1	1
3	3	1	4-Feb-2024	16-Feb-2024	1	3	10	0	1	1
5	4	3	15-Feb-2024	8-Apr-2024	13	3	7	1	1	1
6	4	3	15-Feb-2024	8-Apr-2024	13	3	7	1	1	1
7	4	3	15-Feb-2024	8-Apr-2024	13	3	7	1	1	1
8	4	3	15-Feb-2024	8-Apr-2024	13	3	7	1	1	1
1	5	1	6-Mar-2024	12-Apr-2024	3	3	5	2	1	1
2	5	1	6-Mar-2024	12-Apr-2024	3	3	5	2	1	1
3	5	1	6-Mar-2024	12-Apr-2024	3	3	5	2	1	1
12	6	4	7-Mar-2024	13-Apr-2024	1	3	4	0	0	0

mkeintz · Posted 07-25-2024 11:53 PM

If sorting your datasets is not too expensive, then after sorting each dataset, a single data step can make a dataset for each combination of intersections.

Untested, in the absence of a working sample data set.

proc sort data=a  out=a_sorted;
  by _all_;
run;
proc sort data=b  out=b_sorted;
  by _all_;
run;
proc sort data=c  out=c_sorted;
  by _all_;
run;

data all_three a_only b_only c_only a_and_b  a_and_c  b_and_c ;
  merge a_sorted (in=ina)  b_sorted (in=inb)  c_sorted (in=inc);
  by _all_;
  select (4*ina + 2*inb + 1*inc);
    when (7) output all_three;
    when (6) output a_and_b;
    when (5) output a_and_c;
    when (4) output a_only;
    when (3) output b_and_c;
    when (2) output b_only;
    when (1) output c_only;
  end;
run;

But really, why do this at all? Just make a single dataset with a permanent new variable abc=4*ina+2*inb+1*inc. Then you can use that variable as a filter for all subsequent processing.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

Ksharp · Posted 07-25-2024 08:31 PM

If you want find the same obs from these three tables, try this :
proc sql;
create table want as
select * from A
intersect
select * from B
intersect
select * from C
;
quit;

P.S. Other questions you could get by my similar code.

mkeintz · Posted 07-25-2024 11:53 PM

If sorting your datasets is not too expensive, then after sorting each dataset, a single data step can make a dataset for each combination of intersections.

Untested, in the absence of a working sample data set.

proc sort data=a  out=a_sorted;
  by _all_;
run;
proc sort data=b  out=b_sorted;
  by _all_;
run;
proc sort data=c  out=c_sorted;
  by _all_;
run;

data all_three a_only b_only c_only a_and_b  a_and_c  b_and_c ;
  merge a_sorted (in=ina)  b_sorted (in=inb)  c_sorted (in=inc);
  by _all_;
  select (4*ina + 2*inb + 1*inc);
    when (7) output all_three;
    when (6) output a_and_b;
    when (5) output a_and_c;
    when (4) output a_only;
    when (3) output b_and_c;
    when (2) output b_only;
    when (1) output c_only;
  end;
run;

But really, why do this at all? Just make a single dataset with a permanent new variable abc=4*ina+2*inb+1*inc. Then you can use that variable as a filter for all subsequent processing.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

ModeratelyWise · Posted 07-29-2024 06:30 PM

Thanks.
Forgive my ignorance, but how do create that variable in the combined dataset?

mkeintz · Posted 07-29-2024 11:07 PM

@ModeratelyWise wrote:
Thanks.
Forgive my ignorance, but how do create that variable in the combined dataset?

You could just use

from_abc=4*ina + 2*inb + 1*inc;

But it might be better just to keep three dummy variables as below. That would be easier to remember.

data all_groupings;
  merge a_sorted (in=ina)  b_sorted (in=inb)  c_sorted (in=inc);
  by _all_;
  from_a=ina;
  from_b=inb;
  from_c=inc;
run;

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Compare three tables based on two variables or all variables

Re: Compare three tables based on two variables or all variables

Re: Compare three tables based on two variables or all variables

Re: Compare three tables based on two variables or all variables

Re: Compare three tables based on two variables or all variables

Re: Compare three tables based on two variables or all variables

Compare three tables based on two variables or all variables

Re: Compare three tables based on two variables or all variables

Re: Compare three tables based on two variables or all variables

Re: Compare three tables based on two variables or all variables

Re: Compare three tables based on two variables or all variables

Re: Compare three tables based on two variables or all variables

SAS Innovate 2025: Call for Content

Click image to register for webinar

Classroom Training Available!