How to COUNT the overlap and compute its ratio?

yanshuai · Posted 10-15-2018 09:33 PM

Hello,

I have asked similar question before but I found I only achieved partial goal.

So I have data like this

DATA HAVE1;
    INPUT (NAME1 NAME2) (:$8.);
    CARDS;
Z J
Z K
X J
X K
;


DATA HAVE2;
    INPUT (NAME X1 X2) (:$8.);
    CARDS;
Z A 1
Z D 2
X C 3
X D 4
J A 1
J B 1
K E 3
K D 4
;

What I want is like this

DATA WANT;
    INPUT (NAME1 NAME2 OVERLAPX1 OVERLAPX1_1 OVERLAPX1_2 OVERLAPX1_3 OVERLAPX2 OVERLAPX2_1 OVERLAPX2_2 OVERLAPX2_3) (:$8.);
    CARDS;
Z J 1 0.5 0.5 1/3 1 0.5 1 0.5
Z K 1 0.5 0.5 1/3 0 0 0 0
X J 0 0 0 0 0 0 0 0
X K 1 0.5 0.5 1/3 2 1 1 1
;

First step, compute the number of overlap between NAME1 and NAME2.

For example (using first row in WANT as example)

So OVERLAPX1 is the number of overlap in X1 between NAME1 and NAME2 e.g. Only A is the overlap between Z and J, so it is 1.

Second, repeat first step, and divided it by distinct value of X1 for NAME1

OVERLAPX1_1 is the OVERLAPX1 divided by total distinct value of X1 for NAME1.

For example. There is 1 overlap (A) between Z and J, but Z has 2 unique values in X1 (A, D). So the ratio is 1/2 = 0.5

Third, repeat first step, and divided it by distinct value of X1 for NAME2

OVERLAPX1_2 is the OVERLAPX1 divided by total distinct value of X1 for NAME2.

For example. There is 1 overlap (A) between Z and J, but J has 2 unique value in X1 (A, B). So the ratio is 1/2 = 0.5

Fourth, repeat first step, and divided it by total distinct value of X1 for both NAME1 and NAME2

OVERLAPX1_3 is the OVERLAPX1 divided by total distinct value of X1 for NAME1 & NAME2.

For example, There is 1 overlap (A) between Z and J, but Z and J totally has 3 unique value in X1 (A, B, D). So the ratio is 1/3

Till now, I can achieve it with you guys help last time.

But, I would like to repeat the above steps by using X2, instead of X1.

I tried left join (select unique NAME1, X2). But it is confusing. The trouble is all the source data are in the same table and NAME1 & NAME2 have to come in pair. It is kinda a headache.

I also tried such codes

proc sql;
select sum(a.X1=b.X1) as OVERLAPX1,
		sum(a.X1=b.X1)/COUNT(DISTINCT a.X1) as OVERLAPX1_NAME1,
		sum(a.X1=b.X1)/COUNT(DISTINCT b.X1) as OVERLAPX1_NAME2,
		sum(a.X1=b.X1)/SUM(COUNT(DISTINCT a.X1)+ COUNT(DISTINCT b.X1) - sum(a.X1=b.X1)) as OVERLAPX1_NAME1_NAME2
		,sum(c.X2=d.X2) as OVERLAPX2,
		sum(c.X2=d.X2)/COUNT(DISTINCT c.X2) as OVERLAPX2_NAME1,
		sum(c.X2=d.X2)/COUNT(DISTINCT d.X2) as OVERLAPX2_NAME2,
		sum(c.X2=d.X2)/SUM(COUNT(DISTINCT c.X2)+ COUNT(DISTINCT d.X2) - sum(c.X2=d.X2)) as OVERLAPX2_NAME1_NAME2
from HAVE1 n
left join
(select unique NAME,X1 from HAVE2) a on n.NAME1 = a.NAME
left join
(select unique NAME,X1 from HAVE2) b on n.NAME2 = b.NAME
left join
(select unique NAME,X2 from HAVE2) c on n.NAME1 = c.NAME
left join
(select unique NAME,X2 from HAVE2) d on n.NAME2 = d.NAME
group by n.NAME1, n.NAME2
order by n.NAME1, n.NAME2;
quit;

But this is giving me wrong value.

Would you please check where is wrong?

How to achieve this using SQL. It is always my favorite.

Thank you all.

Bless you.

Reeza · Posted 10-15-2018 09:40 PM

Is there a limit to the number of overlaps you can have? You'll only ever need that list of variables?

If this needs to scale the approaches would be different.

yanshuai · Posted 10-16-2018 01:41 PM

I only care the number of overlap between NAME1 and NAME2 in HAVE1 list

For example. In terms of variable X1,

Z and J have one overlap in X1, which is A.

yanshuai · Posted 10-16-2018 06:04 PM

Hello.

Is this too complicated?

Reeza · Posted 10-16-2018 06:10 PM

I don't think the logic is clear of how to go from your have to your want table.

yanshuai · Posted 10-16-2018 06:19 PM

Sorry to confuse you.

Start from the simple one.

All I want is the overlap between NAME1 and NAME2 in HAVE1.

For example.

In HAVE1, in first row, I have Z and J, and I would like to count their overlap in variable X1.

How to achieve that?

I go to HAVE2, and see what Z has in X1 : A and D;

Then I see what J has in X1: A and B.

Z and J has one overlap in X1: which is "A".

So the count of overlap is 1.

The logic is to count the number of value in common between Z and J in variable X1.

Does this make myself clear?

Thank you very much.

Reeza · Posted 10-16-2018 06:22 PM

I dont have time to code something, but not sure why you couldn't just do a join on table1 to table2 (name=name1) and table2 (name=name2) and then do a freq report on the results.

yanshuai · Posted 10-16-2018 06:37 PM

Yes.

I actually do.

Using SQL

proc sql;
select sum(a.X1=b.X1) as OVERLAPX1,
		sum(a.X1=b.X1)/COUNT(DISTINCT a.X1) as OVERLAPX1_NAME1,
		sum(a.X1=b.X1)/COUNT(DISTINCT b.X1) as OVERLAPX1_NAME2,
		sum(a.X1=b.X1)/SUM(COUNT(DISTINCT a.X1)+ COUNT(DISTINCT b.X1) - sum(a.X1=b.X1)) as OVERLAPX1_NAME1_NAME2
		,sum(c.X2=d.X2) as OVERLAPX2,
		sum(c.X2=d.X2)/COUNT(DISTINCT c.X2) as OVERLAPX2_NAME1,
		sum(c.X2=d.X2)/COUNT(DISTINCT d.X2) as OVERLAPX2_NAME2,
		sum(c.X2=d.X2)/SUM(COUNT(DISTINCT c.X2)+ COUNT(DISTINCT d.X2) - sum(c.X2=d.X2)) as OVERLAPX2_NAME1_NAME2
from HAVE1 n
left join
(select unique NAME,X1 from HAVE2) a on n.NAME1 = a.NAME
left join
(select unique NAME,X1 from HAVE2) b on n.NAME2 = b.NAME
left join
(select unique NAME,X2 from HAVE2) c on n.NAME1 = c.NAME
left join
(select unique NAME,X2 from HAVE2) d on n.NAME2 = d.NAME
group by n.NAME1, n.NAME2
order by n.NAME1, n.NAME2;
quit;

But it gives me wrong value.

Stuck here.

How to COUNT the overlap and compute its ratio?

Re: How to COUNT the overlap and compute its ratio?

Re: How to COUNT the overlap and compute its ratio?

Re: How to COUNT the overlap and compute its ratio?

Re: How to COUNT the overlap and compute its ratio?

Re: How to COUNT the overlap and compute its ratio?

Re: How to COUNT the overlap and compute its ratio?

Re: How to COUNT the overlap and compute its ratio?

SAS Innovate 2025: Call for Content

Classroom Training Available!