Hello, I have asked similar question before but I found I only achieved partial goal. So I have data like this DATA HAVE1;
INPUT (NAME1 NAME2) (:$8.);
CARDS;
Z J
Z K
X J
X K
;
DATA HAVE2;
INPUT (NAME X1 X2) (:$8.);
CARDS;
Z A 1
Z D 2
X C 3
X D 4
J A 1
J B 1
K E 3
K D 4
; What I want is like this DATA WANT;
INPUT (NAME1 NAME2 OVERLAPX1 OVERLAPX1_1 OVERLAPX1_2 OVERLAPX1_3 OVERLAPX2 OVERLAPX2_1 OVERLAPX2_2 OVERLAPX2_3) (:$8.);
CARDS;
Z J 1 0.5 0.5 1/3 1 0.5 1 0.5
Z K 1 0.5 0.5 1/3 0 0 0 0
X J 0 0 0 0 0 0 0 0
X K 1 0.5 0.5 1/3 2 1 1 1
; First step, compute the number of overlap between NAME1 and NAME2. For example (using first row in WANT as example) So OVERLAPX1 is the number of overlap in X1 between NAME1 and NAME2 e.g. Only A is the overlap between Z and J, so it is 1. Second, repeat first step, and divided it by distinct value of X1 for NAME1 OVERLAPX1_1 is the OVERLAPX1 divided by total distinct value of X1 for NAME1. For example. There is 1 overlap (A) between Z and J, but Z has 2 unique values in X1 (A, D). So the ratio is 1/2 = 0.5 Third, repeat first step, and divided it by distinct value of X1 for NAME2 OVERLAPX1_2 is the OVERLAPX1 divided by total distinct value of X1 for NAME2. For example. There is 1 overlap (A) between Z and J, but J has 2 unique value in X1 (A, B). So the ratio is 1/2 = 0.5 Fourth, repeat first step, and divided it by total distinct value of X1 for both NAME1 and NAME2 OVERLAPX1_3 is the OVERLAPX1 divided by total distinct value of X1 for NAME1 & NAME2. For example, There is 1 overlap (A) between Z and J, but Z and J totally has 3 unique value in X1 (A, B, D). So the ratio is 1/3 Till now, I can achieve it with you guys help last time. But, I would like to repeat the above steps by using X2, instead of X1. I tried left join (select unique NAME1, X2). But it is confusing. The trouble is all the source data are in the same table and NAME1 & NAME2 have to come in pair. It is kinda a headache. I also tried such codes proc sql;
select sum(a.X1=b.X1) as OVERLAPX1,
sum(a.X1=b.X1)/COUNT(DISTINCT a.X1) as OVERLAPX1_NAME1,
sum(a.X1=b.X1)/COUNT(DISTINCT b.X1) as OVERLAPX1_NAME2,
sum(a.X1=b.X1)/SUM(COUNT(DISTINCT a.X1)+ COUNT(DISTINCT b.X1) - sum(a.X1=b.X1)) as OVERLAPX1_NAME1_NAME2
,sum(c.X2=d.X2) as OVERLAPX2,
sum(c.X2=d.X2)/COUNT(DISTINCT c.X2) as OVERLAPX2_NAME1,
sum(c.X2=d.X2)/COUNT(DISTINCT d.X2) as OVERLAPX2_NAME2,
sum(c.X2=d.X2)/SUM(COUNT(DISTINCT c.X2)+ COUNT(DISTINCT d.X2) - sum(c.X2=d.X2)) as OVERLAPX2_NAME1_NAME2
from HAVE1 n
left join
(select unique NAME,X1 from HAVE2) a on n.NAME1 = a.NAME
left join
(select unique NAME,X1 from HAVE2) b on n.NAME2 = b.NAME
left join
(select unique NAME,X2 from HAVE2) c on n.NAME1 = c.NAME
left join
(select unique NAME,X2 from HAVE2) d on n.NAME2 = d.NAME
group by n.NAME1, n.NAME2
order by n.NAME1, n.NAME2;
quit; But this is giving me wrong value. Would you please check where is wrong? How to achieve this using SQL. It is always my favorite. Thank you all. Bless you.
... View more