Hello,
I have asked similar question before but I found I only achieved partial goal.
So I have data like this
DATA HAVE1;
INPUT (NAME1 NAME2) (:$8.);
CARDS;
Z J
Z K
X J
X K
;
DATA HAVE2;
INPUT (NAME X1 X2) (:$8.);
CARDS;
Z A 1
Z D 2
X C 3
X D 4
J A 1
J B 1
K E 3
K D 4
;
What I want is like this
DATA WANT;
INPUT (NAME1 NAME2 OVERLAPX1 OVERLAPX1_1 OVERLAPX1_2 OVERLAPX1_3 OVERLAPX2 OVERLAPX2_1 OVERLAPX2_2 OVERLAPX2_3) (:$8.);
CARDS;
Z J 1 0.5 0.5 1/3 1 0.5 1 0.5
Z K 1 0.5 0.5 1/3 0 0 0 0
X J 0 0 0 0 0 0 0 0
X K 1 0.5 0.5 1/3 2 1 1 1
;
First step, compute the number of overlap between NAME1 and NAME2.
For example (using first row in WANT as example)
So OVERLAPX1 is the number of overlap in X1 between NAME1 and NAME2 e.g. Only A is the overlap between Z and J, so it is 1.
Second, repeat first step, and divided it by distinct value of X1 for NAME1
OVERLAPX1_1 is the OVERLAPX1 divided by total distinct value of X1 for NAME1.
For example. There is 1 overlap (A) between Z and J, but Z has 2 unique values in X1 (A, D). So the ratio is 1/2 = 0.5
Third, repeat first step, and divided it by distinct value of X1 for NAME2
OVERLAPX1_2 is the OVERLAPX1 divided by total distinct value of X1 for NAME2.
For example. There is 1 overlap (A) between Z and J, but J has 2 unique value in X1 (A, B). So the ratio is 1/2 = 0.5
Fourth, repeat first step, and divided it by total distinct value of X1 for both NAME1 and NAME2
OVERLAPX1_3 is the OVERLAPX1 divided by total distinct value of X1 for NAME1 & NAME2.
For example, There is 1 overlap (A) between Z and J, but Z and J totally has 3 unique value in X1 (A, B, D). So the ratio is 1/3
Till now, I can achieve it with you guys help last time.
But, I would like to repeat the above steps by using X2, instead of X1.
I tried left join (select unique NAME1, X2). But it is confusing. The trouble is all the source data are in the same table and NAME1 & NAME2 have to come in pair. It is kinda a headache.
I also tried such codes
proc sql;
select sum(a.X1=b.X1) as OVERLAPX1,
sum(a.X1=b.X1)/COUNT(DISTINCT a.X1) as OVERLAPX1_NAME1,
sum(a.X1=b.X1)/COUNT(DISTINCT b.X1) as OVERLAPX1_NAME2,
sum(a.X1=b.X1)/SUM(COUNT(DISTINCT a.X1)+ COUNT(DISTINCT b.X1) - sum(a.X1=b.X1)) as OVERLAPX1_NAME1_NAME2
,sum(c.X2=d.X2) as OVERLAPX2,
sum(c.X2=d.X2)/COUNT(DISTINCT c.X2) as OVERLAPX2_NAME1,
sum(c.X2=d.X2)/COUNT(DISTINCT d.X2) as OVERLAPX2_NAME2,
sum(c.X2=d.X2)/SUM(COUNT(DISTINCT c.X2)+ COUNT(DISTINCT d.X2) - sum(c.X2=d.X2)) as OVERLAPX2_NAME1_NAME2
from HAVE1 n
left join
(select unique NAME,X1 from HAVE2) a on n.NAME1 = a.NAME
left join
(select unique NAME,X1 from HAVE2) b on n.NAME2 = b.NAME
left join
(select unique NAME,X2 from HAVE2) c on n.NAME1 = c.NAME
left join
(select unique NAME,X2 from HAVE2) d on n.NAME2 = d.NAME
group by n.NAME1, n.NAME2
order by n.NAME1, n.NAME2;
quit;
But this is giving me wrong value.
Would you please check where is wrong?
How to achieve this using SQL. It is always my favorite.
Thank you all.
Bless you.
Is there a limit to the number of overlaps you can have? You'll only ever need that list of variables?
If this needs to scale the approaches would be different.
I only care the number of overlap between NAME1 and NAME2 in HAVE1 list
For example. In terms of variable X1,
Z and J have one overlap in X1, which is A.
Hello.
Is this too complicated?
Sorry to confuse you.
Start from the simple one.
All I want is the overlap between NAME1 and NAME2 in HAVE1.
For example.
In HAVE1, in first row, I have Z and J, and I would like to count their overlap in variable X1.
How to achieve that?
I go to HAVE2, and see what Z has in X1 : A and D;
Then I see what J has in X1: A and B.
Z and J has one overlap in X1: which is "A".
So the count of overlap is 1.
The logic is to count the number of value in common between Z and J in variable X1.
Does this make myself clear?
Thank you very much.
Yes.
I actually do.
Using SQL
proc sql;
select sum(a.X1=b.X1) as OVERLAPX1,
sum(a.X1=b.X1)/COUNT(DISTINCT a.X1) as OVERLAPX1_NAME1,
sum(a.X1=b.X1)/COUNT(DISTINCT b.X1) as OVERLAPX1_NAME2,
sum(a.X1=b.X1)/SUM(COUNT(DISTINCT a.X1)+ COUNT(DISTINCT b.X1) - sum(a.X1=b.X1)) as OVERLAPX1_NAME1_NAME2
,sum(c.X2=d.X2) as OVERLAPX2,
sum(c.X2=d.X2)/COUNT(DISTINCT c.X2) as OVERLAPX2_NAME1,
sum(c.X2=d.X2)/COUNT(DISTINCT d.X2) as OVERLAPX2_NAME2,
sum(c.X2=d.X2)/SUM(COUNT(DISTINCT c.X2)+ COUNT(DISTINCT d.X2) - sum(c.X2=d.X2)) as OVERLAPX2_NAME1_NAME2
from HAVE1 n
left join
(select unique NAME,X1 from HAVE2) a on n.NAME1 = a.NAME
left join
(select unique NAME,X1 from HAVE2) b on n.NAME2 = b.NAME
left join
(select unique NAME,X2 from HAVE2) c on n.NAME1 = c.NAME
left join
(select unique NAME,X2 from HAVE2) d on n.NAME2 = d.NAME
group by n.NAME1, n.NAME2
order by n.NAME1, n.NAME2;
quit;
But it gives me wrong value.
Stuck here.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.