Hello,
Seems my last post is complicated. I am trying to simplify my case.
I have dataset like this
DATA HAVE1;
INPUT (NAME1 NAME2 X1 X2) (:$8.);
CARDS;
Z J a a
Z J b a
Z J c b
X C b a
X C b c
X C c b
;
run;
1. I would like to compute the number of OVERLAP between distinct X1 and X2. Group by NAME1, NAME2.
2. Take the result of the 1st step, and divide it by the number of total distinct value in X1 and X2. Group by NAME1, NAME2.
Here is the example of what I want.
DATA WANT;
INPUT (NAME1 NAME2 OVERLAP OVERLAPALL) (:$8.);
CARDS;
Z J 2 2/3
X C 2 2/3
;
/* In first row, OVERLAP is 2 because X1 and X2 only have 'a' 'b' in common between Z and J. There is only 2 overlap. 2/3 means OVERLAP divided by the total # of distinct value of both X1 and X2 between Z and J*/
So it is like this:
OVERLAP : count(distinct X1 = distinct X2).
OVERLAPALL: count(distinct X1 = distinct X2) / count(distinct X1, X2).
All the computation above is based on GROUP BY NAME1, NAME2.
Would you please help me with this? Hopefully I make myself clear.
Thank you all.
Here's how:
DATA HAVE1;
INPUT (NAME1 NAME2 X1 X2) (:$8.);
CARDS;
Z J a a
Z J b a
Z J c b
X C b a
X C b c
X C c b
;
proc sql;
select a.name1, a.name2, a.overlap, b.all, c.nbX1, c.nbX2
from
(select name1, name2, count(*) as overlap
from
(select name1, name2, x1 as x from have1
intersect
select name1, name2, x2 as x from have1)
group by name1, name2) as a
inner join
(select name1, name2, count(*) as all
from
(select name1, name2, x1 as x from have1
union
select name1, name2, x2 as x from have1)
group by name1, name2) as b
on a.name1=b.name1 and a.name2=b.name2
inner join
(select name1, name2,
count(distinct X1) as nbX1,
count(distinct X2) as nbX2
from have1
group by name1, name2) as c
on a.name1=c.name1 and a.name2=c.name2
;
quit;
For some nasty reason, my SAS dies when I try to do this with natural joins . That's why I went back to explicit joins.
Can you format the data have to make it readable plz
Sorry.
I modify it.
Hope this works.
for the group z j
Z J a a
Z J b a
Z J c b
a & b seems to be found? doesn't that make it 2/3?
My mistake.
I have changed it.
And you are right.
It is supposed to be 2/3. and 2 OVERLAP.
Hi for the just OVERLAP
it's easy
DATA HAVE1;
INPUT (NAME1 NAME2 X1 X2) (:$8.);
CARDS;
Z J a a
Z J b a
Z J c b
X C b a
X C b c
X C c b
;
run;
proc sql;
create table want as
select a.name1,a.name2,count( distinct a.x1=b.x2 )as overlap
from have1 a, have1(keep=name: x2)b
where a.name1=b.name1 and a.name2=b.name2
group by a.name1,a.name2;
quit;
For OVERLAP ALL,
i. To get the distinct count of all values x1,x2 combined which is a,b,c is 3 will require some logic.
I do have a solution in my mind, but if it's not urgent, shall we wait for somebody else to take a shot first ?Coz I am expecting a better solution and robust solution from somebody else. Let's give it sometime
Thank you very much.
At first I though this is can be done in a few steps. But it seems harder than I thought. So I have to come here and ask for some help.
SQL set operations provide a straightforward way to do this:
DATA HAVE1;
INPUT (NAME1 NAME2 X1 X2) (:$8.);
CARDS;
Z J a a
Z J b a
Z J c b
X C b a
X C b c
X C c b
;
proc sql;
select *
from
(select name1, name2, count(*) as overlap
from
(select name1, name2, x1 as x from have1
intersect
select name1, name2, x2 as x from have1)
group by name1, name2)
natural join
(select name1, name2, count(*) as all
from
(select name1, name2, x1 as x from have1
union
select name1, name2, x2 as x from have1)
group by name1, name2);
quit;
Notes: UNION and INTERSECT operations provide DISTINCT observations (unless the ALL keyword is specified). NATURAL JOIN performs a join on all same-named columns.
Fantastic.
First time know this command. Thank you very much.
Then overlap/all will be my goal, which is kinda easy to do.
But what if I would like to also compute the following?
1. COUNT(DISTINCT X1) Group by NAME1. So that I could know the ratio of OVERLAP compared with X1.
2. COUNT(DISTINCT X2) Group by NAME1, NAME2. So that I could know the ratio of OVERLAP compared with X2.
How to put those things in the same SQL?
Here's how:
DATA HAVE1;
INPUT (NAME1 NAME2 X1 X2) (:$8.);
CARDS;
Z J a a
Z J b a
Z J c b
X C b a
X C b c
X C c b
;
proc sql;
select a.name1, a.name2, a.overlap, b.all, c.nbX1, c.nbX2
from
(select name1, name2, count(*) as overlap
from
(select name1, name2, x1 as x from have1
intersect
select name1, name2, x2 as x from have1)
group by name1, name2) as a
inner join
(select name1, name2, count(*) as all
from
(select name1, name2, x1 as x from have1
union
select name1, name2, x2 as x from have1)
group by name1, name2) as b
on a.name1=b.name1 and a.name2=b.name2
inner join
(select name1, name2,
count(distinct X1) as nbX1,
count(distinct X2) as nbX2
from have1
group by name1, name2) as c
on a.name1=c.name1 and a.name2=c.name2
;
quit;
For some nasty reason, my SAS dies when I try to do this with natural joins . That's why I went back to explicit joins.
DATA HAVE1;
INPUT (NAME1 NAME2 X1 X2) (:$8.);
CARDS;
Z J a a
Z J b a
Z J c b
X C b a
X C b c
X C c b
;
proc transpose data=have1 out=_have;
by name: x: notsorted;
var x:;
run;
proc sort data=_have out=__have;
by name: col1;
run;
data want;
set __have;
by name: col1;
if first.name2 then do; overlap=0;overlap_all=0;end;
if not first.col1 and lag(_name_) ne _name_ then overlap+1;
if first.col1 then overlap_all+1;
if last.name2;
want=overlap/overlap_all;
drop x: _name_ col1;
run;
Thank you.
It seems it is a loop way to do this.
Also great.
Thanks.
Good news: We've extended SAS Hackathon registration until Sept. 12, so you still have time to be part of our biggest event yet – our five-year anniversary!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.