BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
yanshuai
Quartz | Level 8

Hello,

 

Seems my last post is complicated. I am trying to simplify my case.

 

I have dataset like this

 

DATA HAVE1;
    INPUT (NAME1 NAME2 X1 X2) (:$8.);
    CARDS;
Z J a a
Z J b a
Z J c b
X C b a
X C b c
X C c b

;
run;

 

 

1. I would like to compute the number of OVERLAP between distinct X1 and X2. Group by NAME1, NAME2.

2. Take the result of the 1st step, and divide it by the number of total distinct value in X1 and X2. Group by NAME1, NAME2.

 

Here is the example of what I want.

 

DATA WANT;
    INPUT (NAME1 NAME2 OVERLAP OVERLAPALL) (:$8.);
    CARDS;
Z J 2 2/3     
X C 2 2/3
;
/* In first row, OVERLAP is 2 because X1 and X2 only have 'a' 'b' in common between Z and J. There is only 2 overlap. 2/3 means OVERLAP divided by the total # of distinct value of both X1 and X2 between Z and J*/

So it is like this:

OVERLAP : count(distinct X1 = distinct X2).

OVERLAPALL: count(distinct X1 = distinct X2) / count(distinct X1, X2).

All the computation above is based on GROUP BY NAME1, NAME2.

 

Would you please help me with this? Hopefully I make myself clear.

 

Thank you all.

1 ACCEPTED SOLUTION

Accepted Solutions
PGStats
Opal | Level 21

Here's how:

 

DATA HAVE1;
    INPUT (NAME1 NAME2 X1 X2) (:$8.);
    CARDS;
Z J a a
Z J b a
Z J c b
X C b a
X C b c
X C c b
;

proc sql;
select a.name1, a.name2, a.overlap, b.all, c.nbX1, c.nbX2
from

(select name1, name2, count(*) as overlap
from
(select name1, name2, x1 as x from have1
 intersect
 select name1, name2, x2 as x from have1)
group by name1, name2) as a

inner join

(select name1, name2, count(*) as all
from
(select name1, name2, x1 as x from have1
 union
 select name1, name2, x2 as x from have1)
group by name1, name2) as b 
    on a.name1=b.name1 and a.name2=b.name2

inner join

(select name1, name2, 
    count(distinct X1) as nbX1,
    count(distinct X2) as nbX2
 from have1
 group by name1, name2) as c 
    on a.name1=c.name1 and a.name2=c.name2

;

quit;

For some nasty reason, my SAS dies when I try to do this with natural joins Smiley Sad. That's why I went back to explicit joins.

 

PG

View solution in original post

13 REPLIES 13
novinosrin
Tourmaline | Level 20

Can you format the data have to make it readable plz

yanshuai
Quartz | Level 8

Sorry.

 

I modify it.

 

Hope this works.

novinosrin
Tourmaline | Level 20

for the group z j

 

Z J a a
Z J b a
Z J c b

 

a & b seems to be found? doesn't that make it 2/3?

yanshuai
Quartz | Level 8

My mistake.

I have changed it.

yanshuai
Quartz | Level 8

And you are right.

It is supposed to be 2/3. and 2 OVERLAP.

novinosrin
Tourmaline | Level 20

Hi for the just OVERLAP

it's easy

 

DATA HAVE1;
    INPUT (NAME1 NAME2 X1 X2) (:$8.);
    CARDS;
Z J a a
Z J b a
Z J c b
X C b a
X C b c
X C c b

;
run;
proc sql;
create table want as
select a.name1,a.name2,count( distinct a.x1=b.x2 )as overlap
from have1 a, have1(keep=name: x2)b
where a.name1=b.name1 and a.name2=b.name2
group by a.name1,a.name2;
quit;

For OVERLAP ALL, 

i. To get the distinct count of all values x1,x2 combined which is a,b,c is 3 will require some logic. 

 

I do have a solution in my mind, but if it's not urgent, shall we wait for somebody else to take a shot first ?Coz I am expecting a better solution and robust solution from somebody else. Let's give it sometime

yanshuai
Quartz | Level 8

Thank you very much.

 

At first I though this is can be done in a few steps. But it seems harder than I thought. So I have to come here and ask for some help.

PGStats
Opal | Level 21

SQL set operations provide a straightforward way to do this:

 

DATA HAVE1;
    INPUT (NAME1 NAME2 X1 X2) (:$8.);
    CARDS;
Z J a a
Z J b a
Z J c b
X C b a
X C b c
X C c b
;

proc sql;
select * 
from

(select name1, name2, count(*) as overlap
from
(select name1, name2, x1 as x from have1
 intersect
 select name1, name2, x2 as x from have1)
group by name1, name2) 

natural join

(select name1, name2, count(*) as all
from
(select name1, name2, x1 as x from have1
 union
 select name1, name2, x2 as x from have1)
group by name1, name2);

quit;

Notes: UNION and INTERSECT operations provide DISTINCT observations (unless the ALL keyword is specified). NATURAL JOIN performs a join on all same-named columns.

 

PG
yanshuai
Quartz | Level 8

Fantastic.

First time know this command. Thank you very much.

 

Then overlap/all will be my goal, which is kinda easy to do.

 

 

But what if I would like to also compute the following?

1. COUNT(DISTINCT X1) Group by NAME1.   So that I could know the ratio of OVERLAP compared with X1.

2. COUNT(DISTINCT X2) Group by NAME1, NAME2. So that I could know the ratio of OVERLAP compared with X2.

 

How to put those things in the same SQL?

PGStats
Opal | Level 21

Here's how:

 

DATA HAVE1;
    INPUT (NAME1 NAME2 X1 X2) (:$8.);
    CARDS;
Z J a a
Z J b a
Z J c b
X C b a
X C b c
X C c b
;

proc sql;
select a.name1, a.name2, a.overlap, b.all, c.nbX1, c.nbX2
from

(select name1, name2, count(*) as overlap
from
(select name1, name2, x1 as x from have1
 intersect
 select name1, name2, x2 as x from have1)
group by name1, name2) as a

inner join

(select name1, name2, count(*) as all
from
(select name1, name2, x1 as x from have1
 union
 select name1, name2, x2 as x from have1)
group by name1, name2) as b 
    on a.name1=b.name1 and a.name2=b.name2

inner join

(select name1, name2, 
    count(distinct X1) as nbX1,
    count(distinct X2) as nbX2
 from have1
 group by name1, name2) as c 
    on a.name1=c.name1 and a.name2=c.name2

;

quit;

For some nasty reason, my SAS dies when I try to do this with natural joins Smiley Sad. That's why I went back to explicit joins.

 

PG
novinosrin
Tourmaline | Level 20
DATA HAVE1;
    INPUT (NAME1 NAME2 X1 X2) (:$8.);
    CARDS;
Z J a a
Z J b a
Z J c b
X C b a
X C b c
X C c b
;

proc transpose data=have1 out=_have;
by name: x: notsorted;
var x:;
run;

proc sort data=_have out=__have;
by name: col1;
run;

data want;
set __have;
by name: col1;
if first.name2 then do; overlap=0;overlap_all=0;end;
if not first.col1 and lag(_name_) ne _name_ then overlap+1;
if first.col1 then overlap_all+1;
if last.name2;
want=overlap/overlap_all;
drop x: _name_ col1;
run;
yanshuai
Quartz | Level 8

Thank you.

It seems it is a loop way to do this.

Also great.

 Thanks.

novinosrin
Tourmaline | Level 20

Hi @yanshuai Thank you acknowledging the solutions. Can you please mark the genie  @PGStats solution as answered and close the thread unless you want to wait for more responses. Thanks!

 

 

hackathon24-white-horiz.png

2025 SAS Hackathon: There is still time!

Good news: We've extended SAS Hackathon registration until Sept. 12, so you still have time to be part of our biggest event yet – our five-year anniversary!

Register Now

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 13 replies
  • 4002 views
  • 2 likes
  • 3 in conversation