Solved: Re: How to count match between two variables

yanshuai · Posted 10-17-2018 05:21 PM

Hello,

Seems my last post is complicated. I am trying to simplify my case.

I have dataset like this

DATA HAVE1;
    INPUT (NAME1 NAME2 X1 X2) (:$8.);
    CARDS;
Z J a a
Z J b a
Z J c b
X C b a
X C b c
X C c b

;
run;

1. I would like to compute the number of OVERLAP between distinct X1 and X2. Group by NAME1, NAME2.

2. Take the result of the 1st step, and divide it by the number of total distinct value in X1 and X2. Group by NAME1, NAME2.

Here is the example of what I want.

DATA WANT;
    INPUT (NAME1 NAME2 OVERLAP OVERLAPALL) (:$8.);
    CARDS;
Z J 2 2/3     
X C 2 2/3
;

/* In first row, OVERLAP is 2 because X1 and X2 only have 'a' 'b' in common between Z and J. There is only 2 overlap. 2/3 means OVERLAP divided by the total # of distinct value of both X1 and X2 between Z and J*/

So it is like this:

OVERLAP : count(distinct X1 = distinct X2).

OVERLAPALL: count(distinct X1 = distinct X2) / count(distinct X1, X2).

All the computation above is based on GROUP BY NAME1, NAME2.

Would you please help me with this? Hopefully I make myself clear.

Thank you all.

PGStats · Posted 10-18-2018 05:17 PM

Here's how:

DATA HAVE1;
    INPUT (NAME1 NAME2 X1 X2) (:$8.);
    CARDS;
Z J a a
Z J b a
Z J c b
X C b a
X C b c
X C c b
;

proc sql;
select a.name1, a.name2, a.overlap, b.all, c.nbX1, c.nbX2
from

(select name1, name2, count(*) as overlap
from
(select name1, name2, x1 as x from have1
 intersect
 select name1, name2, x2 as x from have1)
group by name1, name2) as a

inner join

(select name1, name2, count(*) as all
from
(select name1, name2, x1 as x from have1
 union
 select name1, name2, x2 as x from have1)
group by name1, name2) as b 
    on a.name1=b.name1 and a.name2=b.name2

inner join

(select name1, name2, 
    count(distinct X1) as nbX1,
    count(distinct X2) as nbX2
 from have1
 group by name1, name2) as c 
    on a.name1=c.name1 and a.name2=c.name2

;

quit;

For some nasty reason, my SAS dies when I try to do this with natural joins . That's why I went back to explicit joins.

PG

View solution in original post

novinosrin · Posted 10-17-2018 05:29 PM

Can you format the data have to make it readable plz

yanshuai · Posted 10-17-2018 05:32 PM

Sorry.

I modify it.

Hope this works.

novinosrin · Posted 10-17-2018 05:42 PM

for the group z j

Z J a a
Z J b a
Z J c b

a & b seems to be found? doesn't that make it 2/3?

yanshuai · Posted 10-17-2018 05:43 PM

My mistake.

I have changed it.

yanshuai · Posted 10-17-2018 05:43 PM

And you are right.

It is supposed to be 2/3. and 2 OVERLAP.

novinosrin · Posted 10-17-2018 06:09 PM

Hi for the just OVERLAP

it's easy

DATA HAVE1;
    INPUT (NAME1 NAME2 X1 X2) (:$8.);
    CARDS;
Z J a a
Z J b a
Z J c b
X C b a
X C b c
X C c b

;
run;
proc sql;
create table want as
select a.name1,a.name2,count( distinct a.x1=b.x2 )as overlap
from have1 a, have1(keep=name: x2)b
where a.name1=b.name1 and a.name2=b.name2
group by a.name1,a.name2;
quit;

For OVERLAP ALL,

i. To get the distinct count of all values x1,x2 combined which is a,b,c is 3 will require some logic.

I do have a solution in my mind, but if it's not urgent, shall we wait for somebody else to take a shot first ?Coz I am expecting a better solution and robust solution from somebody else. Let's give it sometime

yanshuai · Posted 10-17-2018 07:35 PM

Thank you very much.

At first I though this is can be done in a few steps. But it seems harder than I thought. So I have to come here and ask for some help.

PGStats · Posted 10-18-2018 01:02 AM

SQL set operations provide a straightforward way to do this:

DATA HAVE1;
    INPUT (NAME1 NAME2 X1 X2) (:$8.);
    CARDS;
Z J a a
Z J b a
Z J c b
X C b a
X C b c
X C c b
;

proc sql;
select * 
from

(select name1, name2, count(*) as overlap
from
(select name1, name2, x1 as x from have1
 intersect
 select name1, name2, x2 as x from have1)
group by name1, name2) 

natural join

(select name1, name2, count(*) as all
from
(select name1, name2, x1 as x from have1
 union
 select name1, name2, x2 as x from have1)
group by name1, name2);

quit;

Notes: UNION and INTERSECT operations provide DISTINCT observations (unless the ALL keyword is specified). NATURAL JOIN performs a join on all same-named columns.

PG

yanshuai · Posted 10-18-2018 03:01 PM

Fantastic.

First time know this command. Thank you very much.

Then overlap/all will be my goal, which is kinda easy to do.

But what if I would like to also compute the following?

1. COUNT(DISTINCT X1) Group by NAME1. So that I could know the ratio of OVERLAP compared with X1.

2. COUNT(DISTINCT X2) Group by NAME1, NAME2. So that I could know the ratio of OVERLAP compared with X2.

How to put those things in the same SQL?

PGStats · Posted 10-18-2018 05:17 PM

Here's how:

DATA HAVE1;
    INPUT (NAME1 NAME2 X1 X2) (:$8.);
    CARDS;
Z J a a
Z J b a
Z J c b
X C b a
X C b c
X C c b
;

proc sql;
select a.name1, a.name2, a.overlap, b.all, c.nbX1, c.nbX2
from

(select name1, name2, count(*) as overlap
from
(select name1, name2, x1 as x from have1
 intersect
 select name1, name2, x2 as x from have1)
group by name1, name2) as a

inner join

(select name1, name2, count(*) as all
from
(select name1, name2, x1 as x from have1
 union
 select name1, name2, x2 as x from have1)
group by name1, name2) as b 
    on a.name1=b.name1 and a.name2=b.name2

inner join

(select name1, name2, 
    count(distinct X1) as nbX1,
    count(distinct X2) as nbX2
 from have1
 group by name1, name2) as c 
    on a.name1=c.name1 and a.name2=c.name2

;

quit;

For some nasty reason, my SAS dies when I try to do this with natural joins . That's why I went back to explicit joins.

PG

novinosrin · Posted 10-18-2018 11:57 AM

DATA HAVE1;
    INPUT (NAME1 NAME2 X1 X2) (:$8.);
    CARDS;
Z J a a
Z J b a
Z J c b
X C b a
X C b c
X C c b
;

proc transpose data=have1 out=_have;
by name: x: notsorted;
var x:;
run;

proc sort data=_have out=__have;
by name: col1;
run;

data want;
set __have;
by name: col1;
if first.name2 then do; overlap=0;overlap_all=0;end;
if not first.col1 and lag(_name_) ne _name_ then overlap+1;
if first.col1 then overlap_all+1;
if last.name2;
want=overlap/overlap_all;
drop x: _name_ col1;
run;

yanshuai · Posted 10-18-2018 03:38 PM

Thank you.

It seems it is a loop way to do this.

Also great.

Thanks.

novinosrin · Posted 10-18-2018 05:07 PM

Hi @yanshuai Thank you acknowledging the solutions. Can you please mark the genie @PGStats solution as answered and close the thread unless you want to wait for more responses. Thanks!

Registration is open

SAS Training: Just a Click Away