Hello,
I am exploring how to use DISTINCT in SQL SUM(a=b)
Here is what I have
data have1 ;
input
COUNTRY1 YEAR COUNTRY2 $5.;
datalines;
001 2000 101
001 2000 002
001 2001 103
002 2000 201
002 2000 202
002 2001 203
003 2001 203
003 2000 301
003 2000 302
003 2001 303
;
run;
data have2 ;
input
COUNTRY PRODUCT STORE PRICE $5.;
datalines;
001 A1 B1 2
001 A2 B2 2
001 A3 B3 3
002 A3 B1 4
002 A4 B2 1
002 A5 B2 2
003 A5 B3 1
003 A6 B1 2
003 A7 B2 3
003 A8 B2 1
101 A1 B3 2
103 A3 B1 3
201 A3 B2 2
202 A4 B3 1
203 A5 B2 2
203 A8 B3 1
301 A6 B1 1
302 A4 B2 3
303 A6 B3 2
;
run;
Thanks to your help. I have figured out the code.
Here is what I am using
proc sql ;
create table want as
select a.*,
sum(b.product=c.product) as product_overlap,
SUM(b.STORE = c.STORE) as store_overlap,
SUM(b.PRICE = c.PRICE) as price_overlap
from have1 a
left join have2 b on a.COUNTRY1=b.COUNTRY
left join have2 c on a.COUNTRY2=c.COUNTRY
group by COUNTRY1, YEAR, COUNTRY2;
quit;
But since there could be duplicate STORE, PRICE, PRODUCT. How can I only use the distinct values in those variables?
For example, when I want to compute PRODUCT_OVERLAP, I only select the DISTINCT PRODUCT of country1 and country2 and count their overlap.
when I want to compute STORE_OVERLAP, I only select the DISTINCT STORE of country1 and country2 and count their overlap.
I tried to
SUM(DISTINCT b.STORE = c.STORE) as store_overlap
But I think it is not right.
Thank you allso.
It helps to also show the expected output from you example data with your rules.
I am not quite sure why you have year in one set but not the other. One seldom expects prices to stay the same across time.
Like this?
proc sql ;
select a.*
,count(distinct p1.PRODUCT) as PRODUCT_OVERLAP
,count(distinct s1.STORE ) as STORE_OVERLAP
from HAVE1 a
left join (select unique COUNTRY ,PRODUCT from HAVE2) p1 on a.COUNTRY1 = p1.COUNTRY
inner join (select unique COUNTRY ,PRODUCT from HAVE2) p2 on a.COUNTRY2 = p2.COUNTRY & p1.PRODUCT=p2.PRODUCT
left join (select unique COUNTRY ,STORE from HAVE2) s1 on a.COUNTRY1 = s1.COUNTRY
inner join (select unique COUNTRY ,STORE from HAVE2) s2 on a.COUNTRY2 = s2.COUNTRY & s1.STORE=s2.STORE
group by COUNTRY1, YEAR, COUNTRY2;
quit;
COUNTRY1 | YEAR | COUNTRY2 | PRODUCT_OVERLAP | STORE_OVERLAP |
---|---|---|---|---|
1 | 2000 | 2 | 1 | 2 |
1 | 2000 | 101 | 1 | 1 |
1 | 2001 | 103 | 1 | 1 |
2 | 2000 | 201 | 1 | 1 |
2 | 2001 | 203 | 1 | 1 |
3 | 2000 | 301 | 1 | 1 |
3 | 2001 | 203 | 2 | 2 |
3 | 2001 | 303 | 1 | 1 |
If you want to keep all the original rows.
proc sql ;
select r.*
,count(distinct p1.PRODUCT) as PRODUCT_OVERLAP
,count(distinct s1.STORE ) as STORE_OVERLAP
from HAVE1 a
left join (select unique COUNTRY ,PRODUCT from HAVE2) p1 on a.COUNTRY1 = p1.COUNTRY
inner join (select unique COUNTRY ,PRODUCT from HAVE2) p2 on a.COUNTRY2 = p2.COUNTRY & p1.PRODUCT=p2.PRODUCT
left join (select unique COUNTRY ,STORE from HAVE2) s1 on a.COUNTRY1 = s1.COUNTRY
inner join (select unique COUNTRY ,STORE from HAVE2) s2 on a.COUNTRY2 = s2.COUNTRY & s1.STORE=s2.STORE
right join HAVE1 r on a.COUNTRY1=r.COUNTRY1 & a.YEAR=r.YEAR & a.COUNTRY2=r.COUNTRY2
group by r.COUNTRY1, r.YEAR, r.COUNTRY2;
quit;
COUNTRY1 | YEAR | COUNTRY2 | PRODUCT_OVERLAP | STORE_OVERLAP |
---|---|---|---|---|
1 | 2000 | 2 | 1 | 2 |
1 | 2000 | 101 | 1 | 1 |
1 | 2001 | 103 | 1 | 1 |
2 | 2000 | 201 | 1 | 1 |
2 | 2000 | 202 | 0 | 0 |
2 | 2001 | 203 | 1 | 1 |
3 | 2000 | 301 | 1 | 1 |
3 | 2000 | 302 | 0 | 0 |
3 | 2001 | 203 | 2 | 2 |
3 | 2001 | 303 | 1 | 1 |
Mmm this wasn't quite right either.
I this this nails it:
proc sql ;
select a.*
,count(distinct p.PRODUCT) as PRODUCT_OVERLAP
,count(distinct s.STORE ) as STORE_OVERLAP
from HAVE1 a
left join (select a.*, p1.PRODUCT from HAVE1 a
left join (select unique COUNTRY ,PRODUCT from HAVE2) p1 on a.COUNTRY1 = p1.COUNTRY
inner join (select unique COUNTRY ,PRODUCT from HAVE2) p2 on a.COUNTRY2 = p2.COUNTRY & p1.PRODUCT=p2.PRODUCT) p
on a.COUNTRY1=p.COUNTRY1 & a.YEAR=p.YEAR & a.COUNTRY2=p.COUNTRY2
left join (select a.*, s1.STORE from HAVE1 a
left join (select unique COUNTRY ,STORE from HAVE2) s1 on a.COUNTRY1 = s1.COUNTRY
inner join (select unique COUNTRY ,STORE from HAVE2) s2 on a.COUNTRY2 = s2.COUNTRY & s1.STORE=s2.STORE) s
on a.COUNTRY1=s.COUNTRY1 & a.YEAR=s.YEAR & a.COUNTRY2=s.COUNTRY2
group by a.COUNTRY1, a.YEAR, a.COUNTRY2;
quit;
COUNTRY1 | YEAR | COUNTRY2 | PRODUCT_OVERLAP | STORE_OVERLAP |
---|---|---|---|---|
1 | 2000 | 2 | 1 | 2 |
1 | 2000 | 101 | 1 | 1 |
1 | 2001 | 103 | 1 | 1 |
2 | 2000 | 201 | 1 | 1 |
2 | 2000 | 202 | 1 | 0 |
2 | 2001 | 203 | 1 | 1 |
3 | 2000 | 301 | 1 | 1 |
3 | 2000 | 302 | 0 | 1 |
3 | 2001 | 203 | 2 | 2 |
3 | 2001 | 303 | 1 | 1 |
Hello,
Thank you very much. I think this works very well.
But I also have two issues with it.
1. It gives me a lot of duplicate rows. I am not sure where is going on.
2. Instead of the number of overlap, what if I would like to calculate the ratio of overlap.
i.e. OVERLAP of STORE for COUNTRY1, OVERLAP of STORE for COUNTRY2, and OVERLAP of STORE for both COUNTRY1 AND COUNTRY2
The numerator is still the OVERLAP, but denomenator will be UNIQUE STORE of COUNTRY1, UNIQUE STORE of COUNTRY2, UNIQUE STORE among COUNTRY1 AND COUNTRY2
Thank you very much.
1. Duplicates mean that the data is not as described.
Maybe adding the keyword unique solves this issue (hopefully in the subqueries rather than the main query, to reduce volume)?
2. Provide an example please.
Thank you. Sorry I recently have a cold and did not check replies.
So the basic thought is
proc sql ;
select a.*
,count(distinct p.PRODUCT) as PRODUCT_OVERLAP
,count(distinct s.STORE ) as STORE_OVERLAP
,count(distinct p.PRODUCT) / count(distinct COUNTRY1'S PRODUCT) as OVERLAP_COUNTRY1
,count(distinct p.PRODUCT) / count(distinct COUNTRY2'S PRODUCT) as OVERLAP_COUNTRY2
,count(distinct p.PRODUCT) / count(distinct COUNTRY1'S and COUNTRY2'S PRODUCT) as OVERLAP_ALL
Since I am trying to get overlap between them. Suppose now we already have the number of overlap. Now I would like to compute the ratio of overlap compared with COUNTRY1, COUNTRY2, COUNTRY1 and 2.
Using one example here:
count(distinct p.PRODUCT) / count(distinct COUNTRY1'S PRODUCT) as OVERLAP_COUNTRY1
For COUNTRY 001, its PRODUCT OVERLAP with COUNTRY 101 is 1. (Only A1 is overlap)
Its PRODUCT OVERLAP WITH COUNTRY 101 divided by itself unique PRODUCT = 1/3. (A1 / (A1 + A2 +A3))
Its PRODUCT OVERLAP WITH COUNTRY 101 divided by COUNTRY 101's unique PRODUCT = 1/1. (A1 / A1)
Its PRODUCT OVERLAP WITH COUNTRY 101 divided by COUNTRY 001 and 101's unique PRODUCT = 1/3. (A1 / A1 +A2 +A3) (Since 001 and 101 's total PRODUCT is A1,A1,A2,A3, but we only care unique value, so duplicated A1 is only counted once)
Thank you.
Keep warm.
Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.
Register today!Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Select SAS Training centers are offering in-person courses. View upcoming courses for: