How to COUNT distinct value of two colums

yanshuai · Posted 09-17-2018 05:38 PM

Hello,

I am exploring how to use DISTINCT in SQL SUM(a=b)

Here is what I have

data have1 ;
input
COUNTRY1 YEAR COUNTRY2 $5.;
datalines;
001  2000  101
001  2000  002
001  2001  103
002  2000  201
002  2000  202
002  2001  203
003  2001  203
003  2000  301
003  2000  302
003  2001  303
;
run;

data have2 ;
input
COUNTRY PRODUCT STORE PRICE $5.;
datalines;
001  A1  B1  2
001  A2  B2  2
001  A3  B3  3
002  A3  B1  4
002  A4  B2  1
002  A5  B2  2
003  A5  B3  1
003  A6  B1  2
003  A7  B2  3
003  A8  B2  1
101  A1  B3  2
103  A3  B1  3
201  A3  B2  2
202  A4  B3  1
203  A5  B2  2
203  A8  B3  1
301  A6  B1  1
302  A4  B2  3
303  A6  B3  2
;
run;

Thanks to your help. I have figured out the code.

Here is what I am using

proc sql ;
  create table want as 
    select a.*,
		sum(b.product=c.product) as product_overlap,
		SUM(b.STORE = c.STORE) as store_overlap,
                SUM(b.PRICE = c.PRICE) as price_overlap
    from have1 a 
    left join have2 b on a.COUNTRY1=b.COUNTRY
    left join have2 c on a.COUNTRY2=c.COUNTRY
    group by COUNTRY1, YEAR, COUNTRY2;
quit;

But since there could be duplicate STORE, PRICE, PRODUCT. How can I only use the distinct values in those variables?

For example, when I want to compute PRODUCT_OVERLAP, I only select the DISTINCT PRODUCT of country1 and country2 and count their overlap.

when I want to compute STORE_OVERLAP, I only select the DISTINCT STORE of country1 and country2 and count their overlap.

I tried to

SUM(DISTINCT b.STORE = c.STORE) as store_overlap

But I think it is not right.

Thank you allso.

ballardw · Posted 09-17-2018 07:24 PM

It helps to also show the expected output from you example data with your rules.

I am not quite sure why you have year in one set but not the other. One seldom expects prices to stay the same across time.

ChrisNZ · Posted 09-17-2018 10:37 PM

Like this?

proc sql ;
    select a.*                
         ,count(distinct p1.PRODUCT) as PRODUCT_OVERLAP
         ,count(distinct s1.STORE )  as STORE_OVERLAP
    from HAVE1 a 
    left  join (select unique COUNTRY ,PRODUCT from HAVE2) p1 on a.COUNTRY1 = p1.COUNTRY
    inner join (select unique COUNTRY ,PRODUCT from HAVE2) p2 on a.COUNTRY2 = p2.COUNTRY & p1.PRODUCT=p2.PRODUCT
    left  join (select unique COUNTRY ,STORE   from HAVE2) s1 on a.COUNTRY1 = s1.COUNTRY 
    inner join (select unique COUNTRY ,STORE   from HAVE2) s2 on a.COUNTRY2 = s2.COUNTRY & s1.STORE=s2.STORE
   group by COUNTRY1, YEAR, COUNTRY2;
quit;

COUNTRY1	YEAR	COUNTRY2	PRODUCT_OVERLAP	STORE_OVERLAP
1	2000	2	1	2
1	2000	101	1	1
1	2001	103	1	1
2	2000	201	1	1
2	2001	203	1	1
3	2000	301	1	1
3	2001	203	2	2
3	2001	303	1	1

High-Performance SAS Coding - Third Edition

ChrisNZ · Posted 09-17-2018 10:43 PM

If you want to keep all the original rows.

proc sql ;
    select r.*                
         ,count(distinct p1.PRODUCT) as PRODUCT_OVERLAP
         ,count(distinct s1.STORE )  as STORE_OVERLAP
    from HAVE1 a 
    left  join (select unique COUNTRY ,PRODUCT from HAVE2) p1 on a.COUNTRY1 = p1.COUNTRY
    inner join (select unique COUNTRY ,PRODUCT from HAVE2) p2 on a.COUNTRY2 = p2.COUNTRY & p1.PRODUCT=p2.PRODUCT
    left  join (select unique COUNTRY ,STORE   from HAVE2) s1 on a.COUNTRY1 = s1.COUNTRY 
    inner join (select unique COUNTRY ,STORE   from HAVE2) s2 on a.COUNTRY2 = s2.COUNTRY & s1.STORE=s2.STORE
    right join HAVE1 r on a.COUNTRY1=r.COUNTRY1 & a.YEAR=r.YEAR & a.COUNTRY2=r.COUNTRY2 
    group by r.COUNTRY1, r.YEAR, r.COUNTRY2;
quit;

COUNTRY1	YEAR	COUNTRY2	PRODUCT_OVERLAP	STORE_OVERLAP
1	2000	2	1	2
1	2000	101	1	1
1	2001	103	1	1
2	2000	201	1	1
2	2000	202	0	0
2	2001	203	1	1
3	2000	301	1	1
3	2000	302	0	0
3	2001	203	2	2
3	2001	303	1	1

High-Performance SAS Coding - Third Edition

ChrisNZ · Posted 09-17-2018 11:14 PM

Mmm this wasn't quite right either.

I this this nails it:


proc sql ;
    select a.*        
          ,count(distinct p.PRODUCT) as PRODUCT_OVERLAP
          ,count(distinct s.STORE )  as STORE_OVERLAP
    from HAVE1 a 

         left join (select a.*, p1.PRODUCT from HAVE1 a 
                    left  join (select unique COUNTRY ,PRODUCT from HAVE2) p1 on a.COUNTRY1 = p1.COUNTRY
                    inner join (select unique COUNTRY ,PRODUCT from HAVE2) p2 on a.COUNTRY2 = p2.COUNTRY & p1.PRODUCT=p2.PRODUCT) p
         on a.COUNTRY1=p.COUNTRY1 & a.YEAR=p.YEAR & a.COUNTRY2=p.COUNTRY2 

         left join (select a.*, s1.STORE from HAVE1 a 
                    left  join (select unique COUNTRY ,STORE from HAVE2) s1 on a.COUNTRY1 = s1.COUNTRY
                    inner join (select unique COUNTRY ,STORE from HAVE2) s2 on a.COUNTRY2 = s2.COUNTRY & s1.STORE=s2.STORE) s
         on a.COUNTRY1=s.COUNTRY1 & a.YEAR=s.YEAR & a.COUNTRY2=s.COUNTRY2 

    group by a.COUNTRY1, a.YEAR, a.COUNTRY2;
quit;

COUNTRY1	YEAR	COUNTRY2	PRODUCT_OVERLAP	STORE_OVERLAP
1	2000	2	1	2
1	2000	101	1	1
1	2001	103	1	1
2	2000	201	1	1
2	2000	202	1	0
2	2001	203	1	1
3	2000	301	1	1
3	2000	302	0	1
3	2001	203	2	2
3	2001	303	1	1

High-Performance SAS Coding - Third Edition

yanshuai · Posted 09-20-2018 04:47 PM

Hello,

Thank you very much. I think this works very well.

But I also have two issues with it.

1. It gives me a lot of duplicate rows. I am not sure where is going on.

2. Instead of the number of overlap, what if I would like to calculate the ratio of overlap.

i.e. OVERLAP of STORE for COUNTRY1, OVERLAP of STORE for COUNTRY2, and OVERLAP of STORE for both COUNTRY1 AND COUNTRY2

The numerator is still the OVERLAP, but denomenator will be UNIQUE STORE of COUNTRY1, UNIQUE STORE of COUNTRY2, UNIQUE STORE among COUNTRY1 AND COUNTRY2

Thank you very much.

ChrisNZ · Posted 09-20-2018 07:45 PM

1. Duplicates mean that the data is not as described.

Maybe adding the keyword unique solves this issue (hopefully in the subqueries rather than the main query, to reduce volume)?

2. Provide an example please.

High-Performance SAS Coding - Third Edition

yanshuai · Posted 09-27-2018 08:40 PM

Thank you. Sorry I recently have a cold and did not check replies.

So the basic thought is

proc sql ;
    select a.*        
          ,count(distinct p.PRODUCT) as PRODUCT_OVERLAP
          ,count(distinct s.STORE )  as STORE_OVERLAP
          ,count(distinct p.PRODUCT) / count(distinct COUNTRY1'S PRODUCT) as OVERLAP_COUNTRY1
          ,count(distinct p.PRODUCT) / count(distinct COUNTRY2'S PRODUCT) as OVERLAP_COUNTRY2
          ,count(distinct p.PRODUCT) / count(distinct COUNTRY1'S and COUNTRY2'S PRODUCT) as OVERLAP_ALL

Since I am trying to get overlap between them. Suppose now we already have the number of overlap. Now I would like to compute the ratio of overlap compared with COUNTRY1, COUNTRY2, COUNTRY1 and 2.

Using one example here:

count(distinct p.PRODUCT) / count(distinct COUNTRY1'S PRODUCT) as OVERLAP_COUNTRY1

For COUNTRY 001, its PRODUCT OVERLAP with COUNTRY 101 is 1. (Only A1 is overlap)

Its PRODUCT OVERLAP WITH COUNTRY 101 divided by itself unique PRODUCT = 1/3. (A1 / (A1 + A2 +A3))

Its PRODUCT OVERLAP WITH COUNTRY 101 divided by COUNTRY 101's unique PRODUCT = 1/1. (A1 / A1)

Its PRODUCT OVERLAP WITH COUNTRY 101 divided by COUNTRY 001 and 101's unique PRODUCT = 1/3. (A1 / A1 +A2 +A3) (Since 001 and 101 's total PRODUCT is A1,A1,A2,A3, but we only care unique value, so duplicated A1 is only counted once)

Thank you.

Keep warm.

How to COUNT distinct value of two colums

Re: How to COUNT distinct value of two colums

Re: How to COUNT distinct value of two colums

Re: How to COUNT distinct value of two colums

Re: How to COUNT distinct value of two colums

Re: How to COUNT distinct value of two colums

Re: How to COUNT distinct value of two colums

Re: How to COUNT distinct value of two colums

Ready to join fellow brilliant minds for the SAS Hackathon?

Classroom Training Available!