BookmarkSubscribeRSS Feed
yanshuai
Quartz | Level 8

Hello,

 

I am exploring how to use DISTINCT in SQL SUM(a=b)

Here is what I have

data have1 ;
input
COUNTRY1 YEAR COUNTRY2 $5.;
datalines;
001  2000  101
001  2000  002
001  2001  103
002  2000  201
002  2000  202
002  2001  203
003  2001  203
003  2000  301
003  2000  302
003  2001  303
;
run;

data have2 ;
input
COUNTRY PRODUCT STORE PRICE $5.;
datalines;
001  A1  B1  2
001  A2  B2  2
001  A3  B3  3
002  A3  B1  4
002  A4  B2  1
002  A5  B2  2
003  A5  B3  1
003  A6  B1  2
003  A7  B2  3
003  A8  B2  1
101  A1  B3  2
103  A3  B1  3
201  A3  B2  2
202  A4  B3  1
203  A5  B2  2
203  A8  B3  1
301  A6  B1  1
302  A4  B2  3
303  A6  B3  2
;
run;

Thanks to your help. I have figured out the code.

Here is what I am using

proc sql ;
  create table want as 
    select a.*,
		sum(b.product=c.product) as product_overlap,
		SUM(b.STORE = c.STORE) as store_overlap,
SUM(b.PRICE = c.PRICE) as price_overlap from have1 a left join have2 b on a.COUNTRY1=b.COUNTRY left join have2 c on a.COUNTRY2=c.COUNTRY group by COUNTRY1, YEAR, COUNTRY2; quit;

But since there could be duplicate STORE, PRICE, PRODUCT. How can I only use the distinct values in those variables?

For example, when I want to compute PRODUCT_OVERLAP, I only select the DISTINCT PRODUCT of country1 and country2 and count their overlap.

when I want to compute STORE_OVERLAP, I only select the DISTINCT STORE of country1 and country2 and count their overlap.

 

I tried to

SUM(DISTINCT b.STORE = c.STORE) as store_overlap

But I think it is not right.

 

Thank you allso.

7 REPLIES 7
ballardw
Super User

It helps to also show the expected output from you example data with your rules.

 

I am not quite sure why you have year in one set but not the other. One seldom expects prices to stay the same across time.

ChrisNZ
Tourmaline | Level 20

Like this?

proc sql ;
    select a.*                
         ,count(distinct p1.PRODUCT) as PRODUCT_OVERLAP
         ,count(distinct s1.STORE )  as STORE_OVERLAP
    from HAVE1 a 
    left  join (select unique COUNTRY ,PRODUCT from HAVE2) p1 on a.COUNTRY1 = p1.COUNTRY
    inner join (select unique COUNTRY ,PRODUCT from HAVE2) p2 on a.COUNTRY2 = p2.COUNTRY & p1.PRODUCT=p2.PRODUCT
    left  join (select unique COUNTRY ,STORE   from HAVE2) s1 on a.COUNTRY1 = s1.COUNTRY 
    inner join (select unique COUNTRY ,STORE   from HAVE2) s2 on a.COUNTRY2 = s2.COUNTRY & s1.STORE=s2.STORE
   group by COUNTRY1, YEAR, COUNTRY2;
quit;
COUNTRY1 YEAR COUNTRY2 PRODUCT_OVERLAP STORE_OVERLAP
1 2000 2 1 2
1 2000 101 1 1
1 2001 103 1 1
2 2000 201 1 1
2 2001 203 1 1
3 2000 301 1 1
3 2001 203 2 2
3 2001 303 1 1

 

ChrisNZ
Tourmaline | Level 20

If you want to keep all the original rows.

proc sql ;
    select r.*                
         ,count(distinct p1.PRODUCT) as PRODUCT_OVERLAP
         ,count(distinct s1.STORE )  as STORE_OVERLAP
    from HAVE1 a 
    left  join (select unique COUNTRY ,PRODUCT from HAVE2) p1 on a.COUNTRY1 = p1.COUNTRY
    inner join (select unique COUNTRY ,PRODUCT from HAVE2) p2 on a.COUNTRY2 = p2.COUNTRY & p1.PRODUCT=p2.PRODUCT
    left  join (select unique COUNTRY ,STORE   from HAVE2) s1 on a.COUNTRY1 = s1.COUNTRY 
    inner join (select unique COUNTRY ,STORE   from HAVE2) s2 on a.COUNTRY2 = s2.COUNTRY & s1.STORE=s2.STORE
    right join HAVE1 r on a.COUNTRY1=r.COUNTRY1 & a.YEAR=r.YEAR & a.COUNTRY2=r.COUNTRY2 
    group by r.COUNTRY1, r.YEAR, r.COUNTRY2;
quit;
COUNTRY1 YEAR COUNTRY2 PRODUCT_OVERLAP STORE_OVERLAP
1 2000 2 1 2
1 2000 101 1 1
1 2001 103 1 1
2 2000 201 1 1
2 2000 202 0 0
2 2001 203 1 1
3 2000 301 1 1
3 2000 302 0 0
3 2001 203 2 2
3 2001 303 1 1

 

 

ChrisNZ
Tourmaline | Level 20

Mmm this wasn't quite right either.

I this this nails it:


proc sql ;
    select a.*        
          ,count(distinct p.PRODUCT) as PRODUCT_OVERLAP
          ,count(distinct s.STORE )  as STORE_OVERLAP
    from HAVE1 a 

         left join (select a.*, p1.PRODUCT from HAVE1 a 
                    left  join (select unique COUNTRY ,PRODUCT from HAVE2) p1 on a.COUNTRY1 = p1.COUNTRY
                    inner join (select unique COUNTRY ,PRODUCT from HAVE2) p2 on a.COUNTRY2 = p2.COUNTRY & p1.PRODUCT=p2.PRODUCT) p
         on a.COUNTRY1=p.COUNTRY1 & a.YEAR=p.YEAR & a.COUNTRY2=p.COUNTRY2 

         left join (select a.*, s1.STORE from HAVE1 a 
                    left  join (select unique COUNTRY ,STORE from HAVE2) s1 on a.COUNTRY1 = s1.COUNTRY
                    inner join (select unique COUNTRY ,STORE from HAVE2) s2 on a.COUNTRY2 = s2.COUNTRY & s1.STORE=s2.STORE) s
         on a.COUNTRY1=s.COUNTRY1 & a.YEAR=s.YEAR & a.COUNTRY2=s.COUNTRY2 

    group by a.COUNTRY1, a.YEAR, a.COUNTRY2;
quit;

 

COUNTRY1 YEAR COUNTRY2 PRODUCT_OVERLAP STORE_OVERLAP
1 2000 2 1 2
1 2000 101 1 1
1 2001 103 1 1
2 2000 201 1 1
2 2000 202 1 0
2 2001 203 1 1
3 2000 301 1 1
3 2000 302 0 1
3 2001 203 2 2
3 2001 303 1 1

 

 

 

yanshuai
Quartz | Level 8

Hello,

 

Thank you very much. I think this works very well.

 

But I also have two issues with it.

1. It gives me a lot of duplicate rows. I am not sure where is going on.

2. Instead of the number of overlap, what if I would like to calculate the ratio of overlap.

i.e. OVERLAP of STORE for COUNTRY1, OVERLAP of STORE for COUNTRY2, and OVERLAP of STORE for both COUNTRY1 AND COUNTRY2

The numerator is still the OVERLAP, but denomenator will be UNIQUE STORE of COUNTRY1, UNIQUE STORE of COUNTRY2, UNIQUE STORE among COUNTRY1 AND COUNTRY2

 

Thank you very much.

ChrisNZ
Tourmaline | Level 20

1. Duplicates mean that the data is not as described.

Maybe adding the keyword unique solves this issue (hopefully in the subqueries rather than the main query, to reduce volume)?

 

 

2. Provide an example please.

 

yanshuai
Quartz | Level 8

Thank you. Sorry I recently have a cold and did not check replies.

 

So the basic thought is

proc sql ;
    select a.*        
          ,count(distinct p.PRODUCT) as PRODUCT_OVERLAP
          ,count(distinct s.STORE )  as STORE_OVERLAP
          ,count(distinct p.PRODUCT) / count(distinct COUNTRY1'S PRODUCT) as OVERLAP_COUNTRY1
          ,count(distinct p.PRODUCT) / count(distinct COUNTRY2'S PRODUCT) as OVERLAP_COUNTRY2
          ,count(distinct p.PRODUCT) / count(distinct COUNTRY1'S and COUNTRY2'S PRODUCT) as OVERLAP_ALL

Since I am trying to get overlap between them. Suppose now we already have the number of overlap. Now I would like to compute the ratio of overlap compared with COUNTRY1, COUNTRY2, COUNTRY1 and 2.

Using one example here:

count(distinct p.PRODUCT) / count(distinct COUNTRY1'S PRODUCT) as OVERLAP_COUNTRY1

For COUNTRY 001, its PRODUCT OVERLAP with COUNTRY 101 is 1. (Only A1 is overlap)

Its PRODUCT OVERLAP WITH COUNTRY 101 divided by itself unique PRODUCT = 1/3. (A1 / (A1 + A2 +A3))

Its PRODUCT OVERLAP WITH COUNTRY 101 divided by COUNTRY 101's unique PRODUCT = 1/1. (A1 / A1)

Its PRODUCT OVERLAP WITH COUNTRY 101 divided by COUNTRY 001 and 101's unique PRODUCT = 1/3. (A1 / A1 +A2 +A3) (Since 001 and 101 's total PRODUCT is A1,A1,A2,A3, but we only care unique value, so duplicated A1 is only counted once)

 

Thank you.

 

Keep warm.

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 7 replies
  • 953 views
  • 0 likes
  • 3 in conversation