BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
saskapa
Quartz | Level 8

Hi All,

 

 

Could someone explain how to perform a count on a inner join  to meet a specific condition ?

 

My understanding about using  HAVING  statement when doing a merge (for instance) is that it evaluates the condition or expression on the output of the proc sql. In other words on the ouput result of my merge .

 

 

So I am performing a inner join merge and I would like to keep only ID having at least 3 observations with  difference > 10.   For simplicity I will use only one ID.

 


data a;
input  id $  var1 ;
cards;
1 5
1 12
1 8
1 20
;
run;



data b;
input  id $  var2 ;
cards;
1 15  
1 1
1 4
1 2
;
run;


proc sql;
create table case1 as
select a.id,a.var1,var1-var2 as diff
from a as a inner join b as b
on a.id=b.id
having( diff gt 10);
quit;

 I need to add a count in proc sql. So I have done this :

 

 

 

proc sql;
create table case2 as
select a.id,count(*)
from a as a inner join b as b
on a.id=b.id
group by a.id
having( count(var1-var2 gt 10) gt 3 );
quit;

But the issue is that count is eqaul to 10  while it should be equal to 4 ! How should I do to have a row with a count value of  4 ?

 

 

Any clue ?

 

Thanks

 

saskap

1 ACCEPTED SOLUTION

Accepted Solutions
FreelanceReinh
Jade | Level 19

Hi @saskapa,

 

Perhaps what you want is something like this:

 

proc sql;
create table case2 as
select a.id, sum(var1-var2 gt 10) as cnt
from a join b
on a.id=b.id
group by a.id
having cnt gt 3;
quit;

 

You wrote "I would like to keep only ID having at least 3 observations with  difference > 10." It is not quite clear in which table you would count these 3 observations and why you increased this lower bound to 4 later on.

 

When you perform a 4-to-4 join ("many-to-many join") as with your sample datasets, PROC SQL will internally create a temporary table with 4*4=16 observations (all combinations of one obs. from dataset A and one from dataset B). All of these observations satisfy the ON condition and fall into the group defined by a.id=1. Therefore, counts of observations will refer to this table (and hence can range from 0 to 16).

 

The expression

count(var1-var2 gt 10)

in your HAVING clause counts observations for which the Boolean expression var1-var2 gt 10 has a non-missing value (see documentation). Regardless of the (numeric, possibly missing) values of VAR1 and VAR2, this expresssion is 0 or 1, but never missing. So, the count equals 16. However, you want to count the observations where the Boolean expression is 1, i.e. where var1-var2 gt 10 is true. You can achieve this with the SUM function (see my suggested code above), which adds the 16 (0 or 1)-values of the Boolean expression.

 

By selecting a.id and the desired frequency count (named CNT) from each group (defined by a.id) you get one observation per group.

 

Finally, you need the HAVING clause (and not a WHERE clause) to specify a selection criterion that involves the summary function SUM (through CNT).

View solution in original post

7 REPLIES 7
LinusH
Tourmaline | Level 20

Having is for filtering on aggregate values, it has nothing to do with merge/join operations.

If you add count() just to make use of having, your are out of bounds.

Try to use calculated in the where instead, it's a filter on row level you want, no?

Data never sleeps
saskapa
Quartz | Level 8

Hi LinusH !

 

Thanks yes it s a filtering on the rows. I guess could probably use a where statment for diff > 10 and then using a having (count).

 

Howver If making use of a count() with HAVING I'll be out of bounds .I do not undertand why here I can use a count () with HAVING  ?  :

 

 


data c;
input  id $  var2 ;
cards;
1 15  
2 1
2 3
3 2
;
run;
proc sql;
select id
from c
group by id
having (count( distinct var2) gt 1);
quit;

 

 

 

RW9
Diamond | Level 26 RW9
Diamond | Level 26

Could you post example output required from your first two test datasets, its not clear to me which one needs to have four records, and what is being diff'd.  It should be straightforward enough to update:

data a;
  input  id $  var1 ;
cards;
1 5
1 12
1 8
1 20
;
run;
data b;
  input  id $  var2 ;
cards;
1 15  
1 1
1 4
1 2
;
run;
proc sql;
  create table WANT as
  select  A.ID,
          A.VAR1,
          B.VAR_COUNT
  from    A A
  left join (select ID,count(*) as VAR_COUNT from B group by ID) B
  on      A.ID=B.ID;
quit;
saskapa
Quartz | Level 8

Hi Rw9,

 

Basically, I need to compute the difference between all the rows of the  data set a and  dataset b. In order to do so I use a join to do so, I 'll have  a a table of 4*4 =16 rows. I compute the difference between var1 and var2 .

 

What I wanted  is to run a query that returns me counts of observations for which the difference is gt than 10.  In my example,  we know that there are 4 observetion for which the difference between var1 and var2 is gt 10.

 

the solution provided by FreelanceReinhard works fine :

 


proc sql;
create table case2 as
select a.id, sum(var1-var2 gt 10) as cnt
from a inner join b
on a.id=b.id
group by a.id
having cnt gt 3;
quit;

Cheers

 

saskap

 

 

 

FreelanceReinh
Jade | Level 19

Hi @saskapa,

 

Perhaps what you want is something like this:

 

proc sql;
create table case2 as
select a.id, sum(var1-var2 gt 10) as cnt
from a join b
on a.id=b.id
group by a.id
having cnt gt 3;
quit;

 

You wrote "I would like to keep only ID having at least 3 observations with  difference > 10." It is not quite clear in which table you would count these 3 observations and why you increased this lower bound to 4 later on.

 

When you perform a 4-to-4 join ("many-to-many join") as with your sample datasets, PROC SQL will internally create a temporary table with 4*4=16 observations (all combinations of one obs. from dataset A and one from dataset B). All of these observations satisfy the ON condition and fall into the group defined by a.id=1. Therefore, counts of observations will refer to this table (and hence can range from 0 to 16).

 

The expression

count(var1-var2 gt 10)

in your HAVING clause counts observations for which the Boolean expression var1-var2 gt 10 has a non-missing value (see documentation). Regardless of the (numeric, possibly missing) values of VAR1 and VAR2, this expresssion is 0 or 1, but never missing. So, the count equals 16. However, you want to count the observations where the Boolean expression is 1, i.e. where var1-var2 gt 10 is true. You can achieve this with the SUM function (see my suggested code above), which adds the 16 (0 or 1)-values of the Boolean expression.

 

By selecting a.id and the desired frequency count (named CNT) from each group (defined by a.id) you get one observation per group.

 

Finally, you need the HAVING clause (and not a WHERE clause) to specify a selection criterion that involves the summary function SUM (through CNT).

saskapa
Quartz | Level 8

Thanks @FreelanceReinh!

 

It does exactly what I want.

 

Note that the following method seems to work as well :

 

 

proc sql;
 create table case2  as
 select  a.id
 from a as a inner join b as b
 on a.id=b.id
 where var1-var2  gt 10
 group by a.id
 having( count(*) gt  3 );
quit;

I use a where clause, which select only the match that meets the conditin var1-var2 >10.

 

Cheers

 

saskap

 

 

LinusH
Tourmaline | Level 20
You read me wrong.
Having is to filter with aggregate values (such as count ()).
But I understood your requirement that wished to filter on a value calculated on row level. Then count/having is inappropriate.
Data never sleeps

sas-innovate-2024.png

Don't miss out on SAS Innovate - Register now for the FREE Livestream!

Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 7 replies
  • 15008 views
  • 3 likes
  • 4 in conversation