DATA Step, Macro, Functions and more

PROC SQL having statement with count

Accepted Solution Solved
Reply
Contributor
Posts: 54
Accepted Solution

PROC SQL having statement with count

Hi All,

 

 

Could someone explain how to perform a count on a inner join  to meet a specific condition ?

 

My understanding about using  HAVING  statement when doing a merge (for instance) is that it evaluates the condition or expression on the output of the proc sql. In other words on the ouput result of my merge .

 

 

So I am performing a inner join merge and I would like to keep only ID having at least 3 observations with  difference > 10.   For simplicity I will use only one ID.

 


data a;
input  id $  var1 ;
cards;
1 5
1 12
1 8
1 20
;
run;



data b;
input  id $  var2 ;
cards;
1 15  
1 1
1 4
1 2
;
run;


proc sql;
create table case1 as
select a.id,a.var1,var1-var2 as diff
from a as a inner join b as b
on a.id=b.id
having( diff gt 10);
quit;

 I need to add a count in proc sql. So I have done this :

 

 

 

proc sql;
create table case2 as
select a.id,count(*)
from a as a inner join b as b
on a.id=b.id
group by a.id
having( count(var1-var2 gt 10) gt 3 );
quit;

But the issue is that count is eqaul to 10  while it should be equal to 4 ! How should I do to have a row with a count value of  4 ?

 

 

Any clue ?

 

Thanks

 

saskap


Accepted Solutions
Solution
‎03-07-2016 02:10 PM
Trusted Advisor
Posts: 1,117

Re: PROC SQL having statement with count

Hi @saskapa,

 

Perhaps what you want is something like this:

 

proc sql;
create table case2 as
select a.id, sum(var1-var2 gt 10) as cnt
from a join b
on a.id=b.id
group by a.id
having cnt gt 3;
quit;

 

You wrote "I would like to keep only ID having at least 3 observations with  difference > 10." It is not quite clear in which table you would count these 3 observations and why you increased this lower bound to 4 later on.

 

When you perform a 4-to-4 join ("many-to-many join") as with your sample datasets, PROC SQL will internally create a temporary table with 4*4=16 observations (all combinations of one obs. from dataset A and one from dataset B). All of these observations satisfy the ON condition and fall into the group defined by a.id=1. Therefore, counts of observations will refer to this table (and hence can range from 0 to 16).

 

The expression

count(var1-var2 gt 10)

in your HAVING clause counts observations for which the Boolean expression var1-var2 gt 10 has a non-missing value (see documentation). Regardless of the (numeric, possibly missing) values of VAR1 and VAR2, this expresssion is 0 or 1, but never missing. So, the count equals 16. However, you want to count the observations where the Boolean expression is 1, i.e. where var1-var2 gt 10 is true. You can achieve this with the SUM function (see my suggested code above), which adds the 16 (0 or 1)-values of the Boolean expression.

 

By selecting a.id and the desired frequency count (named CNT) from each group (defined by a.id) you get one observation per group.

 

Finally, you need the HAVING clause (and not a WHERE clause) to specify a selection criterion that involves the summary function SUM (through CNT).

View solution in original post


All Replies
Super User
Posts: 5,430

Re: PROC SQL having statement with count

Having is for filtering on aggregate values, it has nothing to do with merge/join operations.

If you add count() just to make use of having, your are out of bounds.

Try to use calculated in the where instead, it's a filter on row level you want, no?

Data never sleeps
Contributor
Posts: 54

Re: PROC SQL having statement with count

Hi LinusH !

 

Thanks yes it s a filtering on the rows. I guess could probably use a where statment for diff > 10 and then using a having (count).

 

Howver If making use of a count() with HAVING I'll be out of bounds .I do not undertand why here I can use a count () with HAVING  ?  :

 

 


data c;
input  id $  var2 ;
cards;
1 15  
2 1
2 3
3 2
;
run;
proc sql;
select id
from c
group by id
having (count( distinct var2) gt 1);
quit;

 

 

 

Super User
Super User
Posts: 7,963

Re: PROC SQL having statement with count

Could you post example output required from your first two test datasets, its not clear to me which one needs to have four records, and what is being diff'd.  It should be straightforward enough to update:

data a;
  input  id $  var1 ;
cards;
1 5
1 12
1 8
1 20
;
run;
data b;
  input  id $  var2 ;
cards;
1 15  
1 1
1 4
1 2
;
run;
proc sql;
  create table WANT as
  select  A.ID,
          A.VAR1,
          B.VAR_COUNT
  from    A A
  left join (select ID,count(*) as VAR_COUNT from B group by ID) B
  on      A.ID=B.ID;
quit;
Contributor
Posts: 54

Re: PROC SQL having statement with count

Hi Rw9,

 

Basically, I need to compute the difference between all the rows of the  data set a and  dataset b. In order to do so I use a join to do so, I 'll have  a a table of 4*4 =16 rows. I compute the difference between var1 and var2 .

 

What I wanted  is to run a query that returns me counts of observations for which the difference is gt than 10.  In my example,  we know that there are 4 observetion for which the difference between var1 and var2 is gt 10.

 

the solution provided by FreelanceReinhard works fine :

 


proc sql;
create table case2 as
select a.id, sum(var1-var2 gt 10) as cnt
from a inner join b
on a.id=b.id
group by a.id
having cnt gt 3;
quit;

Cheers

 

saskap

 

 

 

Solution
‎03-07-2016 02:10 PM
Trusted Advisor
Posts: 1,117

Re: PROC SQL having statement with count

Hi @saskapa,

 

Perhaps what you want is something like this:

 

proc sql;
create table case2 as
select a.id, sum(var1-var2 gt 10) as cnt
from a join b
on a.id=b.id
group by a.id
having cnt gt 3;
quit;

 

You wrote "I would like to keep only ID having at least 3 observations with  difference > 10." It is not quite clear in which table you would count these 3 observations and why you increased this lower bound to 4 later on.

 

When you perform a 4-to-4 join ("many-to-many join") as with your sample datasets, PROC SQL will internally create a temporary table with 4*4=16 observations (all combinations of one obs. from dataset A and one from dataset B). All of these observations satisfy the ON condition and fall into the group defined by a.id=1. Therefore, counts of observations will refer to this table (and hence can range from 0 to 16).

 

The expression

count(var1-var2 gt 10)

in your HAVING clause counts observations for which the Boolean expression var1-var2 gt 10 has a non-missing value (see documentation). Regardless of the (numeric, possibly missing) values of VAR1 and VAR2, this expresssion is 0 or 1, but never missing. So, the count equals 16. However, you want to count the observations where the Boolean expression is 1, i.e. where var1-var2 gt 10 is true. You can achieve this with the SUM function (see my suggested code above), which adds the 16 (0 or 1)-values of the Boolean expression.

 

By selecting a.id and the desired frequency count (named CNT) from each group (defined by a.id) you get one observation per group.

 

Finally, you need the HAVING clause (and not a WHERE clause) to specify a selection criterion that involves the summary function SUM (through CNT).

Contributor
Posts: 54

Re: PROC SQL having statement with count

Posted in reply to FreelanceReinhard

Thanks @FreelanceReinhard!

 

It does exactly what I want.

 

Note that the following method seems to work as well :

 

 

proc sql;
 create table case2  as
 select  a.id
 from a as a inner join b as b
 on a.id=b.id
 where var1-var2  gt 10
 group by a.id
 having( count(*) gt  3 );
quit;

I use a where clause, which select only the match that meets the conditin var1-var2 >10.

 

Cheers

 

saskap

 

 

Super User
Posts: 5,430

Re: PROC SQL having statement with count

You read me wrong.
Having is to filter with aggregate values (such as count ()).
But I understood your requirement that wished to filter on a value calculated on row level. Then count/having is inappropriate.
Data never sleeps
☑ This topic is solved.

Need further help from the community? Please ask a new question.

Discussion stats
  • 7 replies
  • 2656 views
  • 2 likes
  • 4 in conversation