Solved: Re: PROC SQL having statement with count

saskapa · Posted 03-07-2016 08:05 AM

Hi All,

Could someone explain how to perform a count on a inner join to meet a specific condition ?

My understanding about using HAVING statement when doing a merge (for instance) is that it evaluates the condition or expression on the output of the proc sql. In other words on the ouput result of my merge .

So I am performing a inner join merge and I would like to keep only ID having at least 3 observations with difference > 10. For simplicity I will use only one ID.


data a;
input  id $  var1 ;
cards;
1 5
1 12
1 8
1 20
;
run;



data b;
input  id $  var2 ;
cards;
1 15  
1 1
1 4
1 2
;
run;


proc sql;
 create table case1 as
 select a.id,a.var1,var1-var2 as diff
 from a as a inner join b as b
 on a.id=b.id
 having( diff gt 10);
quit;

I need to add a count in proc sql. So I have done this :

proc sql;
 create table case2 as
 select a.id,count(*)
 from a as a inner join b as b
 on a.id=b.id
 group by a.id
 having( count(var1-var2 gt 10) gt 3 );
quit;

But the issue is that count is eqaul to 10 while it should be equal to 4 ! How should I do to have a row with a count value of 4 ?

Any clue ?

Thanks

saskap

FreelanceReinh · Posted 03-07-2016 10:56 AM

Hi @saskapa,

Perhaps what you want is something like this:

proc sql;
create table case2 as
select a.id, sum(var1-var2 gt 10) as cnt
from a join b
on a.id=b.id
group by a.id
having cnt gt 3;
quit;

You wrote "I would like to keep only ID having at least 3 observations with difference > 10." It is not quite clear in which table you would count these 3 observations and why you increased this lower bound to 4 later on.

When you perform a 4-to-4 join ("many-to-many join") as with your sample datasets, PROC SQL will internally create a temporary table with 4*4=16 observations (all combinations of one obs. from dataset A and one from dataset B). All of these observations satisfy the ON condition and fall into the group defined by a.id=1. Therefore, counts of observations will refer to this table (and hence can range from 0 to 16).

The expression

count(var1-var2 gt 10)

in your HAVING clause counts observations for which the Boolean expression var1-var2 gt 10 has a non-missing value (see documentation). Regardless of the (numeric, possibly missing) values of VAR1 and VAR2, this expresssion is 0 or 1, but never missing. So, the count equals 16. However, you want to count the observations where the Boolean expression is 1, i.e. where var1-var2 gt 10 is true. You can achieve this with the SUM function (see my suggested code above), which adds the 16 (0 or 1)-values of the Boolean expression.

By selecting a.id and the desired frequency count (named CNT) from each group (defined by a.id) you get one observation per group.

Finally, you need the HAVING clause (and not a WHERE clause) to specify a selection criterion that involves the summary function SUM (through CNT).

View solution in original post

LinusH · Posted 03-07-2016 08:28 AM

Having is for filtering on aggregate values, it has nothing to do with merge/join operations.

If you add count() just to make use of having, your are out of bounds.

Try to use calculated in the where instead, it's a filter on row level you want, no?

Data never sleeps

saskapa · Posted 03-07-2016 08:52 AM

Hi LinusH !

Thanks yes it s a filtering on the rows. I guess could probably use a where statment for diff > 10 and then using a having (count).

Howver If making use of a count() with HAVING I'll be out of bounds .I do not undertand why here I can use a count () with HAVING ? :


data c;
input  id $  var2 ;
cards;
1 15  
2 1
2 3
3 2
;
run;
proc sql;
select id
from c
group by id
having (count( distinct var2) gt 1);
quit;

RW9 · Posted 03-07-2016 09:26 AM

Could you post example output required from your first two test datasets, its not clear to me which one needs to have four records, and what is being diff'd. It should be straightforward enough to update:

data a;
  input  id $  var1 ;
cards;
1 5
1 12
1 8
1 20
;
run;
data b;
  input  id $  var2 ;
cards;
1 15  
1 1
1 4
1 2
;
run;
proc sql;
  create table WANT as
  select  A.ID,
          A.VAR1,
          B.VAR_COUNT
  from    A A
  left join (select ID,count(*) as VAR_COUNT from B group by ID) B
  on      A.ID=B.ID;
quit;

saskapa · Posted 03-07-2016 02:10 PM

Hi Rw9,

Basically, I need to compute the difference between all the rows of the data set a and dataset b. In order to do so I use a join to do so, I 'll have a a table of 4*4 =16 rows. I compute the difference between var1 and var2 .

What I wanted is to run a query that returns me counts of observations for which the difference is gt than 10. In my example, we know that there are 4 observetion for which the difference between var1 and var2 is gt 10.

the solution provided by FreelanceReinhard works fine :


proc sql;
create table case2 as
select a.id, sum(var1-var2 gt 10) as cnt
from a inner join b
on a.id=b.id
group by a.id
having cnt gt 3;
quit;

Cheers

saskap

FreelanceReinh · Posted 03-07-2016 10:56 AM

Hi @saskapa,

Perhaps what you want is something like this:

proc sql;
create table case2 as
select a.id, sum(var1-var2 gt 10) as cnt
from a join b
on a.id=b.id
group by a.id
having cnt gt 3;
quit;

You wrote "I would like to keep only ID having at least 3 observations with difference > 10." It is not quite clear in which table you would count these 3 observations and why you increased this lower bound to 4 later on.

When you perform a 4-to-4 join ("many-to-many join") as with your sample datasets, PROC SQL will internally create a temporary table with 4*4=16 observations (all combinations of one obs. from dataset A and one from dataset B). All of these observations satisfy the ON condition and fall into the group defined by a.id=1. Therefore, counts of observations will refer to this table (and hence can range from 0 to 16).

The expression

count(var1-var2 gt 10)

in your HAVING clause counts observations for which the Boolean expression var1-var2 gt 10 has a non-missing value (see documentation). Regardless of the (numeric, possibly missing) values of VAR1 and VAR2, this expresssion is 0 or 1, but never missing. So, the count equals 16. However, you want to count the observations where the Boolean expression is 1, i.e. where var1-var2 gt 10 is true. You can achieve this with the SUM function (see my suggested code above), which adds the 16 (0 or 1)-values of the Boolean expression.

By selecting a.id and the desired frequency count (named CNT) from each group (defined by a.id) you get one observation per group.

Finally, you need the HAVING clause (and not a WHERE clause) to specify a selection criterion that involves the summary function SUM (through CNT).

saskapa · Posted 03-07-2016 02:27 PM

Thanks @FreelanceReinh!

It does exactly what I want.

Note that the following method seems to work as well :

proc sql;
 create table case2  as
 select  a.id
 from a as a inner join b as b
 on a.id=b.id
 where var1-var2  gt 10
 group by a.id
 having( count(*) gt  3 );
quit;

I use a where clause, which select only the match that meets the conditin var1-var2 >10.

Cheers

saskap

LinusH · Posted 03-07-2016 11:55 AM

You read me wrong.
Having is to filter with aggregate values (such as count ()).
But I understood your requirement that wished to filter on a value calculated on row level. Then count/having is inappropriate.

Data never sleeps

Classroom Training Available!