- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hi All,
Could someone explain how to perform a count on a inner join to meet a specific condition ?
My understanding about using HAVING statement when doing a merge (for instance) is that it evaluates the condition or expression on the output of the proc sql. In other words on the ouput result of my merge .
So I am performing a inner join merge and I would like to keep only ID having at least 3 observations with difference > 10. For simplicity I will use only one ID.
data a;
input id $ var1 ;
cards;
1 5
1 12
1 8
1 20
;
run;
data b;
input id $ var2 ;
cards;
1 15
1 1
1 4
1 2
;
run;
proc sql; create table case1 as select a.id,a.var1,var1-var2 as diff from a as a inner join b as b on a.id=b.id having( diff gt 10);quit;
I need to add a count in proc sql. So I have done this :
proc sql;
create table case2 as
select a.id,count(*)
from a as a inner join b as b
on a.id=b.id
group by a.id
having( count(var1-var2 gt 10) gt 3 );
quit;
But the issue is that count is eqaul to 10 while it should be equal to 4 ! How should I do to have a row with a count value of 4 ?
Any clue ?
Thanks
saskap
Accepted Solutions
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hi @saskapa,
Perhaps what you want is something like this:
proc sql;
create table case2 as
select a.id, sum(var1-var2 gt 10) as cnt
from a join b
on a.id=b.id
group by a.id
having cnt gt 3;
quit;
You wrote "I would like to keep only ID having at least 3 observations with difference > 10." It is not quite clear in which table you would count these 3 observations and why you increased this lower bound to 4 later on.
When you perform a 4-to-4 join ("many-to-many join") as with your sample datasets, PROC SQL will internally create a temporary table with 4*4=16 observations (all combinations of one obs. from dataset A and one from dataset B). All of these observations satisfy the ON condition and fall into the group defined by a.id=1. Therefore, counts of observations will refer to this table (and hence can range from 0 to 16).
The expression
count(var1-var2 gt 10)
in your HAVING clause counts observations for which the Boolean expression var1-var2 gt 10 has a non-missing value (see documentation). Regardless of the (numeric, possibly missing) values of VAR1 and VAR2, this expresssion is 0 or 1, but never missing. So, the count equals 16. However, you want to count the observations where the Boolean expression is 1, i.e. where var1-var2 gt 10 is true. You can achieve this with the SUM function (see my suggested code above), which adds the 16 (0 or 1)-values of the Boolean expression.
By selecting a.id and the desired frequency count (named CNT) from each group (defined by a.id) you get one observation per group.
Finally, you need the HAVING clause (and not a WHERE clause) to specify a selection criterion that involves the summary function SUM (through CNT).
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Having is for filtering on aggregate values, it has nothing to do with merge/join operations.
If you add count() just to make use of having, your are out of bounds.
Try to use calculated in the where instead, it's a filter on row level you want, no?
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hi LinusH !
Thanks yes it s a filtering on the rows. I guess could probably use a where statment for diff > 10 and then using a having (count).
Howver If making use of a count() with HAVING I'll be out of bounds .I do not undertand why here I can use a count () with HAVING ? :
data c;
input id $ var2 ;
cards;
1 15
2 1
2 3
3 2
;
run;
proc sql;
select id
from c
group by id
having (count( distinct var2) gt 1);
quit;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Could you post example output required from your first two test datasets, its not clear to me which one needs to have four records, and what is being diff'd. It should be straightforward enough to update:
data a; input id $ var1 ; cards; 1 5 1 12 1 8 1 20 ; run; data b; input id $ var2 ; cards; 1 15 1 1 1 4 1 2 ; run; proc sql; create table WANT as select A.ID, A.VAR1, B.VAR_COUNT from A A left join (select ID,count(*) as VAR_COUNT from B group by ID) B on A.ID=B.ID; quit;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hi Rw9,
Basically, I need to compute the difference between all the rows of the data set a and dataset b. In order to do so I use a join to do so, I 'll have a a table of 4*4 =16 rows. I compute the difference between var1 and var2 .
What I wanted is to run a query that returns me counts of observations for which the difference is gt than 10. In my example, we know that there are 4 observetion for which the difference between var1 and var2 is gt 10.
the solution provided by FreelanceReinhard works fine :
proc sql;
create table case2 as
select a.id, sum(var1-var2 gt 10) as cnt
from a inner join b
on a.id=b.id
group by a.id
having cnt gt 3;
quit;
Cheers
saskap
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hi @saskapa,
Perhaps what you want is something like this:
proc sql;
create table case2 as
select a.id, sum(var1-var2 gt 10) as cnt
from a join b
on a.id=b.id
group by a.id
having cnt gt 3;
quit;
You wrote "I would like to keep only ID having at least 3 observations with difference > 10." It is not quite clear in which table you would count these 3 observations and why you increased this lower bound to 4 later on.
When you perform a 4-to-4 join ("many-to-many join") as with your sample datasets, PROC SQL will internally create a temporary table with 4*4=16 observations (all combinations of one obs. from dataset A and one from dataset B). All of these observations satisfy the ON condition and fall into the group defined by a.id=1. Therefore, counts of observations will refer to this table (and hence can range from 0 to 16).
The expression
count(var1-var2 gt 10)
in your HAVING clause counts observations for which the Boolean expression var1-var2 gt 10 has a non-missing value (see documentation). Regardless of the (numeric, possibly missing) values of VAR1 and VAR2, this expresssion is 0 or 1, but never missing. So, the count equals 16. However, you want to count the observations where the Boolean expression is 1, i.e. where var1-var2 gt 10 is true. You can achieve this with the SUM function (see my suggested code above), which adds the 16 (0 or 1)-values of the Boolean expression.
By selecting a.id and the desired frequency count (named CNT) from each group (defined by a.id) you get one observation per group.
Finally, you need the HAVING clause (and not a WHERE clause) to specify a selection criterion that involves the summary function SUM (through CNT).
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Thanks @FreelanceReinh!
It does exactly what I want.
Note that the following method seems to work as well :
proc sql;
create table case2 as
select a.id
from a as a inner join b as b
on a.id=b.id
where var1-var2 gt 10
group by a.id
having( count(*) gt 3 );
quit;
I use a where clause, which select only the match that meets the conditin var1-var2 >10.
Cheers
saskap
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Having is to filter with aggregate values (such as count ()).
But I understood your requirement that wished to filter on a value calculated on row level. Then count/having is inappropriate.