DATA Step, Macro, Functions and more

Understanding sql count (*) as X, var1, var2, var3, and purpose of "group by".

Reply
Occasional Contributor
Posts: 16

Understanding sql count (*) as X, var1, var2, var3, and purpose of "group by".

I am trying to understand this code inherited from a coworker who is no longer here:

 

proc sql;
		create table mode_a_1 as
		select count(*) as num_obs, ndcnum, metqty, daysupp
		from collapse_drugs_2 (where=(daysupp^=.))
		group by ndcnum, daysupp, metqty
		order by ndcnum, metqty, num_obs desc;
	quit;

	proc means data=mode_a_1 noprint nway missing;
		class ndcnum num_obs metqty;
		var daysupp;
		output out=mode_a_2 (drop=_:) mean=;
		proc sort; by ndcnum metqty descending num_obs;
	run;

For context, collapse_drugs_2 is a dataset with drug claims. 

 

Questions: 

 

1. For the proc sql step, is this an accurate understanding of what the code is doing? 

 

Create a dataset called "mode_a_1" that is based off of all rows of collapse_drug_2 (where daysupp values are not missing), where we also create a new variable called "num_obs" that counts the number of observations for each unique combination of ndcnum, metqty, daysupp. 

 

Then order by ascending ndcnum, metqty, and then by descending num_obs. 

 

2. If the above is true, why is the "group by" line necessary? Isn't the combination of ndcnum, metqty, daysupp already specified in the select count (*) line? 

 

3. Why is the proc means step necessary? Aren't all daysupp (and therefore mean of daysupp) for each ndcnum, num_obs, metsqty combination the same, because that's how we by definition organized the mode_a_1 dataset? Isn't mode_a_1 the same dataset as mode_a_2? 

 

Thanks, 

Trusted Advisor
Posts: 1,128

Re: Understanding sql count (*) as X, var1, var2, var3, and purpose of "group by".

your understanding of 1 is correct.

 

2) We require the group by to get the count of observations based on the ndcnum, daysupp, metqty. If there is no group then we dont get the count by ndcnum, daysupp, metqty.

 

3) Yes the proc means is for the mean of daysupp.

 

To get the count and mean we could directly use the proc means as below

 

proc means data=mode_a_1 noprint nway missing;
		class ndcnum metqty;
		var daysupp;
		output out=mode_a_2 (drop=_:)  n=num_obs mean=;
run;

proc sort; 
by ndcnum metqty descending num_obs;
run;
 
Thanks,
Jag
Occasional Contributor
Posts: 16

Re: Understanding sql count (*) as X, var1, var2, var3, and purpose of "group by".

[ Edited ]

Thanks!

 

2. Re: why "group by" is necessary. In that case, what happens if our group by variables are less than our count variables? i.e. below, where we count by ndcnum, metqty, daysupp, but we only group by ndcnum and daysupp? 

 

proc sql;
		create table mode_b_1 as
		select count(*) as num_obs, ndcnum, metqty, daysupp
		from collapse_drugs_2 (where=(daysupp^=.))
		group by ndcnum, daysupp
		order by ndcnum, num_obs desc;
	quit;

3. Re: proc means step: Can you please explain the difference between mode_a_1 and mode_a_2? I feel like from my understanding of the code they are identical to each other? 

Super User
Super User
Posts: 6,499

Re: Understanding sql count (*) as X, var1, var2, var3, and purpose of "group by".

(2) 

It is generally less confusing to list the group by variables first in the SELECT statement.  So if we re-write the query from your original question:

create table mode_a_1 as
  select
    ndcnum
  , metqty
  , daysupp
  , count(*) as num_obs
  from (select * from collapse_drugs_2 where not missing(daysupp))
  group by ndcnum, metqty, daysupp
  order by ndcnum, metqty, num_obs desc
;

So we can see that it is counting how many records there are for each NDCNUM*METQTY*DAYSUPP combination and then sorting them so that the most frequent value of DAYSUPP will be first for each combination of NDCNUM and METQTY.  So we could see which DAYSUPP is the most common for say the 100mg METQTY version of Drug A by seeing which one is listed first.

 

If you list variables in your SELECT statement that are neither group by variables or aggregate functions then SAS will calculate the aggregate values and re-merge it onto all of the rows that have those same values for the group by variables. It will write a note to the log saying that it is doing this. So in your new query:

create table mode_b_1 as
  select
    ndcnum
  , daysupp
  , count(*) as num_obs
  , metqty
  from collapse_drugs_2 (where=(daysupp^=.))
  group by ndcnum, daysupp
  order by ndcnum, num_obs desc
;

If the group defined by NDCNUM='12345678901' and DAYSSUPP=30 has 10 observations then the all 10 observations in the output dataset will have 10 as the value of the new NUM_OBS variable.

 

You can now use the first observation per drug to see which value of DAYSUPP is most popular for a particular drug if you ignore the dose level.  You can still see all of the dose levels, but you would need to count them again to see which dose level is most common.

 

 

 

(3) In going from the input to the PROC MEANS step (mode_a_1) to the output of the PROC MEAN step (mode_b_1) you should be reducing the number of observations by replacing the multiple values of DAYSSUPP with the single mean value.

PROC Star
Posts: 63

Re: Understanding sql count (*) as X, var1, var2, var3, and purpose of "group by".

@christinakwang:

Ad 1: "Create a dataset called "mode_a_1" that is based off of all rows" - no not ALL rows. Because we have the "GROUP BY" clause, we get unique combinations of the group variables, not all rows - at least when no other variables are included.

 

Ad 2: The GROUP BY clause is necesssary because otherwise COUNT(*) would return the total number of rows that met the specifications in the WHERE clause, and we would get all observations from the input, not just the unique rows. So the NUMOBS would be larger, and the same for all observations, and all initial rows would be included, not just the unique ones.

 

Ad 3: The MEANS procedure looks a bit strange to me, too. But is does do something: DAYSUP was a group variable in PROC SQL, but it is now the summary variable. What looks strange to me is the fact that the previous summary variable (NUMOBS) is now used as a classification variable. But maybe there is a reason for this. Do you have some example data (e.g. as a datastep, so that we can easily create the test data) and perhaps a description of what the code is supposed to do?

 

A short description or the final output is something like: Produces average DAYSUP for each NCDNUM, METQTY and number of observations (is it possible to give a meaningful description to that number, such as e.g. "Number of injections"?). Is that a reasonable description of the desired output?

Occasional Contributor
Posts: 16

Re: Understanding sql count (*) as X, var1, var2, var3, and purpose of "group by".

[ Edited ]

@s_lassen

 

Thanks for your response! Answers to question 1 and 2 makes sense. 

 

Re: 3, I think this is Truven pharmacy claims data. 

 

data __drugs;
	set &do i=1 %to %sysfunc (countw (&pharm_files.));
		%let filename = %scan (&pharm_files., &i));
		raw.&filename.
	%end;
	; 
	
/* This was subsequently cleaned (i.e. removed negative values, etc) and
made into collapse_drugs_2; 

The purpose of the code in the OP was "get the mode of daysupp per NDC"

*/

I added in the comments. Not sure what else I can provide? I don't even think I have access to the actual data, though I do have the data dictionary + the standard data summary that Truven gives? 

 

Thanks!! 

Ask a Question
Discussion stats
  • 5 replies
  • 163 views
  • 3 likes
  • 4 in conversation