topic Re: SPARSE option in proc freq. Is it possible to only include all possible values of one variable? in SAS Programming

SPARSE option in proc freq. Is it possible to only include all possible values of one variable?

Hello_there — Thu, 01 Dec 2022 15:40:54 GMT

I have a table where I'm trying to populate 0's for all treatments. There are 3 possible treatments. I don't need all possible combinations of values for all variables, but I need all possible values of treatment. Can the SPARSE function handle this with an option I'm unfamiliar with? Is there a better way?

data have;
infile datalines dsd dlm=",";
	input trt $ color $ brand $;
datalines;
1, blue, chevy
2, blue, chevy
3, red, ford
2, yellow, chevy
1, red, toyota
;
run;

proc freq data=have noprint;
	table trt*color*brand/ sparse list nopercent nocum norow nocol out=count (drop=percent);
run;

desired output:

Re: SPARSE option in proc freq. Is it possible to only include all possible values of one variable?

PeterClemmensen — Thu, 01 Dec 2022 19:21:58 GMT

I don't think this can be done directly in Proc Freq. Here is an approach in the Data Step.

data want(drop = t);

   if _N_ = 1 then do;
      dcl hash h(dataset : 'have');
      h.definekey(all : 'Y');
      h.definedone();
   end;

   set have;

   do t = 1 to 3;
      trt = put(t, 8.);
      if h.check() then do;
         output;
         h.add();
      end;
   end;

run;

Re: SPARSE option in proc freq. Is it possible to only include all possible values of one variable?

FreelanceReinh — Thu, 01 Dec 2022 19:59:55 GMT

Hi @Hello_there,

I think you would need to apply SPARSE to a crosstabulation of comb*trt where comb is a concatenation of color and brand, created in a preliminary step (and eventually separated again in a post-processing step). But then it would be easier to get everything in one step as shown by PeterClemmensen or in the PROC SQL step below.

proc sql;
create table want as
select *, count(a.trt) as count from 
have a
natural right join
(select distinct * from have(keep=trt), have(drop=trt))
group by 2,3,1;
quit;

(This assumes that all treatments occur in dataset HAVE.)

Re: SPARSE option in proc freq. Is it possible to only include all possible values of one variable?

Hello_there — Fri, 02 Dec 2022 00:26:44 GMT

Thanks for the reply!

Re: SPARSE option in proc freq. Is it possible to only include all possible values of one variable?

Hello_there — Fri, 02 Dec 2022 00:31:04 GMT

Hi FreelanceReinhard,

I am familiar with PROC SQL, but this is kind of a more sophisticated query than I'm used to. If you don't mind would you be able to explain what is happening? Also, i know that 2, 3, 1 refers to the position of the selected variables, but I don't which specifically those are referencing.

Thanks for your reply, this is helping me learn

Re: SPARSE option in proc freq. Is it possible to only include all possible values of one variable?

FreelanceReinh — Fri, 02 Dec 2022 21:18:23 GMT

You're welcome.

The first idea was to create a table combining all distinct treatments in dataset HAVE with all distinct (color, brand) pairs in dataset HAVE. This is a Cartesian product join of the form

proc sql;
select * from
  (select distinct trt from have),
  (select distinct color, brand from have);
quit;

But the same result can be obtained from the shorter (but probably less efficient) query

proc sql;
select distinct * from
  have(keep=trt), have(drop=trt);
quit;

which combines all treatments (including duplicates) in dataset HAVE with all (color, brand) pairs (again including duplicates) in dataset HAVE, eventually removing duplicates from this set of combinations.

I used the latter query as an inline view in the outer query, which has the form of a natural right join, i.e., a join using all like-named columns of the contributing tables as key variables: trt, color and brand. The right join ensures that the output dataset WANT will contain all (trt, color, brand) combinations of the inline view constructed above and, most importantly, it helps us to detect those combinations which are not contained in dataset HAVE (the left part of the join). This is necessary to get the zeros into the final table.

A (trt, color, brand) combination which is not contained in HAVE contributes missing values for all three variables to the join. These are normally invisible, though, because we only see the non-missing values contributed by the inline view. But the explicit reference a.trt (and similarly a.color or a.brand) allows us to work with the values contributed by HAVE (with alias a). The number of non-missing a.trt values per (trt, color, brand) combination is what we need in variable count of dataset WANT. So we compute count(a.trt) for each combination.

The variables created by the natural join are the same as in dataset HAVE, in the same order: trt, color, brand. We can refer to them by their positions 1, 2, 3, respectively, in the GROUP BY clause. Your desired output suggests a sort order by color, brand, trt, which is obtained by group by 2, 3, 1 (surprisingly not by group by color, brand, trt). To ensure the desired sort order of dataset WANT for your real input data you should insert an ORDER BY clause after the GROUP BY clause:

order by 2, 3, 1;

Edit / Addendum:

I was "surprised" that group by 2, 3, 1 produced a different sort order of dataset WANT than group by color, brand, trt, but instead I should have been surprised that the GROUP BY clause using the explicit variable names worked at all! In an ordinary (i.e., not natural) right join a GROUP BY clause like this would have caused the PROC SQL step to fail with error messages

ERROR: Ambiguous reference, column color is in more than one table.
ERROR: Ambiguous reference, column brand is in more than one table.
ERROR: Ambiguous reference, column trt is in more than one table.

because both dataset HAVE (with alias a) and the inline view contribute these same variable names to the join. Even if the errors were avoided by being more specific in the variable references in the GROUP BY clause, the "select *" of the outer query would at least cause warning messages

WARNING: Variable trt already exists on file WORK.WANT.
WARNING: Variable color already exists on file WORK.WANT.
WARNING: Variable brand already exists on file WORK.WANT.

reminding us of those name conflicts. Yet, all this doesn't happen and this appears to be a special (convenience) feature of natural joins.

The FEEDBACK option of the PROC SQL statement reveals what happens inside the natural join:

1141  proc sql feedback;
1142  create table want as
1143  select *, count(a.trt) as count from
1144  have a
1145  natural right join
1146  (select distinct * from have(keep=trt) b, have(drop=trt) c)
1147  group by color, brand, trt;
NOTE: Statement transforms to:

        select COALESCE(A.trt, B.trt) as trt, COALESCE(A.color, C.color) as color, COALESCE(A.brand, C.brand) as brand, COUNT(A.trt) as count
          from WORK.HAVE A right outer join
               ( select distinct B.trt, C.color, C.brand
                   from WORK.HAVE B(keep=trt), WORK.HAVE C(drop=trt)
               ) on (A.trt = B.trt) and (A.color = C.color) and (A.brand = C.brand)
      group by C.color, C.brand, B.trt;

NOTE: The execution of this query involves performing one or more Cartesian product joins that can not be optimized.
NOTE: The query requires remerging summary statistics back with the original data.
NOTE: Table WORK.WANT created, with 12 rows and 4 columns.

(For the above log I had also added aliases b and c to the source datasets of the inline view in order to clarify the code provided by the FEEDBACK option.)

So it turns out that with the natural join the compiler decided to (tacitly) use color, brand and trt from the inline view, thus avoiding the harmful ambiguities. The resolution of the asterisk in the "select *" of the outer query to COALESCE expressions as shown above is standard behavior of natural joins and avoids the warnings mentioned earlier.

The numbers 1, 2, 3 used in "group by 2, 3, 1", however, always refer to the items in the (outer) SELECT clause. In our case:

COALESCE(A.trt, B.trt)
COALESCE(A.color, C.color)
COALESCE(A.brand, C.brand)

While the natural join assigned the original variable names trt, color and brand to these three items, they are, of course, different entities than C.color, C.brand and B.trt. So, in this case "group by 2, 3, 1" is not an abbreviation of "group by color, brand, trt" (in the sense of "group by C.color, C.brand, B.trt"), but rather of

group by calculated color, calculated brand, calculated trt

which produces the same sort order of dataset WANT.

Re: SPARSE option in proc freq. Is it possible to only include all possible values of one variable?

Hello_there — Fri, 02 Dec 2022 14:25:25 GMT

Thanks for taking time to type this out and teach me, FreelanceReinhard! I will be studying this thoroughly.

Re: SPARSE option in proc freq. Is it possible to only include all possible values of one variable?

FreelanceReinh — Fri, 02 Dec 2022 21:20:04 GMT

I've just expanded my previous post by an explanation why group by 2, 3, 1 is different from group by color, brand, trt.

Re: SPARSE option in proc freq. Is it possible to only include all possible values of one variable?

Hello_there — Sat, 03 Dec 2022 01:48:33 GMT

Thanks again for your help, FreelanceReinhard!

Re: SPARSE option in proc freq. Is it possible to only include all possible values of one variable?

Hello_there — Sat, 10 Dec 2022 09:11:44 GMT

@FreelanceReinh wrote:

The FEEDBACK option of the PROC SQL statement reveals what happens inside the natural join:

NOTE: The execution of this query involves performing one or more Cartesian product joins that can not be optimized.
NOTE: The query requires remerging summary statistics back with the original data.
NOTE: Table WORK.WANT created, with 12 rows and 4 columns.

Hi FreelanceReinhard,

Do you know if there's any chance how i can avoid the top 2 notes?

Re: SPARSE option in proc freq. Is it possible to only include all possible values of one variable?

FreelanceReinh — Sat, 10 Dec 2022 13:05:05 GMT

@Hello_there wrote:

NOTE: The execution of this query involves performing one or more Cartesian product joins that can not be optimized.
NOTE: The query requires remerging summary statistics back with the original data.
NOTE: Table WORK.WANT created, with 12 rows and 4 columns.

The second of these notes does not occur with the code from the accepted solution. It occurred with the PROC SQL step that I used to demonstrate that the correct GROUP BY clause

group by 2,3,1;

from the solution is not equivalent to

group by color, brand, trt;

So you may be using the wrong code if you see that note with your final program.

Generally speaking, if the goal is to have a clean log, then you don't need to do anything about these notes as they do not indicate problems. They just describe the internal workings of a PROC SQL step to a level of detail that we are not used to with other procedures like PROC FREQ or PROC SUMMARY.

If you must avoid this type of notes, you can either apply a cosmetic change to the log or replace the concise PROC SQL step with a more cumbersome multi-step solution, possibly avoiding PROC SQL altogether (at least Cartesian product joins and automatic remerging). Example: the approach using concatenated values in a variable comb as outlined in the first sentence of the accepted solution.

For a "cosmetic" solution there are two options, neither of which I would recommend:

Suppress all notes from the PROC SQL step by means of options nonotes and then write the desired notes to the log:

options nonotes;
proc sql;
create table want as
select *, count(a.trt) as count from 
have a
natural right join
(select distinct * from have(keep=trt), have(drop=trt))
group by 2,3,1;
quit;

options notes nosource;
%let _dsid=%sysfunc(open(&syslast));
%let _nvar=%sysfunc(attrn(&_dsid,NVARS));
%let _dsid=%sysfunc(close(&_dsid));
%put NOTE: Table %cmpres(&syslast) created, with &sqlobs rows and &_nvar columns.;
%symdel _dsid _nvar;
proc sql;
quit;
options source;

Log:

12    options nonotes;
13    proc sql;
14    create table want as
15    select *, count(a.trt) as count from
16    have a
17    natural right join
18    (select distinct * from have(keep=trt), have(drop=trt))
19    group by 2,3,1;
20    quit;
21
22    options notes nosource;
NOTE: Table WORK.WANT created, with 12 rows and 4 columns.
NOTE: PROCEDURE SQL used (Total process time):
      real time           0.01 seconds
      cpu time            0.01 seconds

Read, manipulate and rewrite the log as a text file.

Re: SPARSE option in proc freq. Is it possible to only include all possible values of one variable?

Hello_there — Sat, 10 Dec 2022 19:29:53 GMT

Thanks for explaining.
Learning about how useful natural joins are have been useful for my use cases so this was a definitely helpful.
You were right, i was running the wrong code before and that's where the one note came from bc i didn't use a group by statement.
As for having a clean log, I might have to get used to seeing "Note:" in my log. Usually we have a log checker, and it might flag for just warnings and errors and not notes.

Re: SPARSE option in proc freq. Is it possible to only include all possible values of one variable?

Hello_there — Wed, 18 Jan 2023 17:29:09 GMT

Hi @FreelanceReinh ,

I don't know if you have any experience w/ dealing w/ clinical trial data, but I'm trying to use this code to get counts for distinct trios of SOC term, PT term, and TRT01AN. Not every TRT01AN is in the main data set, so i had to pull them from another data set. When I run the below code, i get duplicates. After the right join, i'm trying to combine distinct TRT01AN values, with distinct pairs of SOC term and PT terms. Do you have a suggestion for how I can do that in one PROC SQL step? I know it's possible do a proc sort and use nodup to fix my issue, but i was wondering if i could just do it in PROC SQL.

proc sql;
create table want as
select pt_term, soc_term, trt01an, count(distinct a.subject) as count from 
adae a
natural right join
(select distinct * from adae(keep=soc_term pt_term), adsl(keep=trt01an))
group by soc_term, pt_term, trt01an;
quit;

Re: SPARSE option in proc freq. Is it possible to only include all possible values of one variable?

FreelanceReinh — Wed, 18 Jan 2023 16:05:00 GMT

Hi @Hello_there,

Do you still get duplicates after changing the GROUP BY clause to

group by soc_term, pt_term, calculated trt01an;

(see this earlier post in this thread for the reason)?

Re: SPARSE option in proc freq. Is it possible to only include all possible values of one variable?

Hello_there — Wed, 18 Jan 2023 16:37:43 GMT

Unfortunately that didn't work.

If i break it apart this code returns back the right number of rows (without the counts).

proc sql;
	create table look as
	select * from 
  		(select distinct trt01an from adsl),
  		(select distinct soc_term, pt_term from adae);
quit;

I was wondering if it was possible to do a right join with the counts to that. But it looks like the below code doesn't work.

proc sql;
create table want as
	select soc_term, pt_term, trt01an, count(distinct a.subject) as count
	from adae a
	natural right join
 		select * from
			(select distinct trt01an from adsl),
  			(select distinct soc_term, pt_term from adae)
	group by soc_term, pt_term, trt01an;
quit;

Re: SPARSE option in proc freq. Is it possible to only include all possible values of one variable?

FreelanceReinh — Wed, 18 Jan 2023 17:03:16 GMT

@Hello_there wrote:

Unfortunately that didn't work.

So, you still got duplicates? Or an error in the log (if so, please show the log)?

If i break it apart this code returns back the right number of rows (without the counts).
proc sql;
	create table look as
	select * from 
  		(select distinct trt01an from adsl),
  		(select distinct soc_term, pt_term from adae);
quit;
I was wondering if it was possible to do a right join with the counts to that. But it looks like the below code doesn't work.
proc sql;
create table want as
	select soc_term, pt_term, trt01an, count(distinct a.subject) as count
	from adae a
	natural right join
 		select * from
			(select distinct trt01an from adsl),
  			(select distinct soc_term, pt_term from adae)
	group by soc_term, pt_term, trt01an;
quit;

To use the first query (creating dataset look) as a subquery in another query you must put it into parentheses: ... right join (select * from (...), (...)).

Your query has changed from your previous post, which used aebodsys and aedecod. Now that soc_term and pt_term come from both sides of the natural right join, you would need the "calculated" keyword before all three items in the GROUP BY clause or simply use the abbreviation

group by 1,2,3;

as discussed earlier in this thread.

Re: SPARSE option in proc freq. Is it possible to only include all possible values of one variable?

Hello_there — Wed, 18 Jan 2023 17:38:29 GMT

Your advice worked! Adding the () around the subquery and the calculated before each group by term worked. Thanks again!

I used soc_term/aebodsys and pc_term/aedecod interchangeably; i should have been more consistent, sorry about the confusion.

Also as for the first question, there were no errors in the log, but it was most likely due to the variable mix up on my part that caused the duplicates.