Solved: SPARSE option in proc freq. Is it possible to only include all possibl...

Hello_there · Posted 12-01-2022 10:40 AM

I have a table where I'm trying to populate 0's for all treatments. There are 3 possible treatments. I don't need all possible combinations of values for all variables, but I need all possible values of treatment. Can the SPARSE function handle this with an option I'm unfamiliar with? Is there a better way?

data have;
infile datalines dsd dlm=",";
	input trt $ color $ brand $;
datalines;
1, blue, chevy
2, blue, chevy
3, red, ford
2, yellow, chevy
1, red, toyota
;
run;

proc freq data=have noprint;
	table trt*color*brand/ sparse list nopercent nocum norow nocol out=count (drop=percent);
run;

desired output:

FreelanceReinh · Posted 12-01-2022 02:59 PM

Hi @Hello_there,

I think you would need to apply SPARSE to a crosstabulation of comb*trt where comb is a concatenation of color and brand, created in a preliminary step (and eventually separated again in a post-processing step). But then it would be easier to get everything in one step as shown by PeterClemmensen or in the PROC SQL step below.

proc sql;
create table want as
select *, count(a.trt) as count from 
have a
natural right join
(select distinct * from have(keep=trt), have(drop=trt))
group by 2,3,1;
quit;

(This assumes that all treatments occur in dataset HAVE.)

View solution in original post

PeterClemmensen · Posted 12-01-2022 02:21 PM

I don't think this can be done directly in Proc Freq. Here is an approach in the Data Step.

data want(drop = t);

   if _N_ = 1 then do;
      dcl hash h(dataset : 'have');
      h.definekey(all : 'Y');
      h.definedone();
   end;

   set have;

   do t = 1 to 3;
      trt = put(t, 8.);
      if h.check() then do;
         output;
         h.add();
      end;
   end;

run;

The DATA to DATA Step Macro
Blog: SASnrd

Hello_there · Posted 12-01-2022 07:26 PM

Thanks for the reply!

FreelanceReinh · Posted 12-01-2022 02:59 PM

Hi @Hello_there,

I think you would need to apply SPARSE to a crosstabulation of comb*trt where comb is a concatenation of color and brand, created in a preliminary step (and eventually separated again in a post-processing step). But then it would be easier to get everything in one step as shown by PeterClemmensen or in the PROC SQL step below.

proc sql;
create table want as
select *, count(a.trt) as count from 
have a
natural right join
(select distinct * from have(keep=trt), have(drop=trt))
group by 2,3,1;
quit;

(This assumes that all treatments occur in dataset HAVE.)

Hello_there · Posted 12-01-2022 07:31 PM

Hi FreelanceReinhard,

I am familiar with PROC SQL, but this is kind of a more sophisticated query than I'm used to. If you don't mind would you be able to explain what is happening? Also, i know that 2, 3, 1 refers to the position of the selected variables, but I don't which specifically those are referencing.

Thanks for your reply, this is helping me learn

FreelanceReinh · Posted 12-02-2022 06:02 AM

You're welcome.

The first idea was to create a table combining all distinct treatments in dataset HAVE with all distinct (color, brand) pairs in dataset HAVE. This is a Cartesian product join of the form

proc sql;
select * from
  (select distinct trt from have),
  (select distinct color, brand from have);
quit;

But the same result can be obtained from the shorter (but probably less efficient) query

proc sql;
select distinct * from
  have(keep=trt), have(drop=trt);
quit;

which combines all treatments (including duplicates) in dataset HAVE with all (color, brand) pairs (again including duplicates) in dataset HAVE, eventually removing duplicates from this set of combinations.

I used the latter query as an inline view in the outer query, which has the form of a natural right join, i.e., a join using all like-named columns of the contributing tables as key variables: trt, color and brand. The right join ensures that the output dataset WANT will contain all (trt, color, brand) combinations of the inline view constructed above and, most importantly, it helps us to detect those combinations which are not contained in dataset HAVE (the left part of the join). This is necessary to get the zeros into the final table.

A (trt, color, brand) combination which is not contained in HAVE contributes missing values for all three variables to the join. These are normally invisible, though, because we only see the non-missing values contributed by the inline view. But the explicit reference a.trt (and similarly a.color or a.brand) allows us to work with the values contributed by HAVE (with alias a). The number of non-missing a.trt values per (trt, color, brand) combination is what we need in variable count of dataset WANT. So we compute count(a.trt) for each combination.

The variables created by the natural join are the same as in dataset HAVE, in the same order: trt, color, brand. We can refer to them by their positions 1, 2, 3, respectively, in the GROUP BY clause. Your desired output suggests a sort order by color, brand, trt, which is obtained by group by 2, 3, 1 (surprisingly not by group by color, brand, trt). To ensure the desired sort order of dataset WANT for your real input data you should insert an ORDER BY clause after the GROUP BY clause:

order by 2, 3, 1;

Edit / Addendum:

I was "surprised" that group by 2, 3, 1 produced a different sort order of dataset WANT than group by color, brand, trt, but instead I should have been surprised that the GROUP BY clause using the explicit variable names worked at all! In an ordinary (i.e., not natural) right join a GROUP BY clause like this would have caused the PROC SQL step to fail with error messages

ERROR: Ambiguous reference, column color is in more than one table.
ERROR: Ambiguous reference, column brand is in more than one table.
ERROR: Ambiguous reference, column trt is in more than one table.

because both dataset HAVE (with alias a) and the inline view contribute these same variable names to the join. Even if the errors were avoided by being more specific in the variable references in the GROUP BY clause, the "select *" of the outer query would at least cause warning messages

WARNING: Variable trt already exists on file WORK.WANT.
WARNING: Variable color already exists on file WORK.WANT.
WARNING: Variable brand already exists on file WORK.WANT.

reminding us of those name conflicts. Yet, all this doesn't happen and this appears to be a special (convenience) feature of natural joins.

The FEEDBACK option of the PROC SQL statement reveals what happens inside the natural join:

1141  proc sql feedback;
1142  create table want as
1143  select *, count(a.trt) as count from
1144  have a
1145  natural right join
1146  (select distinct * from have(keep=trt) b, have(drop=trt) c)
1147  group by color, brand, trt;
NOTE: Statement transforms to:

        select COALESCE(A.trt, B.trt) as trt, COALESCE(A.color, C.color) as color, COALESCE(A.brand, C.brand) as brand, COUNT(A.trt) as count
          from WORK.HAVE A right outer join
               ( select distinct B.trt, C.color, C.brand
                   from WORK.HAVE B(keep=trt), WORK.HAVE C(drop=trt)
               ) on (A.trt = B.trt) and (A.color = C.color) and (A.brand = C.brand)
      group by C.color, C.brand, B.trt;

NOTE: The execution of this query involves performing one or more Cartesian product joins that can not be optimized.
NOTE: The query requires remerging summary statistics back with the original data.
NOTE: Table WORK.WANT created, with 12 rows and 4 columns.

(For the above log I had also added aliases b and c to the source datasets of the inline view in order to clarify the code provided by the FEEDBACK option.)

So it turns out that with the natural join the compiler decided to (tacitly) use color, brand and trt from the inline view, thus avoiding the harmful ambiguities. The resolution of the asterisk in the "select *" of the outer query to COALESCE expressions as shown above is standard behavior of natural joins and avoids the warnings mentioned earlier.

The numbers 1, 2, 3 used in "group by 2, 3, 1", however, always refer to the items in the (outer) SELECT clause. In our case:

COALESCE(A.trt, B.trt)
COALESCE(A.color, C.color)
COALESCE(A.brand, C.brand)

While the natural join assigned the original variable names trt, color and brand to these three items, they are, of course, different entities than C.color, C.brand and B.trt. So, in this case "group by 2, 3, 1" is not an abbreviation of "group by color, brand, trt" (in the sense of "group by C.color, C.brand, B.trt"), but rather of

group by calculated color, calculated brand, calculated trt

which produces the same sort order of dataset WANT.

Hello_there · Posted 12-02-2022 09:25 AM

Thanks for taking time to type this out and teach me, FreelanceReinhard! I will be studying this thoroughly.

FreelanceReinh · Posted 12-02-2022 04:20 PM

I've just expanded my previous post by an explanation why group by 2, 3, 1 is different from group by color, brand, trt.

Hello_there · Posted 12-02-2022 08:48 PM

Thanks again for your help, FreelanceReinhard!

Hello_there · Posted 12-10-2022 04:11 AM

@FreelanceReinh wrote:

The FEEDBACK option of the PROC SQL statement reveals what happens inside the natural join:

NOTE: The execution of this query involves performing one or more Cartesian product joins that can not be optimized.
NOTE: The query requires remerging summary statistics back with the original data.
NOTE: Table WORK.WANT created, with 12 rows and 4 columns.

Hi FreelanceReinhard,

Do you know if there's any chance how i can avoid the top 2 notes?

FreelanceReinh · Posted 12-10-2022 08:05 AM

@Hello_there wrote:

NOTE: The execution of this query involves performing one or more Cartesian product joins that can not be optimized.
NOTE: The query requires remerging summary statistics back with the original data.
NOTE: Table WORK.WANT created, with 12 rows and 4 columns.

The second of these notes does not occur with the code from the accepted solution. It occurred with the PROC SQL step that I used to demonstrate that the correct GROUP BY clause

group by 2,3,1;

from the solution is not equivalent to

group by color, brand, trt;

So you may be using the wrong code if you see that note with your final program.

Generally speaking, if the goal is to have a clean log, then you don't need to do anything about these notes as they do not indicate problems. They just describe the internal workings of a PROC SQL step to a level of detail that we are not used to with other procedures like PROC FREQ or PROC SUMMARY.

If you must avoid this type of notes, you can either apply a cosmetic change to the log or replace the concise PROC SQL step with a more cumbersome multi-step solution, possibly avoiding PROC SQL altogether (at least Cartesian product joins and automatic remerging). Example: the approach using concatenated values in a variable comb as outlined in the first sentence of the accepted solution.

For a "cosmetic" solution there are two options, neither of which I would recommend:

Suppress all notes from the PROC SQL step by means of options nonotes and then write the desired notes to the log:

options nonotes;
proc sql;
create table want as
select *, count(a.trt) as count from 
have a
natural right join
(select distinct * from have(keep=trt), have(drop=trt))
group by 2,3,1;
quit;

options notes nosource;
%let _dsid=%sysfunc(open(&syslast));
%let _nvar=%sysfunc(attrn(&_dsid,NVARS));
%let _dsid=%sysfunc(close(&_dsid));
%put NOTE: Table %cmpres(&syslast) created, with &sqlobs rows and &_nvar columns.;
%symdel _dsid _nvar;
proc sql;
quit;
options source;

Log:

12    options nonotes;
13    proc sql;
14    create table want as
15    select *, count(a.trt) as count from
16    have a
17    natural right join
18    (select distinct * from have(keep=trt), have(drop=trt))
19    group by 2,3,1;
20    quit;
21
22    options notes nosource;
NOTE: Table WORK.WANT created, with 12 rows and 4 columns.
NOTE: PROCEDURE SQL used (Total process time):
      real time           0.01 seconds
      cpu time            0.01 seconds

Read, manipulate and rewrite the log as a text file.

Hello_there · Posted 12-10-2022 02:29 PM

Thanks for explaining.
Learning about how useful natural joins are have been useful for my use cases so this was a definitely helpful.
You were right, i was running the wrong code before and that's where the one note came from bc i didn't use a group by statement.
As for having a clean log, I might have to get used to seeing "Note:" in my log. Usually we have a log checker, and it might flag for just warnings and errors and not notes.

Hello_there · Posted 01-18-2023 10:04 AM

Hi @FreelanceReinh ,

I don't know if you have any experience w/ dealing w/ clinical trial data, but I'm trying to use this code to get counts for distinct trios of SOC term, PT term, and TRT01AN. Not every TRT01AN is in the main data set, so i had to pull them from another data set. When I run the below code, i get duplicates. After the right join, i'm trying to combine distinct TRT01AN values, with distinct pairs of SOC term and PT terms. Do you have a suggestion for how I can do that in one PROC SQL step? I know it's possible do a proc sort and use nodup to fix my issue, but i was wondering if i could just do it in PROC SQL.

proc sql;
create table want as
select pt_term, soc_term, trt01an, count(distinct a.subject) as count from 
adae a
natural right join
(select distinct * from adae(keep=soc_term pt_term), adsl(keep=trt01an))
group by soc_term, pt_term, trt01an;
quit;

FreelanceReinh · Posted 01-18-2023 11:05 AM

Hi @Hello_there,

Do you still get duplicates after changing the GROUP BY clause to

group by soc_term, pt_term, calculated trt01an;

(see this earlier post in this thread for the reason)?

Hello_there · Posted 01-18-2023 11:36 AM

Unfortunately that didn't work.

If i break it apart this code returns back the right number of rows (without the counts).

proc sql;
	create table look as
	select * from 
  		(select distinct trt01an from adsl),
  		(select distinct soc_term, pt_term from adae);
quit;

I was wondering if it was possible to do a right join with the counts to that. But it looks like the below code doesn't work.

proc sql;
create table want as
	select soc_term, pt_term, trt01an, count(distinct a.subject) as count
	from adae a
	natural right join
 		select * from
			(select distinct trt01an from adsl),
  			(select distinct soc_term, pt_term from adae)
	group by soc_term, pt_term, trt01an;
quit;

SPARSE option in proc freq. Is it possible to only include all possible values of one variable?

Re: SPARSE option in proc freq. Is it possible to only include all possible values of one variable?

Re: SPARSE option in proc freq. Is it possible to only include all possible values of one variable?

Re: SPARSE option in proc freq. Is it possible to only include all possible values of one variable?

Re: SPARSE option in proc freq. Is it possible to only include all possible values of one variable?

Re: SPARSE option in proc freq. Is it possible to only include all possible values of one variable?

Re: SPARSE option in proc freq. Is it possible to only include all possible values of one variable?

Re: SPARSE option in proc freq. Is it possible to only include all possible values of one variable?

Re: SPARSE option in proc freq. Is it possible to only include all possible values of one variable?

Re: SPARSE option in proc freq. Is it possible to only include all possible values of one variable?

Re: SPARSE option in proc freq. Is it possible to only include all possible values of one variable?

Re: SPARSE option in proc freq. Is it possible to only include all possible values of one variable?

Re: SPARSE option in proc freq. Is it possible to only include all possible values of one variable?

Re: SPARSE option in proc freq. Is it possible to only include all possible values of one variable?

Re: SPARSE option in proc freq. Is it possible to only include all possible values of one variable?

Re: SPARSE option in proc freq. Is it possible to only include all possible values of one variable?

SAS Innovate 2025: Call for Content

Classroom Training Available!