I have a table where I'm trying to populate 0's for all treatments. There are 3 possible treatments. I don't need all possible combinations of values for all variables, but I need all possible values of treatment. Can the SPARSE function handle this with an option I'm unfamiliar with? Is there a better way?
data have;
infile datalines dsd dlm=",";
input trt $ color $ brand $;
datalines;
1, blue, chevy
2, blue, chevy
3, red, ford
2, yellow, chevy
1, red, toyota
;
run;
proc freq data=have noprint;
table trt*color*brand/ sparse list nopercent nocum norow nocol out=count (drop=percent);
run;
desired output:
Hi @Hello_there,
I think you would need to apply SPARSE to a crosstabulation of comb*trt where comb is a concatenation of color and brand, created in a preliminary step (and eventually separated again in a post-processing step). But then it would be easier to get everything in one step as shown by PeterClemmensen or in the PROC SQL step below.
proc sql;
create table want as
select *, count(a.trt) as count from
have a
natural right join
(select distinct * from have(keep=trt), have(drop=trt))
group by 2,3,1;
quit;
(This assumes that all treatments occur in dataset HAVE.)
I don't think this can be done directly in Proc Freq. Here is an approach in the Data Step.
data want(drop = t);
if _N_ = 1 then do;
dcl hash h(dataset : 'have');
h.definekey(all : 'Y');
h.definedone();
end;
set have;
do t = 1 to 3;
trt = put(t, 8.);
if h.check() then do;
output;
h.add();
end;
end;
run;
Hi @Hello_there,
I think you would need to apply SPARSE to a crosstabulation of comb*trt where comb is a concatenation of color and brand, created in a preliminary step (and eventually separated again in a post-processing step). But then it would be easier to get everything in one step as shown by PeterClemmensen or in the PROC SQL step below.
proc sql;
create table want as
select *, count(a.trt) as count from
have a
natural right join
(select distinct * from have(keep=trt), have(drop=trt))
group by 2,3,1;
quit;
(This assumes that all treatments occur in dataset HAVE.)
You're welcome.
The first idea was to create a table combining all distinct treatments in dataset HAVE with all distinct (color, brand) pairs in dataset HAVE. This is a Cartesian product join of the form
proc sql;
select * from
(select distinct trt from have),
(select distinct color, brand from have);
quit;
But the same result can be obtained from the shorter (but probably less efficient) query
proc sql;
select distinct * from
have(keep=trt), have(drop=trt);
quit;
which combines all treatments (including duplicates) in dataset HAVE with all (color, brand) pairs (again including duplicates) in dataset HAVE, eventually removing duplicates from this set of combinations.
I used the latter query as an inline view in the outer query, which has the form of a natural right join, i.e., a join using all like-named columns of the contributing tables as key variables: trt, color and brand. The right join ensures that the output dataset WANT will contain all (trt, color, brand) combinations of the inline view constructed above and, most importantly, it helps us to detect those combinations which are not contained in dataset HAVE (the left part of the join). This is necessary to get the zeros into the final table.
A (trt, color, brand) combination which is not contained in HAVE contributes missing values for all three variables to the join. These are normally invisible, though, because we only see the non-missing values contributed by the inline view. But the explicit reference a.trt (and similarly a.color or a.brand) allows us to work with the values contributed by HAVE (with alias a). The number of non-missing a.trt values per (trt, color, brand) combination is what we need in variable count of dataset WANT. So we compute count(a.trt) for each combination.
The variables created by the natural join are the same as in dataset HAVE, in the same order: trt, color, brand. We can refer to them by their positions 1, 2, 3, respectively, in the GROUP BY clause. Your desired output suggests a sort order by color, brand, trt, which is obtained by group by 2, 3, 1 (surprisingly not by group by color, brand, trt). To ensure the desired sort order of dataset WANT for your real input data you should insert an ORDER BY clause after the GROUP BY clause:
order by 2, 3, 1;
Edit / Addendum:
I was "surprised" that group by 2, 3, 1 produced a different sort order of dataset WANT than group by color, brand, trt, but instead I should have been surprised that the GROUP BY clause using the explicit variable names worked at all! In an ordinary (i.e., not natural) right join a GROUP BY clause like this would have caused the PROC SQL step to fail with error messages
ERROR: Ambiguous reference, column color is in more than one table. ERROR: Ambiguous reference, column brand is in more than one table. ERROR: Ambiguous reference, column trt is in more than one table.
because both dataset HAVE (with alias a) and the inline view contribute these same variable names to the join. Even if the errors were avoided by being more specific in the variable references in the GROUP BY clause, the "select *" of the outer query would at least cause warning messages
WARNING: Variable trt already exists on file WORK.WANT. WARNING: Variable color already exists on file WORK.WANT. WARNING: Variable brand already exists on file WORK.WANT.
reminding us of those name conflicts. Yet, all this doesn't happen and this appears to be a special (convenience) feature of natural joins.
The FEEDBACK option of the PROC SQL statement reveals what happens inside the natural join:
1141 proc sql feedback; 1142 create table want as 1143 select *, count(a.trt) as count from 1144 have a 1145 natural right join 1146 (select distinct * from have(keep=trt) b, have(drop=trt) c) 1147 group by color, brand, trt; NOTE: Statement transforms to: select COALESCE(A.trt, B.trt) as trt, COALESCE(A.color, C.color) as color, COALESCE(A.brand, C.brand) as brand, COUNT(A.trt) as count from WORK.HAVE A right outer join ( select distinct B.trt, C.color, C.brand from WORK.HAVE B(keep=trt), WORK.HAVE C(drop=trt) ) on (A.trt = B.trt) and (A.color = C.color) and (A.brand = C.brand) group by C.color, C.brand, B.trt; NOTE: The execution of this query involves performing one or more Cartesian product joins that can not be optimized. NOTE: The query requires remerging summary statistics back with the original data. NOTE: Table WORK.WANT created, with 12 rows and 4 columns.
(For the above log I had also added aliases b and c to the source datasets of the inline view in order to clarify the code provided by the FEEDBACK option.)
So it turns out that with the natural join the compiler decided to (tacitly) use color, brand and trt from the inline view, thus avoiding the harmful ambiguities. The resolution of the asterisk in the "select *" of the outer query to COALESCE expressions as shown above is standard behavior of natural joins and avoids the warnings mentioned earlier.
The numbers 1, 2, 3 used in "group by 2, 3, 1", however, always refer to the items in the (outer) SELECT clause. In our case:
While the natural join assigned the original variable names trt, color and brand to these three items, they are, of course, different entities than C.color, C.brand and B.trt. So, in this case "group by 2, 3, 1" is not an abbreviation of "group by color, brand, trt" (in the sense of "group by C.color, C.brand, B.trt"), but rather of
group by calculated color, calculated brand, calculated trt
which produces the same sort order of dataset WANT.
I've just expanded my previous post by an explanation why group by 2, 3, 1 is different from group by color, brand, trt.
@FreelanceReinh wrote:
The FEEDBACK option of the PROC SQL statement reveals what happens inside the natural join:
NOTE: The execution of this query involves performing one or more Cartesian product joins that can not be optimized. NOTE: The query requires remerging summary statistics back with the original data. NOTE: Table WORK.WANT created, with 12 rows and 4 columns.
Hi FreelanceReinhard,
Do you know if there's any chance how i can avoid the top 2 notes?
@Hello_there wrote:
NOTE: The execution of this query involves performing one or more Cartesian product joins that can not be optimized. NOTE: The query requires remerging summary statistics back with the original data. NOTE: Table WORK.WANT created, with 12 rows and 4 columns.
The second of these notes does not occur with the code from the accepted solution. It occurred with the PROC SQL step that I used to demonstrate that the correct GROUP BY clause
group by 2,3,1;
from the solution is not equivalent to
group by color, brand, trt;
So you may be using the wrong code if you see that note with your final program.
Generally speaking, if the goal is to have a clean log, then you don't need to do anything about these notes as they do not indicate problems. They just describe the internal workings of a PROC SQL step to a level of detail that we are not used to with other procedures like PROC FREQ or PROC SUMMARY.
If you must avoid this type of notes, you can either apply a cosmetic change to the log or replace the concise PROC SQL step with a more cumbersome multi-step solution, possibly avoiding PROC SQL altogether (at least Cartesian product joins and automatic remerging). Example: the approach using concatenated values in a variable comb as outlined in the first sentence of the accepted solution.
For a "cosmetic" solution there are two options, neither of which I would recommend:
options nonotes; proc sql; create table want as select *, count(a.trt) as count from have a natural right join (select distinct * from have(keep=trt), have(drop=trt)) group by 2,3,1; quit; options notes nosource; %let _dsid=%sysfunc(open(&syslast)); %let _nvar=%sysfunc(attrn(&_dsid,NVARS)); %let _dsid=%sysfunc(close(&_dsid)); %put NOTE: Table %cmpres(&syslast) created, with &sqlobs rows and &_nvar columns.; %symdel _dsid _nvar; proc sql; quit; options source;
Log:
12 options nonotes; 13 proc sql; 14 create table want as 15 select *, count(a.trt) as count from 16 have a 17 natural right join 18 (select distinct * from have(keep=trt), have(drop=trt)) 19 group by 2,3,1; 20 quit; 21 22 options notes nosource; NOTE: Table WORK.WANT created, with 12 rows and 4 columns. NOTE: PROCEDURE SQL used (Total process time): real time 0.01 seconds cpu time 0.01 seconds
Hi @FreelanceReinh ,
I don't know if you have any experience w/ dealing w/ clinical trial data, but I'm trying to use this code to get counts for distinct trios of SOC term, PT term, and TRT01AN. Not every TRT01AN is in the main data set, so i had to pull them from another data set. When I run the below code, i get duplicates. After the right join, i'm trying to combine distinct TRT01AN values, with distinct pairs of SOC term and PT terms. Do you have a suggestion for how I can do that in one PROC SQL step? I know it's possible do a proc sort and use nodup to fix my issue, but i was wondering if i could just do it in PROC SQL.
proc sql;
create table want as
select pt_term, soc_term, trt01an, count(distinct a.subject) as count from
adae a
natural right join
(select distinct * from adae(keep=soc_term pt_term), adsl(keep=trt01an))
group by soc_term, pt_term, trt01an;
quit;
Hi @Hello_there,
Do you still get duplicates after changing the GROUP BY clause to
group by soc_term, pt_term, calculated trt01an;
(see this earlier post in this thread for the reason)?
Unfortunately that didn't work.
If i break it apart this code returns back the right number of rows (without the counts).
proc sql;
create table look as
select * from
(select distinct trt01an from adsl),
(select distinct soc_term, pt_term from adae);
quit;
I was wondering if it was possible to do a right join with the counts to that. But it looks like the below code doesn't work.
proc sql;
create table want as
select soc_term, pt_term, trt01an, count(distinct a.subject) as count
from adae a
natural right join
select * from
(select distinct trt01an from adsl),
(select distinct soc_term, pt_term from adae)
group by soc_term, pt_term, trt01an;
quit;
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.