PROC SQL: Auto Remerge when "FROM Table vs FROM (Inline view)"
Objective: Lazy me wants to use proc sql remerge as opposed to sort and DOW// proc summarycount and merge back
Here is the testdata I took from another thread used by draycut and Ksharp to experiment for my case:
data testdata(drop=i j);
array vars var1-var3;
do i=1 to 100;
do j=1 to dim(vars);
vars[j]=rand('integer', 1, 10);
end;
output;
end;
run;
The expected-->
Get the count of each by group
proc sql;
Create table the_expected as
select *, count(*) as c
from testdata
group by var1,var2,var3
order by var1,var2,var3;
quit;
Yeah right, so traditional and merge back with the original to get accomplish the objective.
Intent: Take advantage of auto emerge and why bother coding the second pass:
But since there is no open scope for autoremerge for the reason var1,var2 and var3 are the only variables in testdata that also forms the group by combination, a way out to force remerge would be is by having temp variable in select clause and then drop using dataset option
Cool as it seems, I thought, let me try
proc sql;
Create table the_expected(drop=j) as
select *,0 as j, count(*) as c/*purpose of j is to force remerge*/
from testdata
group by var1,var2,var3
order by var1,var2,var3;
quit;
However, my understanding was obviously incorrect. The precedence is select columns and group by and vice versa seem to go hither and wither. This also makes sense to why users many a times prefer datastep sequential processing if not direct access with point= or whatever that gives greater control.
Anyways, Either way, I still need the temp var to force remerge , so how about an inline view?
proc sql;
Create table inline_view_attempt(drop=j) as
select *,count(*) as c
from (select *,0 as j from testdata)/*purpose of j is to force remerge*/
group by var1,var2,var3
order by var1,var2,var3;
quit;
hmm, idea seems reasonable, but unfavorable result yet again
So, now what if I had the temp variable created in another query and attempt to read a table rather than a inline view-->
proc sql;
/*temp to hold the extra var j to force remerge*/
create table temp as
select *,0 as j from testdata;/*purpose of j is to force remerge*/
/*query the temp table that has the temp var j */
Create table From_temp_table_attmpt(drop=j) as
select *,count(*) as c
from temp
group by var1,var2,var3
order by var1,var2,var3;
quit;
This works fine but completely defeats the intent and objective.
Can somebody help me understand the nitty gritty of this behavior?
PS
Above all, the *, remerge et al in general are some features I love about proc sql and I believe I have been using proc sql quite effectively and if can get the understanding to have my mind and fingers to have control over it, all the more the better. Thank you!
You need another variable, otherwise there is nothing to "remerge".
So if your source table has an extra variable then it will remerge so that it can keep the values of that variable. If you drop that variable there is nothing to remerge with.
134 proc sql; 135 create table the_expected as 136 select *, count(*) as c 137 from testdata 138 group by var1,var2,var3 139 order by var1,var2,var3 140 ; NOTE: The query requires remerging summary statistics back with the original data. NOTE: Table WORK.THE_EXPECTED created, with 100 rows and 5 columns. 141 create table want as 142 select *, count(*) as c 143 from testdata(drop=n) 144 group by var1,var2,var3 145 order by var1,var2,var3 146 ; NOTE: Table WORK.WANT created, with 96 rows and 4 columns. 147 quit; NOTE: PROCEDURE SQL used (Total process time): real time 0.02 seconds cpu time 0.01 seconds
If you keep the count you can always expand it later.
148 149 data want2; 150 set want ; 151 do n=1 to c; 152 output; 153 end; 154 run; NOTE: There were 96 observations read from the data set WORK.WANT. NOTE: The data set WORK.WANT2 has 100 observations and 5 variables. NOTE: DATA statement used (Total process time): real time 0.01 seconds cpu time 0.01 seconds
Or better just use the count variable as a FREQ or WEIGHT option.
Thank you sir @Tom for your time as always a privilege to receive favors from you. My concern is
why the inline view creation of an extra var is not taken up by proc sql processor for remerge as opposed to clean remerge functionality when the extra var happens to be in source. Does that mean inline view in general is not to be considered an equivalent ?
EDIT: I am scratching my head as the comparison has gone from remerge to comparison of inline view vs table.
Hi @novinosrin,
This is interesting, thanks for sharing. I was surprised to see that the inline view doesn't trigger remerging.
Same with an ordinary PROC SQL view:
proc sql;
create view tempv as
select *, ranuni(2718) as j from testdata;
quit;
proc sql feedback _method;
create table from_view_attmpt as
select *, count(*)
from tempv
group by var1,var2,var3;
quit; /* No remerging: <100 obs., thus some of the 100 generated random numbers j were excluded. */
However, with a DATA step view remerging does occur:
data tempdsv / view=tempdsv;
set testdata;
j=ranuni(2718);
run;
proc sql feedback _method;
create table from_dsview_attmpt as
select *, count(*)
from tempdsv
group by var1,var2,var3;
quit; /* Remerging occurs: 100 obs. */
I found a related discussion from 2017 in https://communities.sas.com/t5/SAS-Programming/PROC-SQL-remerging/td-p/332784 where @Quentin opened a ticket with Tech Support and apparently this led to Usage Note 60119: Certain PROC SQL queries fail to remerge summary statistics in SAS® 9.4.
Thank you sir @FreelanceReinh Now this has started to really concern me. Lucky that I am in college and none of my courses had anything to do with SAS. But having said that, I am certain I and many others have offered the remerge based easy/lazy solution to many OPs in our home aka here(sas communities). I wonder what if the scenario happened to be something similar i.e for some reason the vars were dropped in the process and when it hits something of this kind. Lord have mercy 🙂
PS,
@Cruise I hope you take a look at this thread
Turns out that IMHO the SQL processor is very clever at detecting if a remerge is needed or not. Sometimes it gets incomprehensible. For example, compare:
data testdata(drop=i j);
call streaminit(87878);
array vars var1-var3;
do i=1 to 10;
do j=1 to dim(vars);
vars[j]=rand('integer', 1, 3);
end;
output;
end;
run;
proc sql;
/* Doesn't remerge */
create view junk as
select *, rand('uniform') as junk from testdata;
select
*,
count(*) as c
from junk
group by var1,var2,var3;
/* Does remerge */
create table junkTable as
select *, rand('uniform') as junk from testdata;
select
*,
count(*) as c
from junkTable
group by var1,var2,var3;
quit;
But doing your own remerge can be simple and natural
proc sql;
select a.*
from
(select
*,
sum(var3) as d
from testdata
group by var1,var2,var3) as a
natural join
(select * from testdata);
quit;
Sir, Does that mean we can blindly trust that embedded intelligence at all times? I can't trust myself. But If that's coming from you, Tom or Reinhard, that does make me feel comfortable. But , what about inline view vs table though? Coffee break!
@PGStats wrote:
Turns out that IMHO the SQL processor "is very clever at detecting if a remerge" is needed or not. Sometimes it gets incomprehensible. For example, compare:
data testdata(drop=i j); call streaminit(87878); array vars var1-var3; do i=1 to 10; do j=1 to dim(vars); vars[j]=rand('integer', 1, 3); end; output; end; run; proc sql; /* Doesn't remerge */ create view junk as select *, rand('uniform') as junk from testdata; select *, count(*) as c from junk group by var1,var2,var3; /* Does remerge */ create table junkTable as select *, rand('uniform') as junk from testdata; select *, count(*) as c from junkTable group by var1,var2,var3; quit;
But doing your own remerge can be simple and natural
proc sql; select a.* from (select *, sum(var3) as d from testdata group by var1,var2,var3) as a natural join (select * from testdata); quit;
This is surprising and a little scary. I've heard real SQL devotees suggest that remerges should be avoided as a rule, because they're such an oddity. I've often relied on remerge, because it's so handy. But I'm starting to question that approach. I too would have expected the inline view to trigger a remerge.
Thank you Sir @Quentin for chiming in. Hmm (sighs) ,something to deal with "extra caution" going forward
Thank you Sir @novinosrin for raising this issue. I suggest you bring it to tech support's attention, and see what they say.
It feels related to this note, but I don't think it quite fits within the criteria listed. http://support.sas.com/kb/60/119.html
Here's a simple example, where I would expect the in-line view to trigger remerge, but it does not:
data have ;
input id ;
cards ;
1
2
3
3
4
5
;
proc sql ;
select id,j,count(*)
from (select id, ranuni(0) as j from have)
group by id
;
quit ;
The general rule for when will (should) PROC SQL remerge is:
This note is generated when a column, listed on the select clause, is not being used in a summary function and the column is not listed on the GROUP BY clause. If the column selected is not used in a summary function, adding it to a GROUP BY clause prevents the note.
http://support.sas.com/techsup/notes/v8/4/308.html
Feels like a bug to me.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.