Re: PROC SQL: Auto Remerge when FROM Table vs FROM (Inline view)

novinosrin · Posted 01-17-2019 01:39 PM

PROC SQL: Auto Remerge when "FROM Table vs FROM (Inline view)"

Objective: Lazy me wants to use proc sql remerge as opposed to sort and DOW// proc summarycount and merge back

Here is the testdata I took from another thread used by draycut and Ksharp to experiment for my case:

data testdata(drop=i j);
   array vars var1-var3;
   do i=1 to 100;
      do j=1 to dim(vars);
         vars[j]=rand('integer', 1, 10);
      end;
      output;
   end;
run;

The expected-->

Get the count of each by group

proc sql;
Create table the_expected as
select *, count(*) as c
from testdata
group by var1,var2,var3
order by var1,var2,var3;
quit;

Yeah right, so traditional and merge back with the original to get accomplish the objective.

Intent: Take advantage of auto emerge and why bother coding the second pass:

But since there is no open scope for autoremerge for the reason var1,var2 and var3 are the only variables in testdata that also forms the group by combination, a way out to force remerge would be is by having temp variable in select clause and then drop using dataset option

Cool as it seems, I thought, let me try

proc sql;
Create table the_expected(drop=j) as
select *,0 as j, count(*) as c/*purpose of j is to force remerge*/
from testdata
group by var1,var2,var3
order by var1,var2,var3;
quit;

However, my understanding was obviously incorrect. The precedence is select columns and group by and vice versa seem to go hither and wither. This also makes sense to why users many a times prefer datastep sequential processing if not direct access with point= or whatever that gives greater control.

Anyways, Either way, I still need the temp var to force remerge , so how about an inline view?

proc sql;
Create table inline_view_attempt(drop=j) as
select *,count(*) as c
from (select *,0 as j from testdata)/*purpose of j is to force remerge*/
group by var1,var2,var3
order by var1,var2,var3;
quit;

hmm, idea seems reasonable, but unfavorable result yet again

So, now what if I had the temp variable created in another query and attempt to read a table rather than a inline view-->

proc sql;
/*temp to hold the extra var j to force remerge*/
create table temp as
select *,0 as j from testdata;/*purpose of j is to force remerge*/

/*query the temp table that has the temp var j */
Create table From_temp_table_attmpt(drop=j) as
select *,count(*) as c
from temp
group by var1,var2,var3
order by var1,var2,var3;
quit;

This works fine but completely defeats the intent and objective.

Can somebody help me understand the nitty gritty of this behavior?

PS

Above all, the *, remerge et al in general are some features I love about proc sql and I believe I have been using proc sql quite effectively and if can get the understanding to have my mind and fingers to have control over it, all the more the better. Thank you!

Tom · Posted 01-17-2019 02:39 PM

You need another variable, otherwise there is nothing to "remerge".

So if your source table has an extra variable then it will remerge so that it can keep the values of that variable. If you drop that variable there is nothing to remerge with.

134   proc sql;
135   create table the_expected as
136     select *, count(*) as c
137     from testdata
138     group by var1,var2,var3
139     order by var1,var2,var3
140   ;
NOTE: The query requires remerging summary statistics back with the original data.
NOTE: Table WORK.THE_EXPECTED created, with 100 rows and 5 columns.

141   create table want as
142     select *, count(*) as c
143     from testdata(drop=n)
144     group by var1,var2,var3
145     order by var1,var2,var3
146   ;
NOTE: Table WORK.WANT created, with 96 rows and 4 columns.

147   quit;
NOTE: PROCEDURE SQL used (Total process time):
      real time           0.02 seconds
      cpu time            0.01 seconds

If you keep the count you can always expand it later.

148
149   data want2;
150     set want ;
151     do n=1 to c;
152       output;
153     end;
154   run;

NOTE: There were 96 observations read from the data set WORK.WANT.
NOTE: The data set WORK.WANT2 has 100 observations and 5 variables.
NOTE: DATA statement used (Total process time):
      real time           0.01 seconds
      cpu time            0.01 seconds

Or better just use the count variable as a FREQ or WEIGHT option.

novinosrin · Posted 01-17-2019 05:24 PM

Thank you sir @Tom for your time as always a privilege to receive favors from you. My concern is

why the inline view creation of an extra var is not taken up by proc sql processor for remerge as opposed to clean remerge functionality when the extra var happens to be in source. Does that mean inline view in general is not to be considered an equivalent ?

EDIT: I am scratching my head as the comparison has gone from remerge to comparison of inline view vs table.

FreelanceReinh · Posted 01-17-2019 05:34 PM

Hi @novinosrin,

This is interesting, thanks for sharing. I was surprised to see that the inline view doesn't trigger remerging.

Same with an ordinary PROC SQL view:

proc sql;
create view tempv as
select *, ranuni(2718) as j from testdata;
quit;

proc sql feedback _method;
create table from_view_attmpt as
select *, count(*)
from tempv
group by var1,var2,var3;
quit; /* No remerging: <100 obs., thus some of the 100 generated random numbers j were excluded. */

However, with a DATA step view remerging does occur:

data tempdsv / view=tempdsv;
set testdata;
j=ranuni(2718);
run;

proc sql feedback _method;
create table from_dsview_attmpt as
select *, count(*)
from tempdsv
group by var1,var2,var3;
quit; /* Remerging occurs: 100 obs. */

I found a related discussion from 2017 in https://communities.sas.com/t5/SAS-Programming/PROC-SQL-remerging/td-p/332784 where @Quentin opened a ticket with Tech Support and apparently this led to Usage Note 60119: Certain PROC SQL queries fail to remerge summary statistics in SAS® 9.4.

novinosrin · Posted 01-17-2019 05:41 PM

Thank you sir @FreelanceReinh Now this has started to really concern me. Lucky that I am in college and none of my courses had anything to do with SAS. But having said that, I am certain I and many others have offered the remerge based easy/lazy solution to many OPs in our home aka here(sas communities). I wonder what if the scenario happened to be something similar i.e for some reason the vars were dropped in the process and when it hits something of this kind. Lord have mercy 🙂

PS,

@Cruise I hope you take a look at this thread

PGStats · Posted 01-17-2019 05:43 PM

Turns out that IMHO the SQL processor is very clever at detecting if a remerge is needed or not. Sometimes it gets incomprehensible. For example, compare:

data testdata(drop=i j);
   call streaminit(87878);
   array vars var1-var3;
   do i=1 to 10;
      do j=1 to dim(vars);
         vars[j]=rand('integer', 1, 3);
      end;
      output;
   end;
run;

proc sql;

/* Doesn't remerge */
create view junk as
select *, rand('uniform') as junk from testdata;
select 
    *,
    count(*) as c
from junk
group by var1,var2,var3;

/* Does remerge */
create table junkTable as
select *, rand('uniform') as junk from testdata;
select 
    *,
    count(*) as c
from junkTable
group by var1,var2,var3;

quit;

But doing your own remerge can be simple and natural

proc sql;
select a.* 
from
(select 
    *,
    sum(var3) as d
 from testdata
 group by var1,var2,var3) as a
natural join
(select * from testdata);
quit;

PG

novinosrin · Posted 01-17-2019 05:50 PM

Sir, Does that mean we can blindly trust that embedded intelligence at all times? I can't trust myself. But If that's coming from you, Tom or Reinhard, that does make me feel comfortable. But , what about inline view vs table though? Coffee break!

@PGStats wrote:

Turns out that IMHO the SQL processor "is very clever at detecting if a remerge" is needed or not. Sometimes it gets incomprehensible. For example, compare:

data testdata(drop=i j);
   call streaminit(87878);
   array vars var1-var3;
   do i=1 to 10;
      do j=1 to dim(vars);
         vars[j]=rand('integer', 1, 3);
      end;
      output;
   end;
run;

proc sql;

/* Doesn't remerge */
create view junk as
select *, rand('uniform') as junk from testdata;
select 
    *,
    count(*) as c
from junk
group by var1,var2,var3;

/* Does remerge */
create table junkTable as
select *, rand('uniform') as junk from testdata;
select 
    *,
    count(*) as c
from junkTable
group by var1,var2,var3;

quit;

But doing your own remerge can be simple and natural

proc sql;
select a.* 
from
(select 
    *,
    sum(var3) as d
 from testdata
 group by var1,var2,var3) as a
natural join
(select * from testdata);
quit;

Quentin · Posted 01-22-2019 08:26 AM

This is surprising and a little scary. I've heard real SQL devotees suggest that remerges should be avoided as a rule, because they're such an oddity. I've often relied on remerge, because it's so handy. But I'm starting to question that approach. I too would have expected the inline view to trigger a remerge.

The Boston Area SAS Users Group is hosting free webinars!
Next webinar will be in January 2025. Until then, check out our archives: https://www.basug.org/videos. And be sure to subscribe to our our email list.

novinosrin · Posted 01-22-2019 09:16 AM

Thank you Sir @Quentin for chiming in. Hmm (sighs) ,something to deal with "extra caution" going forward

Quentin · Posted 01-22-2019 12:03 PM

Thank you Sir @novinosrin for raising this issue. I suggest you bring it to tech support's attention, and see what they say.

It feels related to this note, but I don't think it quite fits within the criteria listed. http://support.sas.com/kb/60/119.html

Here's a simple example, where I would expect the in-line view to trigger remerge, but it does not:

data have ;
  input id ;
  cards ;
1
2
3
3
4
5
;

proc sql ;
  select id,j,count(*) 
  from (select id, ranuni(0) as j from have)
  group by id
  ;
quit ;

The general rule for when will (should) PROC SQL remerge is:

This note is generated when a column, listed on the select clause,
is not being used in a summary function and the column is not listed on
the GROUP BY clause.  If the column selected is not used in a summary
function, adding it to a GROUP BY clause prevents the note.

http://support.sas.com/techsup/notes/v8/4/308.html

Feels like a bug to me.

The Boston Area SAS Users Group is hosting free webinars!
Next webinar will be in January 2025. Until then, check out our archives: https://www.basug.org/videos. And be sure to subscribe to our our email list.

SAS Innovate 2025: Save the Date

SAS Training: Just a Click Away