BookmarkSubscribeRSS Feed
novinosrin
Tourmaline | Level 20

PROC SQL: Auto Remerge when "FROM Table vs FROM (Inline view)"

 

Objective: Lazy me wants to use proc sql remerge as opposed to sort and DOW// proc summarycount and merge back

Here is the testdata I took from another thread used by draycut and Ksharp to experiment for my case:

data testdata(drop=i j);
   array vars var1-var3;
   do i=1 to 100;
      do j=1 to dim(vars);
         vars[j]=rand('integer', 1, 10);
      end;
      output;
   end;
run;

 

The expected-->

Get the count of each by group

 

proc sql;
Create table the_expected as
select *, count(*) as c
from testdata
group by var1,var2,var3
order by var1,var2,var3;
quit;

Yeah right, so traditional and merge back with the original to get accomplish the objective.

 

Intent: Take advantage of auto emerge and why bother coding the second pass:

But since there is no open scope for autoremerge for the reason var1,var2 and var3 are the only variables in testdata that also forms the group by combination, a way out to force remerge would be is by having temp variable in select clause and then drop using dataset option

 

Cool as it seems, I thought, let me try

 

proc sql;
Create table the_expected(drop=j) as
select *,0 as j, count(*) as c/*purpose of j is to force remerge*/
from testdata
group by var1,var2,var3
order by var1,var2,var3;
quit;

However, my understanding was obviously  incorrect. The precedence is select columns and group by and vice versa seem to go hither and wither. This also makes sense to why users many a times prefer datastep sequential processing if not direct access with point= or whatever that gives greater control. 

 

Anyways, Either way, I still need the temp var to force remerge , so how about an inline view?

 

proc sql;
Create table inline_view_attempt(drop=j) as
select *,count(*) as c
from (select *,0 as j from testdata)/*purpose of j is to force remerge*/
group by var1,var2,var3
order by var1,var2,var3;
quit;

hmm, idea seems reasonable, but unfavorable result yet again

 

So, now what if I had the temp variable created in another query and attempt to read a table rather than a inline view-->

proc sql;
/*temp to hold the extra var j to force remerge*/
create table temp as
select *,0 as j from testdata;/*purpose of j is to force remerge*/

/*query the temp table that has the temp var j */
Create table From_temp_table_attmpt(drop=j) as
select *,count(*) as c
from temp
group by var1,var2,var3
order by var1,var2,var3;
quit;

This works fine but completely defeats the intent and objective.

Can somebody help me understand the nitty gritty of this behavior?

 

PS

Above all, the *, remerge et al in general are some features I love about proc sql and I believe I have been using proc sql quite effectively and if can  get the understanding to have my mind and fingers to have control over it, all the more the better. Thank you!

 

 

 

 

 

 

 

 

 

 

 

9 REPLIES 9
Tom
Super User Tom
Super User

You need another variable, otherwise there is nothing to "remerge".

So if your source table has an extra variable then it will remerge so that it can keep the values of that variable. If you drop that variable there is nothing to remerge with.

134   proc sql;
135   create table the_expected as
136     select *, count(*) as c
137     from testdata
138     group by var1,var2,var3
139     order by var1,var2,var3
140   ;
NOTE: The query requires remerging summary statistics back with the original data.
NOTE: Table WORK.THE_EXPECTED created, with 100 rows and 5 columns.

141   create table want as
142     select *, count(*) as c
143     from testdata(drop=n)
144     group by var1,var2,var3
145     order by var1,var2,var3
146   ;
NOTE: Table WORK.WANT created, with 96 rows and 4 columns.

147   quit;
NOTE: PROCEDURE SQL used (Total process time):
      real time           0.02 seconds
      cpu time            0.01 seconds


If you keep the count you can always expand it later. 

148
149   data want2;
150     set want ;
151     do n=1 to c;
152       output;
153     end;
154   run;

NOTE: There were 96 observations read from the data set WORK.WANT.
NOTE: The data set WORK.WANT2 has 100 observations and 5 variables.
NOTE: DATA statement used (Total process time):
      real time           0.01 seconds
      cpu time            0.01 seconds

Or better just use the count variable as a FREQ or WEIGHT option.

 

novinosrin
Tourmaline | Level 20

Thank you sir @Tom for your time as always a privilege to receive favors from you.  My concern is 

 

why the inline view creation of an extra var is not taken up by proc sql processor for remerge as opposed to clean remerge functionality when the extra var happens to be in source. Does that mean inline view in general is not to be considered an equivalent ?

 

EDIT: I am scratching my head as the comparison has gone from remerge to comparison of inline view vs table. 

FreelanceReinh
Jade | Level 19

Hi @novinosrin,

 

This is interesting, thanks for sharing. I was surprised to see that the inline view doesn't trigger remerging.

 

Same with an ordinary PROC SQL view:

proc sql;
create view tempv as
select *, ranuni(2718) as j from testdata;
quit;

proc sql feedback _method;
create table from_view_attmpt as
select *, count(*)
from tempv
group by var1,var2,var3;
quit; /* No remerging: <100 obs., thus some of the 100 generated random numbers j were excluded. */


However, with a DATA step view remerging does occur:

data tempdsv / view=tempdsv;
set testdata;
j=ranuni(2718);
run;

proc sql feedback _method;
create table from_dsview_attmpt as
select *, count(*)
from tempdsv
group by var1,var2,var3;
quit; /* Remerging occurs: 100 obs. */

I found a related discussion from 2017 in https://communities.sas.com/t5/SAS-Programming/PROC-SQL-remerging/td-p/332784 where @Quentin opened a ticket with Tech Support and apparently this led to Usage Note 60119: Certain PROC SQL queries fail to remerge summary statistics in SAS® 9.4.

novinosrin
Tourmaline | Level 20

Thank you sir @FreelanceReinh  Now this has started to really concern me. Lucky that I am in college and none of my courses had anything to do with SAS. But having said that, I am certain I and many others have offered the remerge based easy/lazy solution to many OPs in our home aka here(sas communities).  I wonder what if the scenario happened to be something similar i.e for some reason the vars were dropped in the process and when it hits something of this kind. Lord have mercy 🙂

 

PS,

 

@Cruise  I hope you take a look at this thread 

PGStats
Opal | Level 21

Turns out that IMHO the SQL processor is very clever at detecting if a remerge is needed or not. Sometimes it gets incomprehensible. For example, compare:

 

data testdata(drop=i j);
   call streaminit(87878);
   array vars var1-var3;
   do i=1 to 10;
      do j=1 to dim(vars);
         vars[j]=rand('integer', 1, 3);
      end;
      output;
   end;
run;

proc sql;

/* Doesn't remerge */
create view junk as
select *, rand('uniform') as junk from testdata;
select 
    *,
    count(*) as c
from junk
group by var1,var2,var3;

/* Does remerge */
create table junkTable as
select *, rand('uniform') as junk from testdata;
select 
    *,
    count(*) as c
from junkTable
group by var1,var2,var3;

quit;

But doing your own remerge can be simple and natural

 

proc sql;
select a.* 
from
(select 
    *,
    sum(var3) as d
 from testdata
 group by var1,var2,var3) as a
natural join
(select * from testdata);
quit;
PG
novinosrin
Tourmaline | Level 20

 

 

Sir, Does that mean we can blindly trust that embedded intelligence at all times? I can't trust myself. But If that's coming from you, Tom or Reinhard, that does make me feel comfortable. But , what about inline view vs table though? Coffee break!


@PGStats wrote:

Turns out that IMHO the SQL processor "is very clever at detecting if a remerge" is needed or not. Sometimes it gets incomprehensible. For example, compare:

 

data testdata(drop=i j);
   call streaminit(87878);
   array vars var1-var3;
   do i=1 to 10;
      do j=1 to dim(vars);
         vars[j]=rand('integer', 1, 3);
      end;
      output;
   end;
run;

proc sql;

/* Doesn't remerge */
create view junk as
select *, rand('uniform') as junk from testdata;
select 
    *,
    count(*) as c
from junk
group by var1,var2,var3;

/* Does remerge */
create table junkTable as
select *, rand('uniform') as junk from testdata;
select 
    *,
    count(*) as c
from junkTable
group by var1,var2,var3;

quit;

But doing your own remerge can be simple and natural

 

proc sql;
select a.* 
from
(select 
    *,
    sum(var3) as d
 from testdata
 group by var1,var2,var3) as a
natural join
(select * from testdata);
quit;

 

Quentin
Super User

This is surprising and a little scary.  I've heard real SQL devotees suggest that remerges should be avoided as a rule, because they're such an oddity.  I've often relied on remerge, because it's so handy.  But I'm starting to question that approach. I too would have expected the inline view to trigger a remerge.

BASUG is hosting free webinars Next up: Don Henderson presenting on using hash functions (not hash tables!) to segment data on June 12. Register now at the Boston Area SAS Users Group event page: https://www.basug.org/events.
novinosrin
Tourmaline | Level 20

Thank you Sir @Quentin  for chiming in.  Hmm (sighs) ,something to deal with "extra caution" going forward 

Quentin
Super User

Thank you Sir @novinosrin for raising this issue.  I suggest you bring it to tech support's attention, and see what they say.

 

It feels related to this note, but I don't think it quite fits within the criteria listed.  http://support.sas.com/kb/60/119.html

 

Here's a simple example, where I would expect the in-line view to trigger remerge, but it does not:

 

data have ;
  input id ;
  cards ;
1
2
3
3
4
5
;

proc sql ;
  select id,j,count(*) 
  from (select id, ranuni(0) as j from have)
  group by id
  ;
quit ;

The general rule for when will (should) PROC SQL remerge is:

This note is generated when a column, listed on the select clause,
is not being used in a summary function and the column is not listed on
the GROUP BY clause.  If the column selected is not used in a summary
function, adding it to a GROUP BY clause prevents the note.

 

http://support.sas.com/techsup/notes/v8/4/308.html

 

Feels like a bug to me.

 

BASUG is hosting free webinars Next up: Don Henderson presenting on using hash functions (not hash tables!) to segment data on June 12. Register now at the Boston Area SAS Users Group event page: https://www.basug.org/events.

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 9 replies
  • 1212 views
  • 6 likes
  • 5 in conversation