Re: Sql Outer Union Corr VS Data step Set statement

mcook · Posted 03-11-2022 01:47 PM

I have code similar to the following. I am trying to vertically join/concatenate about 370 tables.

From what i have read, the sql Outer Union Corr is equivalent to the data steps set statement.

However when i run the code, to join the same tables using the 2 different methods, the results differ.

The proc sql outer union corr results in roughly 100 more obs than the data step. (out of roughly 117,000).

is there an aspect I am overlooking with etiher approach?

*DATA STEP;

data Temp1;
	set lib1.Table_02
		lib1.Table_03
		..
		lib1.Table_10;
run;

data Temp2;
	set Lib2.Table_02
		Lib2.Table_03
		..
		lib2.Table_10;
run;
        ...
data temp40;
	set lib15.Table_02
		lib15.Table_03
		..
		lib15.Table_10;

data TempALL;
	set Temp1
		Temp2
		Temp3
		..
		Temp40;
run;



*PROC SQL;
proc sql noprint;
	create table Temp1 as 
		select * from 
			Lib1.Table_02 OUTER UNION CORR select * from  
			Lib1.Table_03	OUTER UNION CORR select * from 
			..
			Lib1.Table_100;

			quit;

proc sql noprint;
	create table Temp2 as 
		select * from 
			Lib2.Table_02 OUTER UNION CORR select * from  
			Lib2.Table_03	OUTER UNION CORR select * from 
			..
			Lib2.Table_100;
quit;

proc sql noprint;
	create table Temp3 as 
		select * from 
			Lib3.Table_02 OUTER UNION CORR select * from  
			Lib3.Table_03	OUTER UNION CORR select * from 
			..
			Lib3.Table_100;
quit;

proc sql noprint;
	create table Temp4 as 
		select * from 
			Lib4.Table_02 OUTER UNION CORR select * from  
			Lib4.Table_03	OUTER UNION CORR select * from 
			..
			Lib4.Table_100;
quit;

proc sql noprint;
	create table TempAll as 
		select * from 
			Temp1 OUTER UNION CORR select * from
			Temp2 OUTER UNION CORR select * from
			Temp3 OUTER UNION CORR select * from
			Temp4;
quit;

SASJedi · Posted 03-11-2022 03:55 PM

I tested with 100 datasets using a DATA step concatenation and SQL OUTER UNION CORR. The resulting datasets were identical. Can you give more detail on what you're doing? Here's my test code:

options dlcreatedir ;
libname lib1 "%qsysfunc(pathname(work))\lib1";
options nodlcreatedir nomprint nomlogic nosymbolgen; 

%macro MakeData(lib,num);
proc datasets library=&lib kill nolist nodetails;
run; quit;
%do i=1 %to &num;
%let n=%unquote(%putn(&i,z3.));
data lib1.Table_&n;
   retain Table "Table_&n";
   do ID_&n =1 to 5;
      output;
   end;
run;
%end;
%mend;

%macro ConcatenateDS(lib,num);
data work.tempAll_ds;
	set 
%do i=1 %to &num;
   %let n=%unquote(%putn(&i,z3.));
      lib1.Table_&n
%end;
;
run;
%end;
%mend;

%macro ConcatenateSQL(lib,num);
proc sql;
	create table work.tempALL_sql as 
%do i=1 %to %eval(&num-1);
   %let n=%unquote(%putn(&i,z3.));
		select * 
         from lib1.Table_&n
         OUTER UNION CORR 
%end;
   %let n=%unquote(%putn(&i,z3.));
		select * 
         from lib1.Table_&n
   ;
quit;
%mend;

%let no=100;
%makedata(lib1,&no)
%ConcatenateDS(lib1,&no)
%ConcatenateSQL(lib1,&no)

proc compare base=work.tempALL_ds
             compare=work.tempALL_sql;
run;

And the results:

The SAS System

                                        The COMPARE Procedure                                         
                         Comparison of WORK.TEMPALL_DS with WORK.TEMPALL_SQL                          
                                            (Method=EXACT)                                            
                                                                                                      
                                          Data Set Summary                                            
                                                                                                      
                  Dataset                    Created          Modified  NVar    NObs                  
                                                                                                      
                  WORK.TEMPALL_DS   11MAR22:15:50:50  11MAR22:15:50:50   101     500                  
                  WORK.TEMPALL_SQL  11MAR22:15:50:50  11MAR22:15:50:50   101     500                  
                                                                                                      
                                                                                                      
                                          Variables Summary                                           
                                                                                                      
                               Number of Variables in Common: 101.

                                                                                                      
                                                                                                      
                                         Observation Summary                                          
                                                                                                      
                                    Observation      Base  Compare                                    
                                                                                                      
                                    First Obs           1        1                                    
                                    Last  Obs         500      500                                    
                                                                                                      
                   Number of Observations in Common: 500.                                             
                   Total Number of Observations Read from WORK.TEMPALL_DS: 500.                       
                   Total Number of Observations Read from WORK.TEMPALL_SQL: 500.                      
                                                                                                      
                   Number of Observations with Some Compared Variables Unequal: 0.                    
                   Number of Observations with All Compared Variables Equal: 500.                     
                                                                                                      
                   NOTE: No unequal values were found. All values compared are exactly equal.

Check out my Jedi SAS Tricks for SAS Users

mkeintz · Posted 03-11-2022 04:02 PM

Log please, including the exact code that produce log messages.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

mkeintz · Posted 03-11-2022 04:29 PM

This note has nothing to do with the actual question you pose. But here are two comments that may substantially improve efficiency:

OPEN=DEFER. If, for a given SET statement, the input datasets all have the same variables, then you can save a lot of memory by using the OPEN=DEFER option. The default is OPEN=IMMEDIATE, which tells SAS to set up memory buffers for every dataset named in the SET statement. Using OPEN=DEFER opens just one buffer, which is then reused for each dataset. BUT ... this relies on the datasets having identical sets and types of variables.
Regardless of whether you can take advantage of OPEN=DEFER, define the datasets TEMP1, TEMP2, .... TEMP40 as dataset views, not (as you have done) dataset files. As views, they are merely rules for data access. These rules will not be activated (i.e. data will not be processed) until the view is called upon later in the program. As a result, the data are not written to disk, merely to be subsequently read back from disk. Instead, data from the views are transferred directly to the calling data (or proc) step.

Putting these two suggestions together, you would have:

data VTemp1 / view=vtemp1;
	set lib1.Table_02
		lib1.Table_03
		..
		lib1.Table_10
        open=defer;
run;

data VTemp2 / view=vtemp2;
	set Lib2.Table_02
		Lib2.Table_03
		..
		lib2.Table_10
        open=defer;
run;
        ...
data Vtemp40 / view=vtemp40;
	set lib15.Table_02
		lib15.Table_03
		..
		lib15.Table_10
        open=defer;
data TempALL;
	set Temp1
		Temp2
		Temp3
		..
		Temp40
        open=defer;
run;

Of course, if all 370 datasets have identical variables, you don't need to create intermediate data sets (TEMP1, TEMP2, ...). Instead you could


data TempALL;
	set lib1.Table_02
		lib1.Table_03
		..
		lib1.Table_10

	    Lib2.Table_02
		Lib2.Table_03
		..
		lib2.Table_10
        ..
	    lib15.Table_02
		lib15.Table_03
		..
		lib15.Table_10
        open=defer;
run;

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Reeza · Posted 03-11-2022 05:33 PM

And just a note that you can short cut reference these datasets if they do have sequential names.

data all;
set lib1.table_02-lib1.table_10
lib2.table_02-lib2.table_10
lib3.table_02-lib3.table_10;
run;

mcook · Posted 03-14-2022 08:29 AM

My apologies to everyone. It turned out to be a case of user error.

One of the tables was being added twice, but it somehow had gotten moved off to the right by about 20 Tabs, so it was far out of view, but still held the correct syntax. So in the dozens of times i looked through the list of tables, i never saw it until I did a Ctrl+F search for it.

At which point my wife comes running into my office to see what all the cursing was about.

Again. apologies.

Tom · Posted 03-14-2022 09:33 AM

There are some differences between the SQL unions and the data step SET statements when the structures are different.

The SQL code will set the length of character variables to the maximum length it has on any of the input datasets.

The SET statement will set the length of character variables to the length it has on the first dataset where it appears.

There is a limit on the number of datasets you can combine in a single SQL statement. I don't think there is a similar limit on the number of datasets you can reference in a SET statement.

Ready to join fellow brilliant minds for the SAS Hackathon?

Classroom Training Available!