Hi all,
I'm dealing with a case where analysts/modellers work with Base SAS analytical tables that can get up to 2.5TB in volume (compressed) with up to 1000 variables.
Such volumes are obviously not ideal for ad-hoc analysis, especially because there aren't many indexes on these tables.
I can't do anything about these source tables, so here's what I'm thinking:
Write a macro that allows these users to create a copy of such tables under their user folder with ideally only a sub-set of columns and added indexes of their choice.
Because even a single read of a 2.5TB table takes really long, I'm trying to come up with a table copy approach that performs as well as possible.
The environment is a SAS Grid, SAS 9.4M5, RHEL with 32 CPUs per node (actual). The EG sessions the users spawn default to 4 nodes.
The demo code below is the best I could come up with, and I'd love your advice on anything else I could do to improve performance, both for creating the copy and for query performance using WHERE clauses.
I assume that for the copy and index creation process it's all about minimizing I/O and passes through the data.
For this reason, I've added the creation of the new indexes as a data set option on the APPEND base table. But I don't know whether SAS will actually create these indexes in a single pass through the data together with the copy, or whether creating the indexes afterwards via a PROC DATASETS INDEX CREATE would perform the same.
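For comparison, the two-step variant I have in mind would look like the sketch below (index names match the demo code; whether the extra pass actually costs more is exactly what I'd benchmark with fullstimer):

```sas
/* Two-step alternative: copy first without indexes, then add them.
   INDEX CREATE re-reads the table, so this is presumably an extra
   pass - but it keeps the copy step itself as lean as possible.   */
proc datasets lib=trg nolist;
   modify &tbl.;
   index create var_1 var_2 idx_01=(var_1 var_2);
quit;
```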
I'm creating the target table using the SPDE engine. I don't have much hands-on experience with this engine, but because the data copy needs to live under a single user folder, I believe there is no benefit in more than one data or index path.
About the number of partitions: based on what I've read in the documentation, I'm thinking of defining the partition size as "size of source table / number of actual CPUs". Not sure I got that right, though.
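As a sketch of what I mean (macro name and rounding are my own choices; FILESIZE from dictionary.tables is in bytes and &sysncpu. reports the CPUs SAS sees on the current node):

```sas
/* Derive a PARTSIZE= value in GB from source table size / CPU count. */
%macro calc_partsize(lib=, mem=);
   proc sql noprint;
      select ceil(filesize / (&sysncpu. * 1024**3))
         into :gb_part_size trimmed
         from dictionary.tables
         where libname=%upcase("&lib.") and memname=%upcase("&mem.");
   quit;
%mend;
%calc_partsize(lib=SRC, mem=&tbl.)
%put NOTE: partsize=&gb_part_size.G;
```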
And here is the sample code (self-contained and fully executable as is).
options fullstimer msglevel=i dlcreatedir;
%let root_path=%sysfunc(pathname(work));
%let tbl =analytical_table;
/*-- source table: existing analytical table with up to 2.5TB (compressed) and up to 1000 variables --*/
libname src "&root_path./src";
data src.&tbl.;
   set sashelp.class;
   array var_ {20} 8;
   do k=1 to 10000;
      do i=1 to dim(var_);
         var_[i]=i*_n_;
      end;
      output;
   end;
run;
proc sort data=src.&tbl. out=src.&tbl.(index=(name age n_a=(name age)));
   by name;
run;
/*-- target table: copy of analytical table with added indexes and reduced number of columns --*/
/* create required folders */
libname trg "&root_path./trg";
libname trg "&root_path./trg/spde_&tbl";
/* assign target library using spde engine */
libname trg spde "&root_path./trg" datapath=("&root_path./trg/spde_&tbl") indexpath=("&root_path./trg/spde_&tbl") compress=char /*partsize=&gb_part_size.G*/;
/* create target table with added indexes and less columns */
proc datasets lib=trg nolist nowarn;
   delete &tbl.;
run;
   append
      base=trg.&tbl.(index=(var_1 var_2 idx_01=(var_1 var_2)))
      data=src.&tbl.(keep=name--var_4)
      getsort
   ;
run;
quit;
/* list indexes on target table */
proc sql;
   select *
      from dictionary.indexes
      where libname='TRG' and memname=%upcase("&tbl.")
   ;
quit;
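And to see whether the indexes actually get used for the kind of WHERE queries the analysts run: with msglevel=i already in effect (set at the top), a query like the one below should print an INFO note in the log naming the index chosen (values match the demo data):

```sas
/* msglevel=i makes SAS log which index, if any, the WHERE clause uses */
data _null_;
   set trg.&tbl.;
   where var_1=1 and var_2=2;
run;
```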
Thanks, Patrick