Hi all,
I'm dealing with a case where analysts/modellers work with Base SAS analytical tables that can reach up to 2.5TB in volume (compressed) with up to 1,000 variables.
Obviously such volumes are not ideal for ad hoc analysis, especially because there aren't many indexes on these tables.
I can't do anything about these source tables, so here is what I'm thinking:
Write a macro that allows these users to create a copy of such tables under their user folder, ideally with only a subset of the columns and with added indexes of their choice.
Because even reading a 2.5TB table takes really long, I'm trying to come up with a table copy approach that performs as well as possible.
The environment is a SAS Grid, SAS 9.4M5, RHEL with 32 physical CPUs per node. The default for the EG sessions the users spawn is 4 nodes.
The demo code below is the best I could come up with, and I'd love your advice on anything else I could do to improve performance, both for creating the copy and for query performance using WHERE clauses.
I assume that for the copy and index creation process it's all about minimizing I/O and the number of passes through the data.
For this reason, I've added the creation of the new indexes as a data set option on the APPEND base table ...but I don't really know whether SAS will actually create these indexes in a single pass through the data together with the copy, or whether creating the indexes later on via a PROC DATASETS INDEX CREATE statement would perform the same.
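For comparison, the two-pass alternative I have in mind would be something like the following (same index names as in the demo code further down; this is just a sketch of the idea):

proc datasets lib=trg nolist;
   modify &tbl.;
   index create var_1 var_2 idx_01=(var_1 var_2);
run;
quit;

With this variant the copy runs first, and the INDEX CREATE then makes additional passes over the finished table to build each index.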
I'm creating the target table using the SPDE engine. I don't have a lot of hands-on experience with this engine, but because the data copy needs to live under a user folder, I believe there is no use for more than one data or index folder.
About the number of partitions: based on what I've read in the documentation, I'm thinking about defining the partition size as "size of source table / number of physical CPUs". Not sure if I got that right, though.
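As a back-of-the-envelope sketch of that calculation (the 2.5TB and 32 CPUs are the numbers from above; the resulting &gb_part_size. would feed the PARTSIZE= option that is commented out in the code below):

/* rough partsize: source table size / number of physical CPUs */
%let tbl_size_gb=2560;   /* approx. table size in GB (2.5TB)   */
%let n_cpu=32;           /* physical CPUs per node             */
%let gb_part_size=%sysevalf(&tbl_size_gb./&n_cpu., ceil);
%put NOTE: partsize=&gb_part_size.G;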
And here is the sample code (self-contained and fully executable as is).
options fullstimer msglevel=i dlcreatedir;
%let root_path=%sysfunc(pathname(work));
%let tbl=analytical_table;

/*-- source table: existing analytical table with up to 2.5TB (compressed) and up to 1000 variables --*/
libname src "&root_path.\src";
data src.&tbl.;
   set sashelp.class;
   array var_ {20} 8;
   do k=1 to 10000;
      do i=1 to dim(var_);
         var_[i]=i*_n_;
      end;
      output;
   end;
run;

proc sort data=src.&tbl. out=src.&tbl.(index=(name age n_a=(name age)));
   by name;
run;

/*-- target table: copy of analytical table with added indexes and reduced number of columns --*/
/* create required folders */
libname trg "&root_path.\trg";
libname trg "&root_path.\trg\spde_&tbl";

/* assign target library using spde engine */
libname trg spde "&root_path.\trg"
   datapath=("&root_path.\trg\spde_&tbl")
   indexpath=("&root_path.\trg\spde_&tbl")
   compress=char
   /*partsize=&gb_part_size.G*/
   ;

/* create target table with added indexes and less columns */
proc datasets lib=trg nolist nowarn;
   delete &tbl.;
   run;
   append
      base=trg.&tbl.(index=(var_1 var_2 idx_01=(var_1 var_2)))
      data=src.&tbl.(keep=name--var_4)
      getsort
      ;
   run;
quit;

/* list indexes on target table */
proc sql;
   select *
      from dictionary.indexes
      where libname='TRG' and memname=%upcase("&tbl.")
   ;
quit;
Thanks, Patrick
Generally speaking, SPDE does not usually speed up writing, because the data is still fed in a single thread.
But it has an upside when it comes to updating indexes, which is done in parallel while the data is inserted.
And of course, querying the data...
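To illustrate the query side: once the SPDE copy with its indexes exists, a plain WHERE clause is enough for SAS to consider them, and the IDXWHERE=YES data set option can force index use when testing (trg, &tbl., var_1 and var_2 are the names from the code above):

data subset;
   set trg.&tbl.(idxwhere=yes);
   where var_1=5 and var_2=10;
run;

Check the log with msglevel=i to see which index, if any, the WHERE clause actually used.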