Database Multi-Node Load – New SLICEEXPRESSIONS Option in Viya

1 Like

Multi-node load is the process of involving several CAS workers in an MPP environment to load data coming from a database. Each CAS worker sends a different SQL query to the database and receives the corresponding portion of the data. This allows the parallelization of database table loads. If the database can handle multiple concurrent queries appropriately and the bandwidth between the CAS nodes and the database is good enough, this is a very efficient way of loading external data in CAS.

Select the image to see a larger version.
Mobile users: To view the image, select the "Full" version at the bottom of the page.

You enable multi-node load by using the NUMREADNODES option (in a CASLIB statement or CASUTIL procedure). This defines how many workers you want to involve in the parallel load. Then, by default, if not user-specified, CAS locates the first suitable numeric column in the source table (Integer, Decimal, Numeric, Double) to be used as the slicing column. CAS will use that slicing column to prepare and send distinct SQL queries based on the database MOD(ULO) function (remainder of the division) and the NUMREADNODES value.

Example of multi-node load code:

proc casutil incaslib="caspg" outcaslib="caspg" ;
   load casdata="film" casout="PG_film" options=(numreadnodes=4) replace ;
quit ;

Example of generated SQL (last CAS worker):

select "SLICE_SQL"."film_id", ...,
       "SLICE_SQL"."fulltext" 
from  (select "public"."film"."film_id", ...,
              "public"."film"."fulltext" 
       from "public"."film")  "SLICE_SQL"
where  ( (MOD (ABS ("SLICE_SQL"."film_id") , 4) =1) )

Of course, one can choose the slicing column instead of letting CAS do it. A user probably knows better the data than CAS does. So, if he knows a good slicing column candidate (right cardinality, correctly balanced, indexed, etc.) he can specify it using the SLICECOLUMN option.

Example of multi-node code with SLICECOLUMN:

proc casutil incaslib="caspg" outcaslib="caspg" ;
   load casdata="film" casout="PG_film" options=(numreadnodes=4 sliceColumn="length") replace ;
quit ;

Example of generated SQL (last CAS worker):

select "SLICE_SQL"."film_id", ...,
       "SLICE_SQL"."fulltext" 
from  (select "public"."film"."film_id", ...,
              "public"."film"."fulltext" 
       from "public"."film")  "SLICE_SQL"
where  ( (MOD (ABS ("SLICE_SQL"."length") , 4) =1) )

With the new SAS Viya, a new option has appeared and is named SLICEEXPRESSIONS. It allows the user to be in total control of the SQL queries that will be sent to the database by explicitly specifying the filtering conditions that will partition your database data for multi-node loading into CAS. This offers many advantages like controlling the distribution of the data over the CAS workers or specifying efficient filtering conditions based on database or table settings.

Example of multi-node code with SLICEEXPRESSIONS:

proc casutil incaslib="caspg" outcaslib="caspg" ;
   load casdata="film" casout="PG_film" options=(numreadnodes=2 sliceExpressions=("length < 100","length >= 100")) replace ;
quit ;

Example of generated SQL (second CAS worker):

select "SLICE_SQL"."film_id", ...,
       "SLICE_SQL"."fulltext" 
from  (select "public"."film"."film_id", ...,
              "public"."film"."fulltext" 
       from "public"."film")  "SLICE_SQL"
where  ( ("SLICE_SQL"."length">=100) )

The number of expressions in the SLICEEXPRESSIONS option must match the NUMREADNODES value. SQL expressions are passed "as is" to the database. Keep also in mind that when the NUMREADNODES value is less than the number of available CAS workers, the database table will land only on that specified number of nodes.

View the documentation of the SLICEEXPRESSIONS option (PostgreSQL example, available for most Data Connectors).

Thanks for reading.

Database Multi-Node Load – New SLICEEXPRESSIONS Option in Viya

Registration is open

SAS AI and Machine Learning Courses