CAS data distribution: DUPLICATE a REPLICATION using COPIES. Can you REPEAT?

You’re probably asking what I’m going to talk about in this blog. Well, it is about CAS data distribution.

In CAS, one can “duplicate” tables, “repeat” tables, manage “copies” and deal with “replication”. And I’m not talking about “functional” copies of data that users may create in their day-to-day SAS life using “proc COPY” or “DATA A; SET B;” or something else.

No, I’m talking about internal CAS mechanisms, provided as built-in features, that help users with high availability and performance. In reality, there are only 2 concepts behind those 4 words: DUPLICATE, REPEAT, COPIES and REPLICATION.

Let’s dive deep into them.

First concept: INACTIVE COPIES – DATA REDUNDANCY

This is the one that many of you have probably already heard about. It’s one of the main differences from the LASR technology. It’s about fault tolerance and data redundancy in CAS.

When you load data into CAS, you can specify the number of redundant blocks, or inactive copies, to create. In the event of node failure, a surviving node accesses the data from the redundant block. Among all blocks of the same data segment, only 1 block on only 1 node is active at any given time.

The following figure depicts how the server uses the system of active blocks and redundant blocks to provide fault tolerance, in a default context with COPIES=1. There are three active blocks for Table A on the first worker node. The redundant blocks for those three blocks are distributed between the second and third worker nodes. Table A is a distributed table.

There are some subtleties regarding the management of copies depending on where the data comes from (SASHDAT file in HDFS, SASHDAT file on DNFS, DBMSs, etc.). To keep things simple, let’s talk about data coming from SAS7BDAT files either on the SAS workspace server or on the CAS controller server.

Here is sample code that loads data into CAS while specifying the COPIES/REPLICATION option:

/* Using CASUTIL procedure */
proc casutil ;
   load data=sashelp.prdsale(where=(country="U.S.A."))
      outcaslib="caspath" casout="prdsale" replace copies=2 ;
quit ;

/* Using FEDSQL */
proc fedsql sessref=mySession ;
   create table caspath.prdsale2 {options replace=true replication=0} as
   select * from caspath.prdsale where product='SOFA' ;
quit ;

/* Using Data Step */
data myCas.prdsale3(copies=1) ;
   set myCas.prdsale2 ;
run ;

/* Using a CAS action */
proc cas;
   table.loadTable /
      caslib="caspath"
      path="prdsale.sas7bdat"
      casout={
         caslib="caspath"
         name="prdsale4"
         replication=0
         replace=true} ;
quit ;

Notice that REPLICATION is a synonym for COPIES. Some language elements rely on COPIES, others on REPLICATION, others accept both.

Observe the total number of blocks (9) and the total number of active blocks (3). This result corresponds to a COPIES=2 table load: each of the 3 active blocks has 2 redundant copies, for 3 × (1 + 2) = 9 blocks in total.
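One way to check these counts yourself is the table.tableDetails action. The sketch below assumes an active CAS session and the "prdsale" table loaded into the "caspath" caslib by the CASUTIL example above; the exact output columns and the block counts you see may vary with your release and the number of workers.

```sas
/* A minimal sketch: list per-node block counts for the table loaded above. */
/* Assumes an active CAS session and the caslib/table names used earlier.   */
proc cas ;
   table.tableDetails /
      caslib="caspath"
      name="prdsale"
      level="node" ;   /* one output row per CAS node */
quit ;
```

With level="node", the output should show each node's total blocks next to its active blocks, so the redundant copies appear as the difference between the two.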

Second concept: ACTIVE COPIES – REPEATED TABLES

The second concept is the ability to load a table entirely on each CAS worker node.

The SAS documentation clearly explains what repeated tables are and how useful they are:

“A repeated table has all of the rows in blocks that are identical on all worker nodes of a distributed server.

Repeated tables are useful in some operations, such as table joins, where the rows of a dimension table need to be matched against the keys in the fact table. If all the rows of the dimension table are repeated on each worker node, then the join can be completed without exchanging rows between the worker nodes. Repeated tables are not managed for fault tolerance because each node has all the rows.”

The following figure depicts a repeated table, Table B, and a distributed table, Table A, as described previously.

A join involving table A (for instance a big fact table) and table B (for instance a small dimension table) on a common variable (or set of variables) can be processed separately and independently on each CAS worker without having to move data from one node to another. This is particularly efficient.
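As an illustration of that layout, here is a hedged sketch that builds a small dimension table from sashelp.prdsale, loads it as a repeated table, and joins it to a distributed fact table. The table names fact_sales, dim_product and sales_by_type are made up for this example, and the "caspath" caslib and mySession session are assumed to exist as in the other samples. Whether a particular join actually avoids moving rows between workers depends on the procedure or action performing it; this only shows how the two tables would be set up.

```sas
/* Sketch only: one row per product, to serve as a small dimension table. */
proc sort data=sashelp.prdsale(keep=product prodtype)
          out=work.dim_product nodupkey ;
   by product ;
run ;

proc casutil ;
   /* fact table: distributed across the workers (default behavior) */
   load data=sashelp.prdsale(drop=prodtype) outcaslib="caspath"
      casout="fact_sales" replace ;
   /* dimension table: a full copy on each worker */
   load data=work.dim_product outcaslib="caspath"
      casout="dim_product" replace repeat ;
quit ;

/* Join the distributed fact table to the repeated dimension table. */
proc fedsql sessref=mySession ;
   create table caspath.sales_by_type {options replace=true} as
   select f.product, d.prodtype, f.actual
   from caspath.fact_sales f
   inner join caspath.dim_product d
   on f.product = d.product ;
quit ;
```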

Here is sample code that loads data into CAS while specifying the REPEAT/DUPLICATE option:

/* Using CASUTIL procedure */
proc casutil ;
   load data=sashelp.prdsale outcaslib="caspath"
      casout="prdsale" replace repeat ;
quit ;

/* Using Data Step */
data myCas.prdsale2(duplicate=yes) ;
   set myCas.prdsale ;
run ;

Notice that DUPLICATE is a synonym for REPEAT. The CASUTIL procedure relies on REPEAT, while the Data Step relies on DUPLICATE. Also, the DUPLICATE/REPEAT option is not supported in FedSQL and DS2, and is not supported on all CASLIB data sources.

The “Duplicated Rows” (table.tableInfo CAS action) or “Repeated Table” (CASUTIL contents statement) output column indicates whether the table is repeated or not. COPIES/REPLICATION is ignored on a repeated table.
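Both checks can be sketched as follows, assuming an active CAS session and the "prdsale" table loaded with REPEAT into the "caspath" caslib as in the CASUTIL example above:

```sas
/* Sketch: check whether a table is repeated, in two ways.             */
/* Assumes the "prdsale" table from the CASUTIL REPEAT example above.  */
proc cas ;
   table.tableInfo /
      caslib="caspath"
      name="prdsale" ;   /* look at the "Duplicated Rows" column */
quit ;

proc casutil ;
   contents casdata="prdsale" incaslib="caspath" ;   /* look at "Repeated Table" */
quit ;
```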

In conclusion, do not confuse copies for data redundancy (COPIES/REPLICATION option) with copies for repeated tables (REPEAT/DUPLICATE option). They play different roles:

Data redundancy copies are useful for fault tolerance
- Only one copy of the table is active across all CAS workers
Repeated table copies are useful for performance (and are fault tolerant by definition)
- Each copy is active on each CAS worker

sanjays9 · ‎09-14-2017

Well written article. It was easy to understand the difference between copies and repeat options. Thanks

cici0017 · ‎06-01-2022

Does this mean that the best practice for proper fault tolerance is to define both the COPIES/REPLICATION option and the REPEAT/DUPLICATE option? Is there a case where defining only REPLICATION=1 could leave the surviving CAS workers without a full copy of the data set when some of the workers are shut down?
