SAS Viya: CAS 'Duplicate Value Reduction'

2 Likes

SASHDAT tables can get big. Compression is one way to address this but the performance impact can be significant. Using Parquet files instead of SASHDAT is another approach but has limited applicability. Duplicate Value Reduction gives another mechanism to help.

.

What is Duplicate Value Reduction (DVR) and how should we apply it to our CAS tables? Let's take a look.

The Problem DVR Addresses -- Repeating Values

To understand DVR, we first have to understand the problem it addresses -- Repeating Values. "What???" you ask. We probably need to be more specific. After all "repeating values" could imply arrays, duplicate rows, pattern data, and so on.

The particular form of repeating values DVR helps with results from data denormalization. When we prepare data for analysis, we invariably flatten it from its idealized third normal form down to a single table or maybe a star schema. During this process, categorical data that applies to multiple child records becomes repeated.

To illustrate, let's look at a simplified data model that includes CUSTOMER ORDER and GEOGRAPHY data. .

Normalized

3rd normal schema; No data values repeat.

In 3rd normal form, no data values repeat. There is only one Texas, only one Cary, only one Dallas, etc..

⇩

Star Schema

Star Schema; Some repeating values

Combining the reference tables, city and state, gives us a star schema where we start to see values repeating. Now we have two 'Texas' values. This will of course take twice as much space to store as only one 'Texas.' Since reference tables tend to be small, the extra storage due to duplication in this form is generally not a large concern. .

⇩

Fully Flattened

Fully Flattened; Lots of repeating values

Combining the reference tables with the transaction table gives us a fully flattened table. This is the form most optimized for CAS. There is no joining to do. CAS can simply spin through the records for aggregation and analysis. CAS can directly treat the records as vectors for advanced analytics.

Here, however, we start to see considerable value duplication. Now we have two 'Houston,' two 'Cary,' two 'North Carolina,' and three 'Texas' values. So what used to take 36 bytes now takes 77, and of course the real problem starts when we add more and more customer order records. Double the customer order records and we ~double the storage requirements. With the same geographic rollups, the 77 bytes becomes 154 when we double the number of records, and then becomes 308 when we double again, 619, 1238, and so on. It isn't long before we have a lot more storage required.

While this simple example was designed only to illustrate the point, value duplication due to denormalization becomes an even bigger concern when you add in all of the extra reference data, customer names, addresses, phone numbers, email addresses, product descriptions, product categories, enhanced geographic data, customer demographics, etc.. On a fully flattened table, all of this voluminous data repeats over and over again.

How DVR Helps Reduce Table Size

While the details of DVR are complex, we can think of DVR internally (at the block level) storing fully flattened CAS tables more like star schemas. Thus mitigating some of the value duplications.

.

When to Use it

Ok, now that we have some idea of what DVR is, let's forget all that and just treat it like compression! Oh, all right, well maybe we should remember a little about it so we know that it will help the most with ultra-wide tables that include lots of categorical fields and won't help at all with "correlation matrix" tables, tables that only have a key and numbers (doubles).

Example Time!

Now let's DVR MegaCorp! If you remember from previous GEL courses, mega_corp is a fairly classic example of a data mart table. It's fully denormalized and has a few different rollups on it including geography, manufacturing facility, product line -> product category..., etc.. The mega_corp table on the GEL environments has about 1.6M rows. I concatenated it to itself 10 times though. So the table used here has 16M rows.

Here is the code I used to DVR it. Note the memoryFormat="DVR" option.: .

proc cas; 
  table.copyTable /
    table={name="mega_corp_10" caslib="public"}
    casOut={name="mega_corp_10_dvr" caslib="public" memoryFormat="DVR" replace=True replication=0};
run;

.

The resulting DVR'ed table is 73% smaller than the original mega_corp_10 table as shown by the tableDetails action below.

tableDetails Output

Performance Impact?

While we'll need more time with DVR to assess its impact on performance, my initial (very limited) tests are extremely promising. In my simple tests, the DVR table outperformed the un-DVR table every time. The logs from the tests are shown below. .

79   proc cas;
80   simple.summary / table={name="mega_corp_10" caslib="public" groupby={"unit"}}
81   subset={"Mean"} inputs={"profit" "ProductMaterialCost"};
82   run;
NOTE: Active Session now MYSESSION.
NOTE: Executing action 'simple.summary'.
NOTE: Action 'simple.summary' used (Total process time):
NOTE:       real time               0.537396 seconds
NOTE:       cpu time                5.895829 seconds (1097.11%)
NOTE:       total nodes             3 (24 cores)
NOTE:       total memory            188.28G
NOTE:       memory                  5.82M (0.00%)
83   
NOTE: The PROCEDURE CAS printed pages 378-446.
NOTE: PROCEDURE CAS used (Total process time):
      real time           1.01 seconds
      cpu time            0.49 seconds
      
84   proc cas;
85   simple.summary / table={name="mega_corp_10_dvr" caslib="public" groupby={"unit"}}
86   subset={"Mean"} inputs={"profit" "ProductMaterialCost"};
87   run;
NOTE: Active Session now MYSESSION.
NOTE: Executing action 'simple.summary'.
NOTE: Action 'simple.summary' used (Total process time):
NOTE:       real time               0.525608 seconds
NOTE:       cpu time                5.325841 seconds (1013.27%)
NOTE:       total nodes             3 (24 cores)
NOTE:       total memory            188.28G
NOTE:       memory                  5.75M (0.00%)
88   
NOTE: The PROCEDURE CAS printed pages 447-515.
NOTE: PROCEDURE CAS used (Total process time):
      real time           0.97 seconds
      cpu time            0.46 seconds
      
89   proc cas;
90   simple.summary / table={name="mega_corp_10" caslib="public" groupby={"productBrand"}}
91   subset={"Mean"} inputs={"profit" "ProductMaterialCost"};
92   run;
NOTE: Active Session now MYSESSION.
NOTE: Executing action 'simple.summary'.
NOTE: Action 'simple.summary' used (Total process time):
NOTE:       real time               0.663037 seconds
NOTE:       cpu time                4.892792 seconds (737.94%)
NOTE:       total nodes             3 (24 cores)
NOTE:       total memory            188.28G
NOTE:       memory                  5.65M (0.00%)
93   
NOTE: The PROCEDURE CAS printed pages 516-517.
NOTE: PROCEDURE CAS used (Total process time):
      real time           0.69 seconds
      cpu time            0.03 seconds
      
94   proc cas;
95   simple.summary / table={name="mega_corp_10_dvr" caslib="public" groupby={"productBrand"}}
96   subset={"Mean"} inputs={"profit" "ProductMaterialCost"};
97   run;
NOTE: Active Session now MYSESSION.
NOTE: Executing action 'simple.summary'.
NOTE: Action 'simple.summary' used (Total process time):
NOTE:       real time               0.514195 seconds
NOTE:       cpu time                4.461704 seconds (867.71%)
NOTE:       total nodes             3 (24 cores)
NOTE:       total memory            188.28G
NOTE:       memory                  5.57M (0.00%)

Conclusion

DVR appears to be a very promising mechanism for restricting the size of CAS and SASHDAT tables in new SAS Viya.