
CAS is Fast – DQMatch Edition

Started 03-01-2022
Modified 03-01-2022

Building on my "CAS is fast" post, today we'll compare match code generation between Foundation SAS (a.k.a. the Compute Server) and CAS. In response to concerns that "CAS is only for big data," we'll use data of varying sizes. On each row, we'll generate a match code for a customer name field at 95% sensitivity.
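To make "a match code at 95% sensitivity" concrete, here is a minimal sketch. The sample names and the work.name_variants table are made up for illustration, and it assumes the ENUSA locale has already been loaded with %DQLOAD, as in the project code later in this post. Higher sensitivity requires values to be more alike before they share a match code, so comparing two sensitivities on name variants is one way to see the effect.

```sas
/* Illustrative only: sample names are invented, not taken from the test data. */
data work.name_variants;
  length name $32 mc95 mc55 $100;
  infile datalines truncover;
  input name $32.;
  /* Same call shape as the tests in this post, at two sensitivities */
  mc95 = dqMatch(name, 'NAME', 95, 'ENUSA');
  mc55 = dqMatch(name, 'NAME', 55, 'ENUSA');
  datalines;
Robert Smith
Bob Smith
Mr. Robert J. Smith
;
run;

proc print data=work.name_variants noobs;
run;
```

Rows that end up with the same value in a match code column would be treated as matches at that sensitivity.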

 

Test Results

 

The results below show that CAS does indeed process the data more slowly at extremely small volumes (hundreds to thousands of rows) but overtakes Compute once volumes reach the tens of thousands of rows: at 59,900 rows, CAS finishes in 11.85s versus 25.85s for Compute.

 

DQ Match Code Generation Comparison Test Results

Engine                     Rows      Size (MB)        Match Code Generation Time
Compute (Foundation SAS)   599       0.256            0.31 s
CAS                        599       0.153            10.63 s
Compute                    5990      1.5              3.02 s
CAS                        5990      1.5              10.99 s
Compute                    59900     14               25.85 s
CAS                        59900     15               11.85 s
Compute                    599000    140              4 min 04 s
CAS                        599000    153              27.14 s
Compute                    5990000   1392             50 min 25 s
CAS                        5990000   1533 (238 DVR)   1 min 55 s
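To put the timings on a common scale, the ratio of Compute time to CAS time can be computed for each volume. The sketch below transcribes the reported times into seconds (rounded to the values shown in the table); the table name work.dqmatch_timings and the variable names are hypothetical.

```sas
/* Timings transcribed from the results table, converted to seconds. */
data work.dqmatch_timings;
  input rows compute_sec cas_sec;
  /* speedup > 1 means CAS was faster; < 1 means Compute was faster */
  speedup = compute_sec / cas_sec;
  format speedup 8.2;
  datalines;
599 0.31 10.63
5990 3.02 10.99
59900 25.85 11.85
599000 244 27.14
5990000 3025 115
;
run;

proc print data=work.dqmatch_timings noobs;
run;
```

A ratio below 1 at the two smallest volumes and above 1 from 59,900 rows onward marks the crossover point described above.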


Plotting these results also shows that processing time climbs much more steeply for Compute than for CAS as data volumes increase.

 

sf_1_2022-02-28_10-21-36.png


 

 

Conclusions

 

At extremely small "PoC" data volumes, Foundation SAS (Compute) outperforms CAS on match code generation. However, at more realistic volumes CAS outperforms Compute by a wide margin: roughly 9x faster at 599,000 rows and roughly 26x faster at 5.99 million rows.

 

Project Environment, Code, and Log Snippets

 

The environment contains the latest LTS Viya software deployed with three CAS worker nodes and one controller, each with 4 virtual cores and 32GB of RAM.
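One way to check the CAS topology from a SAS session is the builtins.serverStatus action. This is a minimal sketch that assumes a CAS session named mySession is already connected, as in the project code below; the result variable name status is arbitrary.

```sas
/* Assumes the CAS session mySession from the project code is active. */
proc cas;
  session mySession;
  /* Built-in action that reports server and per-node status */
  builtins.serverStatus result=status;
  print status;
quit;
```

The node and core counts also appear in the "total nodes" lines of the CAS log when metrics=true is set.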

 

Project Code -- Not Production Quality; Not Meant to Run Without Intervention

 


CAS mySession  SESSOPTS=(metrics=true);

CASLIB _all_ assign;

/* Prepare the compute server test */

data sasdm.customer_list; 
length id 8 name $32 address $50 'zip code'n $10 phone $10 city $32 country $32 notes $50 sid 8;
set saspgdvd.customer_list; 
run;

data sasdm.customer_list10 ;
set sasdm.customer_list sasdm.customer_list sasdm.customer_list sasdm.customer_list sasdm.customer_list
    sasdm.customer_list sasdm.customer_list sasdm.customer_list sasdm.customer_list sasdm.customer_list;
run;

data sasdm.customer_list100;
set sasdm.customer_list10 sasdm.customer_list10 sasdm.customer_list10 sasdm.customer_list10 sasdm.customer_list10
    sasdm.customer_list10 sasdm.customer_list10 sasdm.customer_list10 sasdm.customer_list10 sasdm.customer_list10;
run;

data sasdm.customer_list1000;
set sasdm.customer_list100 sasdm.customer_list100 sasdm.customer_list100 sasdm.customer_list100 sasdm.customer_list100
    sasdm.customer_list100 sasdm.customer_list100 sasdm.customer_list100 sasdm.customer_list100 sasdm.customer_list100;
run;

data sasdm.customer_list10000;
set sasdm.customer_list1000 sasdm.customer_list1000 sasdm.customer_list1000 sasdm.customer_list1000 sasdm.customer_list1000
    sasdm.customer_list1000 sasdm.customer_list1000 sasdm.customer_list1000 sasdm.customer_list1000 sasdm.customer_list1000;
run;

proc contents data=sasdm.customer_list;run;
proc contents data=sasdm.customer_list10;run;
proc contents data=sasdm.customer_list100;run;
proc contents data=sasdm.customer_list1000;run;
proc contents data=sasdm.customer_list10000;run;

/*Prepare the CAS test*/
data dm_pgdvd.customer_list (copies=0 promote=yes);
set sasdm.customer_list;
run;

data dm_pgdvd.customer_list10 (copies=0 promote=yes);
set sasdm.customer_list10;
run;

data dm_pgdvd.customer_list100 (copies=0 promote=yes);
set sasdm.customer_list100;
run;

data dm_pgdvd.customer_list1000 (copies=0 promote=yes);
set sasdm.customer_list1000;
run;

data dm_pgdvd.customer_list10000 (copies=0 promote=yes);
set sasdm.customer_list10000;
run;

proc cas; 
  table.copyTable /
    table={name="customer_list10000" caslib="dm_pgdvd"}
    casOut={name="customer_list10000dvr" caslib="dm_pgdvd" memoryFormat="DVR" replace=True replication=0};
quit;

/* Table Stats */

proc cas;
  table.fileInfo / caslib="dm"  ;
quit ;

proc cas;
  table.tableInfo / caslib="dm_pgdvd" name="customer_list" ;
  table.tableInfo / caslib="dm_pgdvd" name="customer_list10" ;
  table.tableInfo / caslib="dm_pgdvd" name="customer_list100" ;
  table.tableInfo / caslib="dm_pgdvd" name="customer_list1000" ;
  table.tableInfo / caslib="dm_pgdvd" name="customer_list10000" ;
  table.tableInfo / caslib="dm_pgdvd" name="customer_list10000dvr" ;
quit ;

proc cas;
  table.columnInfo / table={caslib="dm_pgdvd" name="customer_list"} ;
  table.columnInfo / table={caslib="dm_pgdvd" name="customer_list10"} ;
  table.columnInfo / table={caslib="dm_pgdvd" name="customer_list100"} ;
  table.columnInfo / table={caslib="dm_pgdvd" name="customer_list1000"} ;
  table.columnInfo / table={caslib="dm_pgdvd" name="customer_list10000"} ;
  table.columnInfo / table={caslib="dm_pgdvd" name="customer_list10000dvr"} ;
quit ;


proc cas;
  table.tabledetails / caslib="dm_pgdvd" name="customer_list" level="SUM";
  table.tabledetails / caslib="dm_pgdvd" name="customer_list10" level="SUM";
  table.tabledetails / caslib="dm_pgdvd" name="customer_list100" level="SUM";
  table.tabledetails / caslib="dm_pgdvd" name="customer_list1000" level="SUM";
  table.tabledetails / caslib="dm_pgdvd" name="customer_list10000" level="SUM";
  table.tabledetails / caslib="dm_pgdvd" name="customer_list10000dvr" level="SUM";
quit ;

/* Load the DQ locale */

%DQLOAD(DQLOCALE=(ENUSA), DQSETUPLOC='/opt/sas/viya/home/share/refdata/qkb/QKB CI 32/qkb-ci-32.1.3-qkb-viya.qarc');

/* Run the compute server test */

data sasdm.customerMatchCode; 
length mcName $100;
set sasdm.customer_list;
mcName=dqMatch(name,'NAME',95,'ENUSA') ; 
run;

data sasdm.customerMatchCode; 
length mcName $100;
set sasdm.customer_list10;
mcName=dqMatch(name,'NAME',95,'ENUSA') ; 
run;

data sasdm.customerMatchCode; 
length mcName $100;
set sasdm.customer_list100;
mcName=dqMatch(name,'NAME',95,'ENUSA') ; 
run;

data sasdm.customerMatchCode; 
length mcName $100;
set sasdm.customer_list1000;
mcName=dqMatch(name,'NAME',95,'ENUSA') ; 
run;

data sasdm.customerMatchCode; 
length mcName $100;
set sasdm.customer_list10000;
mcName=dqMatch(name,'NAME',95,'ENUSA') ; 
run;

/* Run the CAS test */

data dm_pgdvd.customerMatchCode ;
length mcName $100;
set dm_pgdvd.customer_list;
mcName=dqMatch(name,'NAME',95,'ENUSA') ; 
run;

data dm_pgdvd.customerMatchCode ;
length mcName $100;
set dm_pgdvd.customer_list10;
mcName=dqMatch(name,'NAME',95,'ENUSA') ; 
run;

data dm_pgdvd.customerMatchCode ;
length mcName $100;
set dm_pgdvd.customer_list100;
mcName=dqMatch(name,'NAME',95,'ENUSA') ; 
run;

data dm_pgdvd.customerMatchCode ;
length mcName $100;
set dm_pgdvd.customer_list1000;
mcName=dqMatch(name,'NAME',95,'ENUSA') ; 
run;


data dm_pgdvd.customerMatchCode ;
length mcName $100;
set dm_pgdvd.customer_list10000;
mcName=dqMatch(name,'NAME',95,'ENUSA') ; 
run;
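
The ten test steps above differ only in the library and the table suffix, so they could be driven by a small macro. This is a sketch only: run_dqmatch, its parameters, and the BASE placeholder (standing in for the unsuffixed customer_list table) are hypothetical names and are not part of the original project code.

```sas
/* Hypothetical helper macro: runs the same match code step per table size. */
%macro run_dqmatch(lib=, suffixes=);
  %local i sfx;
  %do i = 1 %to %sysfunc(countw(&suffixes, %str( )));
    %let sfx = %scan(&suffixes, &i, %str( ));
    /* BASE stands for the original table with no numeric suffix */
    %if &sfx = BASE %then %let sfx = ;
    data &lib..customerMatchCode;
      length mcName $100;
      set &lib..customer_list&sfx;
      mcName = dqMatch(name, 'NAME', 95, 'ENUSA');
    run;
  %end;
%mend run_dqmatch;

/* Compute server test, then CAS test */
%run_dqmatch(lib=sasdm,    suffixes=BASE 10 100 1000 10000);
%run_dqmatch(lib=dm_pgdvd, suffixes=BASE 10 100 1000 10000);
```

Each iteration still writes its own "DATA statement used" note to the log, so the per-volume timings can be read the same way as in the log excerpts below.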


 

 

Test Log -- Compute Running on 5.99M Rows

 


79   data sasdm.customerMatchCode;
80   length mcName $100;
81   set sasdm.customer_list10000;
82   mcName=dqMatch(name,'NAME',95,'ENUSA') ;
83   run;
NOTE: There were 5990000 observations read from the data set SASDM.CUSTOMER_LIST10000.
NOTE: The data set SASDM.CUSTOMERMATCHCODE has 5990000 observations and 10 variables.
NOTE: DATA statement used (Total process time):
      real time           50:25.07
      cpu time            50:34.12 

 

Test Log -- CAS Running on 5.99M Rows

 

79   data dm_pgdvd.customerMatchCode ;
80   length mcName $100;
81   set dm_pgdvd.customer_list10000;
NOTE: Executing action 'table.tableInfo'.
NOTE: Action 'table.tableInfo' used (Total process time):
NOTE:       real time               0.016233 seconds
NOTE:       cpu time                0.017619 seconds (108.54%)
NOTE:       total nodes             4 (32 cores)
NOTE:       total memory            251.04G
NOTE:       memory                  1.58M (0.00%)
NOTE: Executing action 'table.tableInfo'.
NOTE: Action 'table.tableInfo' used (Total process time):
NOTE:       real time               0.010916 seconds
NOTE:       cpu time                0.013365 seconds (122.43%)
NOTE:       total nodes             4 (32 cores)
NOTE:       total memory            251.04G
NOTE:       memory                  1.58M (0.00%)
NOTE: Executing action 'table.columnInfo'.
NOTE: Action 'table.columnInfo' used (Total process time):
NOTE:       real time               0.031052 seconds
NOTE:       cpu time                0.026757 seconds (86.17%)
NOTE:       total nodes             4 (32 cores)
NOTE:       total memory            251.04G
NOTE:       memory                  3.41M (0.00%)
82   mcName=dqMatch(name,'NAME',95,'ENUSA') ;
83   run;
NOTE: Running DATA step in Cloud Analytic Services.
NOTE: Executing action 'sessionProp.getSessOpt'.
NOTE: Action 'sessionProp.getSessOpt' used (Total process time):
NOTE:       real time               0.018381 seconds
NOTE:       cpu time                0.014225 seconds (77.39%)
NOTE:       total nodes             4 (32 cores)
NOTE:       total memory            251.04G
NOTE:       memory                  851.53K (0.00%)
NOTE: Executing action 'sessionProp.setSessOpt'.
NOTE: Action 'sessionProp.setSessOpt' used (Total process time):
NOTE:       real time               0.012184 seconds
NOTE:       cpu time                0.015754 seconds (129.30%)
NOTE:       total nodes             4 (32 cores)
NOTE:       total memory            251.04G
NOTE:       memory                  1.07M (0.00%)
NOTE: The DATA step will run in multiple threads.
NOTE: Executing action 'dataStep.runBinary'.
NOTE: There were 5990000 observations read from the table CUSTOMER_LIST10000 in caslib DM_PGDVD.
NOTE: The table customerMatchCode in caslib DM_PGDVD has 5990000 observations and 10 variables.
NOTE: Action 'dataStep.runBinary' used (Total process time):
NOTE:       real time               114.835489 seconds
NOTE:       cpu time                2115.218305 seconds (1841.96%)
NOTE:       data movement time      0.055814 seconds
NOTE:       total nodes             4 (32 cores)
NOTE:       total memory            251.04G
NOTE:       memory                  4.04G (1.61%)
NOTE:       bytes moved             2.01G
NOTE: Executing action 'sessionProp.setSessOpt'.
NOTE: Action 'sessionProp.setSessOpt' used (Total process time):
NOTE:       real time               0.011197 seconds
NOTE:       cpu time                0.014895 seconds (133.03%)
NOTE:       total nodes             4 (32 cores)
NOTE:       total memory            251.04G
NOTE:       memory                  1.07M (0.00%)
NOTE: DATA statement used (Total process time):
      real time           1:55.11
      cpu time            0.53 seconds

 

Notes: 

  • We ran both the DATA step and PROC DQMATCH on Compute, and both performed similarly.
  • We tested both DVR and non-DVR formats in CAS. Performance on both formats was similar.
  • These tests were run on virtual hardware and should not be treated as benchmarks.
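
The first note mentions PROC DQMATCH, but the project code shows only the DATA step variant. The sketch below shows what an equivalent procedure call on Compute could look like with the same match definition, sensitivity, and locale. The output table name work.customerMatchCodeProc is hypothetical, and this is not necessarily the exact call used in the test.

```sas
/* Sketch of a PROC DQMATCH equivalent; assumes %DQLOAD has already run. */
proc dqmatch data=sasdm.customer_list
             out=work.customerMatchCodeProc
             locale='ENUSA';
  /* One simple match code on the name column at 95% sensitivity */
  criteria matchdef='Name' var=name sensitivity=95 matchcode=mcName;
run;
```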

