Hello there.
I would like to know how to perform key deduplication of a CAS table more efficiently than using proc sort. According to this document it is possible to perform this task using hash tables, but the doc uses non-CAS tables.
Heres a sample code using non CAS tables:
data testdata(drop=i j);
array vars var1-var5;
do i=1 to 10e6;
do j=1 to dim(vars);
vars[j]=rand('integer', 1, 10);
end;
output;
end;
run;
proc sort data=testdata out=test1 nodupkey;
by _ALL_;
run;
data test2;
if _N_ = 1 then do;
declare hash h(hashexp:20);
h.defineKey('var1', 'var2', 'var3', 'var4', 'var5');
h.defineDone();
end;
set testdata;
if h.check() ne 0 then do;
output;
h.add();
end;
run;
When trying to use the same code with CAS, I noticed that the resulting table still contains duplicate records, and I assume this is related to parallel execution on different workers, but I'm not sure.
I would appreciate the community's help: is it possible to perform record deduplication of a CAS table using hash tables? If so, how?
Environment information: SAS Viya 3.5 with 4 CAS workers running on linux
Thank you!
@alisio_meneses , Have you tried the deduplicate action in PROC CAS ?
data example;
set sashelp.class;
output;
output;
run;
proc casutil;
droptable casdata="example" incaslib="casuser" quiet;
load data=work.example
outcaslib="casuser"
casout="example"
replace;
run;
proc cas;
deduplication.deduplicate / table = {caslib="casuser", name="example", groupBy={"name"}}
noDuplicateKeys=true
casOut={name="deduplicated", caslib="casuser", replace=true};
quit;
I don't have viya , but if you do:
data test2 /single=yes ;
I think that should force the step to run in a single thread, as in this example: https://documentation.sas.com/doc/en/pgmsascdc/9.4_3.5/casdspgm/p10ux48pz7qknzn1ux1bs48vplnx.htm#n05... .
Hi! Thanks for replying. I think you may be right, and it might work. But if it should work, doesn't it limit the usage of CAS resources (like massive parallel processing), thus making it possibly slower than using PROC SORT. I'll have a go anyway.
Yes, probably. : )
My understanding is if you want to do things that rely on order of the data being processed (e.g. retain statement, lag, etc), you need to run it single threaded, to avoid different threads getting different chunks of data. But again, that's just based on reading the docs a bit and chatting with folks, I haven't tested this. If you turn on that option, does it work?
Generally, I wouldn't try to beat PROC SORT. In my experience most SAS PROCS are optimized better than I could do. : )
Maybe try PROC CAS for de-duping? https://documentation.sas.com/doc/en/pgmsascdc/9.4_3.5/caspg/n1gx5u07e5tqivn1n680voj4ozff.htm
Below multiple options that all return a deduped table. The already proposed CAS action deduplication.deduplicate performed best closely followed by Proc Sort. I assume timings for Proc Sort could be quite different if data volumes are different (like a long string variable). ...turns out Proc Sort executes fully in CAS and though the only difference in time compared to the deduplicate action must be caused by the overhead "mapping" the Proc Sort syntax to the CAS action.
Also the data step works both single and multi threaded but especially single threaded is much much slower on my 4 nodes 192 threads environment.
CAS by group processing: It's also somehow in the documentation but the clearest statement I could find is as below from forum paper Accelerate DATA Step BY-Group Processing in SAS® Viya®
PARALLEL PROCESSING When a DATA step runs in CAS, BY groups are assigned to threads. Many BY groups can be assigned to a single thread. However, a BY group cannot span threads.
What this means is you can use a data step /if first construct to dedup your data because all the data belonging to a by group will get processed by a single thread. The downside of this in a MPP environment is of course that there is potentially a lot of data movement between nodes.
Below working sample code executed using a recent Viya 4 environment but based on documentation all or at least most of the options should work for any Viya version.
%let sessref=MySess;
%if %sysfunc(sessfound(&sessref)) %then
%do;
cas mySess terminate;
%end;
cas &sessref cassessopts=(caslib="casuser");
libname casuser cas;
data casuser.testdata(drop=i j);
array vars var1-var5;
do i=1 to 10e6;
do j=1 to dim(vars);
vars[j]=rand('integer', 1, 10);
end;
output;
end;
run;
/* option 1: CAS action */
proc sql noprint;
select cats('{name="',name,'"}') into :var_list separated by ','
from dictionary.columns
where libname='CASUSER' and memname='TESTDATA'
;
quit;
%put &=var_list;
proc cas;
deduplication.deduplicate /
table={ caslib="casuser", name="testdata",
groupBy={&var_list}
},
casOut={caslib="casuser", name="dedup" replace=true},
noDuplicateKeys=true
;
quit;
/* option 2: Proc Sort */
proc sort data=casuser.testdata out=casuser.dedup nodupkey;
by _all_;
run;
/* option 3: CAS data step */
proc sql noprint;
select name
into :sort_list separated by ' '
from dictionary.columns
where libname='CASUSER' and memname='TESTDATA'
;
quit;
%let last_var=%scan(&sort_list,-1);
%put &=sort_list;
%put &=last_var;
/* single threaded */
data casuser.dedup /single=yes;
set casuser.testdata;
by &sort_list;
if first.&last_var;
run;
/* multi threaded */
data casuser.dedup;
set casuser.testdata;
by &sort_list;
if first.&last_var;
run;
cas &sessref terminate;
And here the SAS log with the timings
88 data casuser.testdata(drop=i j); 89 array vars var1-var5; 90 do i=1 to 10e6; 91 do j=1 to dim(vars); 92 vars[j]=rand('integer', 1, 10); 93 end; 94 output; 95 end; 96 run; NOTE: Running DATA step in Cloud Analytic Services. NOTE: The DATA step has no input data set and will run in a single thread. NOTE: The table testdata in caslib CASUSER(***) has 10000000 observations and 5 variables. NOTE: DATA statement used (Total process time): real time 3.75 seconds cpu time 0.04 seconds 97 98 /* option 1: CAS action */ 99 proc sql noprint; 100 select cats('{name="',name,'"}') into :var_list separated by ',' 101 from dictionary.columns 102 where libname='CASUSER' and memname='TESTDATA' 103 ; 104 quit; NOTE: PROCEDURE SQL used (Total process time): real time 0.09 seconds cpu time 0.00 seconds 105 %put &=var_list; VAR_LIST={name="var1"},{name="var2"},{name="var3"},{name="var4"},{name="var5"} 106 107 proc cas; 108 deduplication.deduplicate / 109 table={ caslib="casuser", name="testdata", 110 groupBy={&var_list} 111 }, 112 casOut={caslib="casuser", name="dedup" replace=true}, 113 noDuplicateKeys=true 114 ; 115 116 quit; NOTE: Active Session now MYSESS. NOTE: Added action set 'deduplication'. NOTE: There were 10000000 rows read from the table TESTDATA. NOTE: The table dedup has 100000 rows and 5 columns. NOTE: PROCEDURE CAS used (Total process time): real time 3.63 seconds cpu time 0.02 seconds 117 118 /* option 2: Proc Sort */ 119 proc sort data=casuser.testdata out=casuser.dedup nodupkey; 120 by _all_; 121 run; NOTE: The data set CASUSER.DEDUP has 100000 observations and 5 variables. NOTE: Processing was performed in CAS. NOTE: PROCEDURE SORT used (Total process time): real time 3.88 seconds cpu time 0.05 seconds 122 123 /* option 3: CAS data step */ 124 proc sql noprint; 125 select name 126 into :sort_list separated by ' ' 127 from dictionary.columns 128 where libname='CASUSER' and memname='TESTDATA' 129 ; 130 quit; NOTE: PROCEDURE SQL used (Total process time): real time 0.10 seconds cpu time 0.00 seconds 131 %let last_var=%scan(&sort_list,-1); 132 %put &=sort_list; SORT_LIST=var1 var2 var3 var4 var5 133 %put &=last_var; LAST_VAR=var5 134 135 /* single threaded */ 136 data casuser.dedup /single=yes; 137 set casuser.testdata; 138 by &sort_list; 139 if first.&last_var; 140 run; NOTE: Running DATA step in Cloud Analytic Services. NOTE: There were 10000000 observations read from the table TESTDATA in caslib CASUSER(***). NOTE: The table dedup in caslib CASUSER(***) has 100000 observations and 5 variables. NOTE: DATA statement used (Total process time): real time 1:12.60 cpu time 0.14 seconds 141 142 /* multi threaded */ 143 data casuser.dedup; 144 set casuser.testdata; 145 by &sort_list; 146 if first.&last_var; 147 run; NOTE: Running DATA step in Cloud Analytic Services. NOTE: The DATA step will run in multiple threads. NOTE: There were 10000000 observations read from the table TESTDATA in caslib CASUSER(***). NOTE: The table dedup in caslib CASUSER(***) has 100000 observations and 5 variables. NOTE: DATA statement used (Total process time): real time 5.14 seconds cpu time 0.03 seconds 148 149 cas &sessref terminate; NOTE: Libref CASUSER has been deassigned. NOTE: Deletion of the session MYSESS was successful. NOTE: The default CAS session MYSESS identified by SAS option SESSREF= was terminated. Use the OPTIONS statement to set the SESSREF= option to an active session. NOTE: Request to TERMINATE completed for session MYSESS.
As a side note:
When using a hash table in a CAS data step then there will be a separate hash table per thread as soon as you use a hash method that requires write access to the hash. If you only use read methods then the hash will get created only once.
LOL! And after all of my testing as share in my previous post I was wondering why Proc Sort performs that well and found: SAS Cloud Analytic Services Processing for PROC SORT
Turns out that the "usual" Proc Sort Nodupkey syntax already fully executes in CAS.
NOTE: Processing was performed in CAS.
99 /* option 2: Proc Sort */
100 proc sort data=casuser.testdata out=casuser.dedup nodupkey;
101 by _all_;
102 run;
NOTE: The data set CASUSER.DEDUP has 100000 observations and 5 variables.
NOTE: Processing was performed in CAS.
Thanks for all those examples @Patrick ! Quick follow-up question.
I'm confused how these steps, which read un-sorted data, can work with the BY statement (which of course on SAS 9 would require data to be sorted by the BY variables).
/* single threaded */
data casuser.dedup /single=yes;
set casuser.testdata;
by &sort_list;
if first.&last_var;
run;
/* multi threaded */
data casuser.dedup;
set casuser.testdata;
by &sort_list;
if first.&last_var;
run;
Is CAS just sorting in the background when you use a BY statement, or is something more interesting going on?
When executing in CAS data belonging to the same by group are automatically sent to the same node, therefore sorting is not required. If you have more nested levels, that is sorted within that node.
Thanks @Mazi , I was wondering what would happen when there are more by-groups than nodes. That's cool that each node would wait to get all of its data and then sort its data when needed, before iterating through it.
So if you write a data step with a BY statement and the input dataset is an unsorted CAS table, then the controller node needs to look at the full set of BY values and sort (or index) it into BY groups before sending each BY group off to a CPU, is that it? So there is basically an implicit sort or indexing happening. Where the controller node needs to find all of the rows in a non-contiguous by group and then send them off to a single CPU?
It's almost like in CAS, the DATA step BY statement is able to do grouping similar to what the CLASS statement can do in a SAS 9 PROC.
@Quentin There are things I know and understand and the rest is filling gaps with theories which is why doing all this testing with coding options is so valuable for me.
To not further hijack the OPs question I suggest you ask a new question where we can have further discussions and testing.
@alisio_meneses , Have you tried the deduplicate action in PROC CAS ?
data example;
set sashelp.class;
output;
output;
run;
proc casutil;
droptable casdata="example" incaslib="casuser" quiet;
load data=work.example
outcaslib="casuser"
casout="example"
replace;
run;
proc cas;
deduplication.deduplicate / table = {caslib="casuser", name="example", groupBy={"name"}}
noDuplicateKeys=true
casOut={name="deduplicated", caslib="casuser", replace=true};
quit;
Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9. Sign up by March 14 for just $795.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.