Solved: Re: How to perform key Deduplication of CAS Tables using Hash Tables.

alisio_meneses · Posted 04-03-2024 01:32 PM

Hello there.

I would like to know how to perform key deduplication of a CAS table more efficiently than using proc sort. According to this document it is possible to perform this task using hash tables, but the doc uses non-CAS tables.

Heres a sample code using non CAS tables:

data testdata(drop=i j);
   array vars var1-var5;
   do i=1 to 10e6;
      do j=1 to dim(vars);
         vars[j]=rand('integer', 1, 10);
      end;
      output;
   end;
run;

proc sort data=testdata out=test1 nodupkey;
   by _ALL_;
run;

data test2;
   if _N_ = 1 then do;
      declare hash h(hashexp:20);
      h.defineKey('var1', 'var2', 'var3', 'var4', 'var5');
      h.defineDone();
   end;
 
   set testdata;
 
   if h.check() ne 0 then do;
      output;
      h.add();
   end;
run;

When trying to use the same code with CAS, I noticed that the resulting table still contains duplicate records, and I assume this is related to parallel execution on different workers, but I'm not sure.

I would appreciate the community's help: is it possible to perform record deduplication of a CAS table using hash tables? If so, how?

Environment information: SAS Viya 3.5 with 4 CAS workers running on linux

Thank you!

Mazi · Posted 04-04-2024 09:02 AM

@alisio_meneses , Have you tried the deduplicate action in PROC CAS ?

data example;
	set sashelp.class;
	output;
	output;
run;

proc casutil;
	droptable casdata="example" incaslib="casuser" quiet;
	load data=work.example
		  outcaslib="casuser"
		  casout="example"
		  replace;
run;

proc cas;
	deduplication.deduplicate / table = {caslib="casuser", name="example", groupBy={"name"}}
								noDuplicateKeys=true
								casOut={name="deduplicated", caslib="casuser", replace=true};
quit;

View solution in original post

Quentin · Posted 04-03-2024 01:44 PM

I don't have viya , but if you do:

data test2 /single=yes ;

I think that should force the step to run in a single thread, as in this example: https://documentation.sas.com/doc/en/pgmsascdc/9.4_3.5/casdspgm/p10ux48pz7qknzn1ux1bs48vplnx.htm#n05... .

The Boston Area SAS Users Group (BASUG) is hosting our in person SAS Blowout on Oct 18!
This full-day event in Cambridge, Mass features four presenters from SAS, presenting on a range of SAS 9 programming topics. Pre-registration by Oct 15 is required.
Full details and registration info at https://www.basug.org/events.

alisio_meneses · Posted 04-03-2024 02:04 PM

Hi! Thanks for replying. I think you may be right, and it might work. But if it should work, doesn't it limit the usage of CAS resources (like massive parallel processing), thus making it possibly slower than using PROC SORT. I'll have a go anyway.

Quentin · Posted 04-03-2024 02:18 PM

Yes, probably. : )

My understanding is if you want to do things that rely on order of the data being processed (e.g. retain statement, lag, etc), you need to run it single threaded, to avoid different threads getting different chunks of data. But again, that's just based on reading the docs a bit and chatting with folks, I haven't tested this. If you turn on that option, does it work?

Generally, I wouldn't try to beat PROC SORT. In my experience most SAS PROCS are optimized better than I could do. : )

Maybe try PROC CAS for de-duping? https://documentation.sas.com/doc/en/pgmsascdc/9.4_3.5/caspg/n1gx5u07e5tqivn1n680voj4ozff.htm

The Boston Area SAS Users Group (BASUG) is hosting our in person SAS Blowout on Oct 18!
This full-day event in Cambridge, Mass features four presenters from SAS, presenting on a range of SAS 9 programming topics. Pre-registration by Oct 15 is required.
Full details and registration info at https://www.basug.org/events.

Patrick · Posted 04-03-2024 10:15 PM

Below multiple options that all return a deduped table. The already proposed CAS action deduplication.deduplicate performed best closely followed by Proc Sort. ~~I assume timings for Proc Sort could be quite different if data volumes are different (like a long string variable).~~ ...turns out Proc Sort executes fully in CAS and though the only difference in time compared to the deduplicate action must be caused by the overhead "mapping" the Proc Sort syntax to the CAS action.

Also the data step works both single and multi threaded but especially single threaded is much much slower on my 4 nodes 192 threads environment.

CAS by group processing: It's also somehow in the documentation but the clearest statement I could find is as below from forum paper Accelerate DATA Step BY-Group Processing in SAS® Viya®

PARALLEL PROCESSING
When a DATA step runs in CAS, BY groups are assigned to threads. Many BY groups can be
assigned to a single thread. However, a BY group cannot span threads.

What this means is you can use a data step /if first construct to dedup your data because all the data belonging to a by group will get processed by a single thread. The downside of this in a MPP environment is of course that there is potentially a lot of data movement between nodes.

Below working sample code executed using a recent Viya 4 environment but based on documentation all or at least most of the options should work for any Viya version.

Spoiler

%let sessref=MySess;
%if %sysfunc(sessfound(&sessref)) %then
  %do;
    cas mySess terminate;
  %end;
cas &sessref cassessopts=(caslib="casuser");
libname casuser cas;

data casuser.testdata(drop=i j);
   array vars var1-var5;
   do i=1 to 10e6;
      do j=1 to dim(vars);
         vars[j]=rand('integer', 1, 10);
      end;
      output;
   end;
run;

/* option 1: CAS action */
proc sql noprint;
  select cats('{name="',name,'"}') into :var_list separated by ','
  from dictionary.columns
  where libname='CASUSER' and memname='TESTDATA'
  ;
quit;
%put &=var_list;

proc cas;
  deduplication.deduplicate /
    table={ caslib="casuser", name="testdata",      
            groupBy={&var_list}
          },                         
    casOut={caslib="casuser", name="dedup" replace=true}, 
    noDuplicateKeys=true
    ;           
  
quit;

/* option 2: Proc Sort */
proc sort data=casuser.testdata out=casuser.dedup nodupkey;
  by _all_;
run;

/* option 3: CAS data step */
proc sql noprint;
  select name 
    into :sort_list separated by ' '
  from dictionary.columns
  where libname='CASUSER' and memname='TESTDATA'
  ;
quit;
%let last_var=%scan(&sort_list,-1);
%put &=sort_list;
%put &=last_var;

/* single threaded */
data casuser.dedup /single=yes;
  set casuser.testdata;
  by &sort_list;
  if first.&last_var;
run;

/* multi threaded */
data casuser.dedup;
  set casuser.testdata;
  by &sort_list;
  if first.&last_var;
run;

cas &sessref terminate;

%let sessref=MySess; %if %sysfunc(sessfound(&sessref)) %then %do; cas mySess terminate; %end; cas &sessref cassessopts=(caslib="casuser"); libname casuser cas; data casuser.testdata(drop=i j); array vars var1-var5; do i=1 to 10e6; do j=1 to dim(vars); vars[j]=rand('integer', 1, 10); end; output; end; run; /* option 1: CAS action */ proc sql noprint; select cats('{name="',name,'"}') into :var_list separated by ',' from dictionary.columns where libname='CASUSER' and memname='TESTDATA' ; quit; %put &=var_list; proc cas; deduplication.deduplicate / table={ caslib="casuser", name="testdata", groupBy={&var_list} }, casOut={caslib="casuser", name="dedup" replace=true}, noDuplicateKeys=true ; quit; /* option 2: Proc Sort */ proc sort data=casuser.testdata out=casuser.dedup nodupkey; by _all_; run; /* option 3: CAS data step */ proc sql noprint; select name into :sort_list separated by ' ' from dictionary.columns where libname='CASUSER' and memname='TESTDATA' ; quit; %let last_var=%scan(&sort_list,-1); %put &=sort_list; %put &=last_var; /* single threaded */ data casuser.dedup /single=yes; set casuser.testdata; by &sort_list; if first.&last_var; run; /* multi threaded */ data casuser.dedup; set casuser.testdata; by &sort_list; if first.&last_var; run; cas &sessref terminate;

And here the SAS log with the timings

Spoiler

88   data casuser.testdata(drop=i j);
89      array vars var1-var5;
90      do i=1 to 10e6;
91         do j=1 to dim(vars);
92            vars[j]=rand('integer', 1, 10);
93         end;
94         output;
95      end;
96   run;
NOTE: Running DATA step in Cloud Analytic Services.
NOTE: The DATA step has no input data set and will run in a single thread.
NOTE: The table testdata in caslib CASUSER(***) has 10000000 observations and 5 variables.
NOTE: DATA statement used (Total process time):
      real time           3.75 seconds
      cpu time            0.04 seconds
      
97   
98   /* option 1: CAS action */
99   proc sql noprint;
100    select cats('{name="',name,'"}') into :var_list separated by ','
101    from dictionary.columns
102    where libname='CASUSER' and memname='TESTDATA'
103    ;
104  quit;
NOTE: PROCEDURE SQL used (Total process time):
      real time           0.09 seconds
      cpu time            0.00 seconds
      
105  %put &=var_list;
VAR_LIST={name="var1"},{name="var2"},{name="var3"},{name="var4"},{name="var5"}
106  
107  proc cas;
108    deduplication.deduplicate /
109      table={ caslib="casuser", name="testdata",
110              groupBy={&var_list}
111            },
112      casOut={caslib="casuser", name="dedup" replace=true},
113      noDuplicateKeys=true
114      ;
115  
116  quit;
NOTE: Active Session now MYSESS.
NOTE: Added action set 'deduplication'.
NOTE: There were 10000000 rows read from the table TESTDATA.
NOTE: The table dedup has 100000 rows and 5 columns.
NOTE: PROCEDURE CAS used (Total process time):
      real time           3.63 seconds
      cpu time            0.02 seconds
      
117  
118  /* option 2: Proc Sort */
119  proc sort data=casuser.testdata out=casuser.dedup nodupkey;
120    by _all_;
121  run;
NOTE: The data set CASUSER.DEDUP has 100000 observations and 5 variables.
NOTE: Processing was performed in CAS.
NOTE: PROCEDURE SORT used (Total process time):
      real time           3.88 seconds
      cpu time            0.05 seconds
      
122  
123  /* option 3: CAS data step */
124  proc sql noprint;
125    select name
126      into :sort_list separated by ' '
127    from dictionary.columns
128    where libname='CASUSER' and memname='TESTDATA'
129    ;
130  quit;
NOTE: PROCEDURE SQL used (Total process time):
      real time           0.10 seconds
      cpu time            0.00 seconds
      
131  %let last_var=%scan(&sort_list,-1);
132  %put &=sort_list;
SORT_LIST=var1 var2 var3 var4 var5
133  %put &=last_var;
LAST_VAR=var5
134  
135  /* single threaded */
136  data casuser.dedup /single=yes;
137    set casuser.testdata;
138    by &sort_list;
139    if first.&last_var;
140  run;
NOTE: Running DATA step in Cloud Analytic Services.
NOTE: There were 10000000 observations read from the table TESTDATA in caslib CASUSER(***).
NOTE: The table dedup in caslib CASUSER(***) has 100000 observations and 5 variables.
NOTE: DATA statement used (Total process time):
      real time           1:12.60
      cpu time            0.14 seconds
      
141  
142  /* multi threaded */
143  data casuser.dedup;
144    set casuser.testdata;
145    by &sort_list;
146    if first.&last_var;
147  run;
NOTE: Running DATA step in Cloud Analytic Services.
NOTE: The DATA step will run in multiple threads.
NOTE: There were 10000000 observations read from the table TESTDATA in caslib CASUSER(***).
NOTE: The table dedup in caslib CASUSER(***) has 100000 observations and 5 variables.
NOTE: DATA statement used (Total process time):
      real time           5.14 seconds
      cpu time            0.03 seconds
      
148  
149  cas &sessref terminate;
NOTE: Libref CASUSER has been deassigned.
NOTE: Deletion of the session MYSESS was successful.
NOTE: The default CAS session MYSESS identified by SAS option SESSREF= was terminated. Use the OPTIONS statement to set the 
      SESSREF= option to an active session.
NOTE: Request to TERMINATE completed for session MYSESS.

88 data casuser.testdata(drop=i j); 89 array vars var1-var5; 90 do i=1 to 10e6; 91 do j=1 to dim(vars); 92 vars[j]=rand('integer', 1, 10); 93 end; 94 output; 95 end; 96 run; NOTE: Running DATA step in Cloud Analytic Services. NOTE: The DATA step has no input data set and will run in a single thread. NOTE: The table testdata in caslib CASUSER(***) has 10000000 observations and 5 variables. NOTE: DATA statement used (Total process time): real time 3.75 seconds cpu time 0.04 seconds 97 98 /* option 1: CAS action */ 99 proc sql noprint; 100 select cats('{name="',name,'"}') into :var_list separated by ',' 101 from dictionary.columns 102 where libname='CASUSER' and memname='TESTDATA' 103 ; 104 quit; NOTE: PROCEDURE SQL used (Total process time): real time 0.09 seconds cpu time 0.00 seconds 105 %put &=var_list; VAR_LIST={name="var1"},{name="var2"},{name="var3"},{name="var4"},{name="var5"} 106 107 proc cas; 108 deduplication.deduplicate / 109 table={ caslib="casuser", name="testdata", 110 groupBy={&var_list} 111 }, 112 casOut={caslib="casuser", name="dedup" replace=true}, 113 noDuplicateKeys=true 114 ; 115 116 quit; NOTE: Active Session now MYSESS. NOTE: Added action set 'deduplication'. NOTE: There were 10000000 rows read from the table TESTDATA. NOTE: The table dedup has 100000 rows and 5 columns. NOTE: PROCEDURE CAS used (Total process time): real time 3.63 seconds cpu time 0.02 seconds 117 118 /* option 2: Proc Sort */ 119 proc sort data=casuser.testdata out=casuser.dedup nodupkey; 120 by _all_; 121 run; NOTE: The data set CASUSER.DEDUP has 100000 observations and 5 variables. NOTE: Processing was performed in CAS. NOTE: PROCEDURE SORT used (Total process time): real time 3.88 seconds cpu time 0.05 seconds 122 123 /* option 3: CAS data step */ 124 proc sql noprint; 125 select name 126 into :sort_list separated by ' ' 127 from dictionary.columns 128 where libname='CASUSER' and memname='TESTDATA' 129 ; 130 quit; NOTE: PROCEDURE SQL used (Total process time): real time 0.10 seconds cpu time 0.00 seconds 131 %let last_var=%scan(&sort_list,-1); 132 %put &=sort_list; SORT_LIST=var1 var2 var3 var4 var5 133 %put &=last_var; LAST_VAR=var5 134 135 /* single threaded */ 136 data casuser.dedup /single=yes; 137 set casuser.testdata; 138 by &sort_list; 139 if first.&last_var; 140 run; NOTE: Running DATA step in Cloud Analytic Services. NOTE: There were 10000000 observations read from the table TESTDATA in caslib CASUSER(***). NOTE: The table dedup in caslib CASUSER(***) has 100000 observations and 5 variables. NOTE: DATA statement used (Total process time): real time 1:12.60 cpu time 0.14 seconds 141 142 /* multi threaded */ 143 data casuser.dedup; 144 set casuser.testdata; 145 by &sort_list; 146 if first.&last_var; 147 run; NOTE: Running DATA step in Cloud Analytic Services. NOTE: The DATA step will run in multiple threads. NOTE: There were 10000000 observations read from the table TESTDATA in caslib CASUSER(***). NOTE: The table dedup in caslib CASUSER(***) has 100000 observations and 5 variables. NOTE: DATA statement used (Total process time): real time 5.14 seconds cpu time 0.03 seconds 148 149 cas &sessref terminate; NOTE: Libref CASUSER has been deassigned. NOTE: Deletion of the session MYSESS was successful. NOTE: The default CAS session MYSESS identified by SAS option SESSREF= was terminated. Use the OPTIONS statement to set the SESSREF= option to an active session. NOTE: Request to TERMINATE completed for session MYSESS.

As a side note:

When using a hash table in a CAS data step then there will be a separate hash table per thread as soon as you use a hash method that requires write access to the hash. If you only use read methods then the hash will get created only once.

Patrick · Posted 04-04-2024 01:25 AM

LOL! And after all of my testing as share in my previous post I was wondering why Proc Sort performs that well and found: SAS Cloud Analytic Services Processing for PROC SORT

Turns out that the "usual" Proc Sort Nodupkey syntax already fully executes in CAS.

NOTE: Processing was performed in CAS.

99   /* option 2: Proc Sort */
100  proc sort data=casuser.testdata out=casuser.dedup nodupkey;
101    by _all_;
102  run;
NOTE: The data set CASUSER.DEDUP has 100000 observations and 5 variables.
NOTE: Processing was performed in CAS.

Quentin · Posted 04-04-2024 08:29 AM

Thanks for all those examples @Patrick ! Quick follow-up question.

I'm confused how these steps, which read un-sorted data, can work with the BY statement (which of course on SAS 9 would require data to be sorted by the BY variables).

/* single threaded */
data casuser.dedup /single=yes;
  set casuser.testdata;
  by &sort_list;
  if first.&last_var;
run;

/* multi threaded */
data casuser.dedup;
  set casuser.testdata;
  by &sort_list;
  if first.&last_var;
run;

Is CAS just sorting in the background when you use a BY statement, or is something more interesting going on?

The Boston Area SAS Users Group (BASUG) is hosting our in person SAS Blowout on Oct 18!
This full-day event in Cambridge, Mass features four presenters from SAS, presenting on a range of SAS 9 programming topics. Pre-registration by Oct 15 is required.
Full details and registration info at https://www.basug.org/events.

Mazi · Posted 04-04-2024 09:16 AM

When executing in CAS data belonging to the same by group are automatically sent to the same node, therefore sorting is not required. If you have more nested levels, that is sorted within that node.

Quentin · Posted 04-04-2024 11:39 AM

Thanks @Mazi , I was wondering what would happen when there are more by-groups than nodes. That's cool that each node would wait to get all of its data and then sort its data when needed, before iterating through it.

The Boston Area SAS Users Group (BASUG) is hosting our in person SAS Blowout on Oct 18!
This full-day event in Cambridge, Mass features four presenters from SAS, presenting on a range of SAS 9 programming topics. Pre-registration by Oct 15 is required.
Full details and registration info at https://www.basug.org/events.

Patrick · Posted 04-04-2024 11:51 AM

It’s per thread (cpu) and not per node.
If there are more by groups than cpu’s then they have to wait until a cpu becomes available. That’s what the controller node manages afaik

Quentin · Posted 04-04-2024 12:35 PM

So if you write a data step with a BY statement and the input dataset is an unsorted CAS table, then the controller node needs to look at the full set of BY values and sort (or index) it into BY groups before sending each BY group off to a CPU, is that it? So there is basically an implicit sort or indexing happening. Where the controller node needs to find all of the rows in a non-contiguous by group and then send them off to a single CPU?

It's almost like in CAS, the DATA step BY statement is able to do grouping similar to what the CLASS statement can do in a SAS 9 PROC.

The Boston Area SAS Users Group (BASUG) is hosting our in person SAS Blowout on Oct 18!
This full-day event in Cambridge, Mass features four presenters from SAS, presenting on a range of SAS 9 programming topics. Pre-registration by Oct 15 is required.
Full details and registration info at https://www.basug.org/events.

Patrick · Posted 04-04-2024 09:44 PM

@Quentin There are things I know and understand and the rest is filling gaps with theories which is why doing all this testing with coding options is so valuable for me.

To not further hijack the OPs question I suggest you ask a new question where we can have further discussions and testing.

Mazi · Posted 04-04-2024 02:24 PM

Apologies, I got those mixed up.

Mazi · Posted 04-04-2024 09:02 AM

@alisio_meneses , Have you tried the deduplicate action in PROC CAS ?

data example;
	set sashelp.class;
	output;
	output;
run;

proc casutil;
	droptable casdata="example" incaslib="casuser" quiet;
	load data=work.example
		  outcaslib="casuser"
		  casout="example"
		  replace;
run;

proc cas;
	deduplication.deduplicate / table = {caslib="casuser", name="example", groupBy={"name"}}
								noDuplicateKeys=true
								casOut={name="deduplicated", caslib="casuser", replace=true};
quit;

alisio_meneses · Posted 04-05-2024 01:15 PM

Hi there @Mazi, thank you for responding.

I tried using proc cas and and also proc sort (nodupkey) with similar results. they both worked fine. thank you!

SAS Innovate 2025: Call for Content

Classroom Training Available!