BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.

Hello there.

 

I would like to know how to perform key deduplication of a CAS table more efficiently than using proc sort. According to this document it is possible to perform this task using hash tables, but the doc uses non-CAS tables.

 

Heres a sample code using non CAS tables:

data testdata(drop=i j);
   array vars var1-var5;
   do i=1 to 10e6;
      do j=1 to dim(vars);
         vars[j]=rand('integer', 1, 10);
      end;
      output;
   end;
run;

proc sort data=testdata out=test1 nodupkey;
   by _ALL_;
run;

data test2;
   if _N_ = 1 then do;
      declare hash h(hashexp:20);
      h.defineKey('var1', 'var2', 'var3', 'var4', 'var5');
      h.defineDone();
   end;
 
   set testdata;
 
   if h.check() ne 0 then do;
      output;
      h.add();
   end;
run;

When trying to use the same code with CAS, I noticed that the resulting table still contains duplicate records, and I assume this is related to parallel execution on different workers, but I'm not sure.

 

I would appreciate the community's help: is it possible to perform record deduplication of a CAS table using hash tables? If so, how?

 

Environment information: SAS Viya 3.5 with 4 CAS workers running on linux

 

Thank you!

1 ACCEPTED SOLUTION

Accepted Solutions
Mazi
Pyrite | Level 9

@alisio_meneses , Have you tried the deduplicate action in PROC CAS ?

data example;
	set sashelp.class;
	output;
	output;
run;

proc casutil;
	droptable casdata="example" incaslib="casuser" quiet;
	load data=work.example
		  outcaslib="casuser"
		  casout="example"
		  replace;
run;

proc cas;
	deduplication.deduplicate / table = {caslib="casuser", name="example", groupBy={"name"}}
								noDuplicateKeys=true
								casOut={name="deduplicated", caslib="casuser", replace=true};
quit;

View solution in original post

14 REPLIES 14
Quentin
Super User

I don't have viya , but if you do:

data test2 /single=yes ;

I think that should force the step to run in a single thread, as in this example: https://documentation.sas.com/doc/en/pgmsascdc/9.4_3.5/casdspgm/p10ux48pz7qknzn1ux1bs48vplnx.htm#n05... .

BASUG is hosting free webinars Next up: Don Henderson presenting on using hash functions (not hash tables!) to segment data on June 12. Register now at the Boston Area SAS Users Group event page: https://www.basug.org/events.
alisio_meneses
Quartz | Level 8

Hi! Thanks for replying. I think you may be right, and it might work. But if it should work, doesn't it limit the usage of CAS resources (like massive parallel processing), thus making it possibly slower than using PROC SORT. I'll have a go anyway.

Quentin
Super User

Yes, probably.  : )

 

My understanding is if you want to do things that rely on order of the data being processed (e.g. retain statement, lag, etc), you need to run it single threaded, to avoid different threads getting different chunks of data.  But again, that's just based on reading the docs a bit and chatting with folks, I haven't tested this.  If you turn on that option, does it work?

Generally, I wouldn't try to beat PROC SORT.  In my experience most SAS PROCS are optimized better than I could do. : )

 

Maybe try PROC CAS for de-duping?  https://documentation.sas.com/doc/en/pgmsascdc/9.4_3.5/caspg/n1gx5u07e5tqivn1n680voj4ozff.htm

 

BASUG is hosting free webinars Next up: Don Henderson presenting on using hash functions (not hash tables!) to segment data on June 12. Register now at the Boston Area SAS Users Group event page: https://www.basug.org/events.
Patrick
Opal | Level 21

Below multiple options that all return a deduped table. The already proposed CAS action deduplication.deduplicate performed best closely followed by Proc Sort. I assume timings for Proc Sort could be quite different if data volumes are different (like a long string variable).   ...turns out Proc Sort executes fully in CAS and though the only difference in time compared to the deduplicate action must be caused by the overhead "mapping" the Proc Sort syntax to the CAS action.

 

Also the data step works both single and multi threaded but especially single threaded is much much slower on my 4 nodes 192 threads environment.

 

CAS by group processing: It's also somehow in the documentation but the clearest statement I could find is as below from forum paper Accelerate DATA Step BY-Group Processing in SAS® Viya®

PARALLEL PROCESSING
When a DATA step runs in CAS, BY groups are assigned to threads. Many BY groups can be
assigned to a single thread. However, a BY group cannot span threads.

What this means is you can use a data step /if first construct to dedup your data because all the data belonging to a by group will get processed by a single thread. The downside of this in a MPP environment is of course that there is potentially a lot of data movement between nodes. 

 

Below working sample code executed using a recent Viya 4 environment but based on documentation all or at least most of the options should work for any Viya version.

 

Spoiler
%let sessref=MySess;
%if %sysfunc(sessfound(&sessref)) %then
  %do;
    cas mySess terminate;
  %end;
cas &sessref cassessopts=(caslib="casuser");
libname casuser cas;

data casuser.testdata(drop=i j);
   array vars var1-var5;
   do i=1 to 10e6;
      do j=1 to dim(vars);
         vars[j]=rand('integer', 1, 10);
      end;
      output;
   end;
run;

/* option 1: CAS action */
proc sql noprint;
  select cats('{name="',name,'"}') into :var_list separated by ','
  from dictionary.columns
  where libname='CASUSER' and memname='TESTDATA'
  ;
quit;
%put &=var_list;

proc cas;
  deduplication.deduplicate /
    table={ caslib="casuser", name="testdata",      
            groupBy={&var_list}
          },                         
    casOut={caslib="casuser", name="dedup" replace=true}, 
    noDuplicateKeys=true
    ;           
  
quit;

/* option 2: Proc Sort */
proc sort data=casuser.testdata out=casuser.dedup nodupkey;
  by _all_;
run;

/* option 3: CAS data step */
proc sql noprint;
  select name 
    into :sort_list separated by ' '
  from dictionary.columns
  where libname='CASUSER' and memname='TESTDATA'
  ;
quit;
%let last_var=%scan(&sort_list,-1);
%put &=sort_list;
%put &=last_var;

/* single threaded */
data casuser.dedup /single=yes;
  set casuser.testdata;
  by &sort_list;
  if first.&last_var;
run;

/* multi threaded */
data casuser.dedup;
  set casuser.testdata;
  by &sort_list;
  if first.&last_var;
run;

cas &sessref terminate;

And here the SAS log with the timings

 

Spoiler
88   data casuser.testdata(drop=i j);
89      array vars var1-var5;
90      do i=1 to 10e6;
91         do j=1 to dim(vars);
92            vars[j]=rand('integer', 1, 10);
93         end;
94         output;
95      end;
96   run;
NOTE: Running DATA step in Cloud Analytic Services.
NOTE: The DATA step has no input data set and will run in a single thread.
NOTE: The table testdata in caslib CASUSER(***) has 10000000 observations and 5 variables.
NOTE: DATA statement used (Total process time):
      real time           3.75 seconds
      cpu time            0.04 seconds
      
97   
98   /* option 1: CAS action */
99   proc sql noprint;
100    select cats('{name="',name,'"}') into :var_list separated by ','
101    from dictionary.columns
102    where libname='CASUSER' and memname='TESTDATA'
103    ;
104  quit;
NOTE: PROCEDURE SQL used (Total process time):
      real time           0.09 seconds
      cpu time            0.00 seconds
      
105  %put &=var_list;
VAR_LIST={name="var1"},{name="var2"},{name="var3"},{name="var4"},{name="var5"}
106  
107  proc cas;
108    deduplication.deduplicate /
109      table={ caslib="casuser", name="testdata",
110              groupBy={&var_list}
111            },
112      casOut={caslib="casuser", name="dedup" replace=true},
113      noDuplicateKeys=true
114      ;
115  
116  quit;
NOTE: Active Session now MYSESS.
NOTE: Added action set 'deduplication'.
NOTE: There were 10000000 rows read from the table TESTDATA.
NOTE: The table dedup has 100000 rows and 5 columns.
NOTE: PROCEDURE CAS used (Total process time):
      real time           3.63 seconds
      cpu time            0.02 seconds
      
117  
118  /* option 2: Proc Sort */
119  proc sort data=casuser.testdata out=casuser.dedup nodupkey;
120    by _all_;
121  run;
NOTE: The data set CASUSER.DEDUP has 100000 observations and 5 variables.
NOTE: Processing was performed in CAS.
NOTE: PROCEDURE SORT used (Total process time):
      real time           3.88 seconds
      cpu time            0.05 seconds
      
122  
123  /* option 3: CAS data step */
124  proc sql noprint;
125    select name
126      into :sort_list separated by ' '
127    from dictionary.columns
128    where libname='CASUSER' and memname='TESTDATA'
129    ;
130  quit;
NOTE: PROCEDURE SQL used (Total process time):
      real time           0.10 seconds
      cpu time            0.00 seconds
      
131  %let last_var=%scan(&sort_list,-1);
132  %put &=sort_list;
SORT_LIST=var1 var2 var3 var4 var5
133  %put &=last_var;
LAST_VAR=var5
134  
135  /* single threaded */
136  data casuser.dedup /single=yes;
137    set casuser.testdata;
138    by &sort_list;
139    if first.&last_var;
140  run;
NOTE: Running DATA step in Cloud Analytic Services.
NOTE: There were 10000000 observations read from the table TESTDATA in caslib CASUSER(***).
NOTE: The table dedup in caslib CASUSER(***) has 100000 observations and 5 variables.
NOTE: DATA statement used (Total process time):
      real time           1:12.60
      cpu time            0.14 seconds
      
141  
142  /* multi threaded */
143  data casuser.dedup;
144    set casuser.testdata;
145    by &sort_list;
146    if first.&last_var;
147  run;
NOTE: Running DATA step in Cloud Analytic Services.
NOTE: The DATA step will run in multiple threads.
NOTE: There were 10000000 observations read from the table TESTDATA in caslib CASUSER(***).
NOTE: The table dedup in caslib CASUSER(***) has 100000 observations and 5 variables.
NOTE: DATA statement used (Total process time):
      real time           5.14 seconds
      cpu time            0.03 seconds
      
148  
149  cas &sessref terminate;
NOTE: Libref CASUSER has been deassigned.
NOTE: Deletion of the session MYSESS was successful.
NOTE: The default CAS session MYSESS identified by SAS option SESSREF= was terminated. Use the OPTIONS statement to set the 
      SESSREF= option to an active session.
NOTE: Request to TERMINATE completed for session MYSESS.

 

As a side note:

When using a hash table in a CAS data step then there will be a separate hash table per thread as soon as you use a hash method that requires write access to the hash. If you only use read methods then the hash will get created only once.

 

Patrick
Opal | Level 21

LOL! And after all of my testing as share in my previous post I was wondering why Proc Sort performs that well and found: SAS Cloud Analytic Services Processing for PROC SORT

Turns out that the "usual" Proc Sort Nodupkey syntax already fully executes in CAS.

 

NOTE: Processing was performed in CAS.

99   /* option 2: Proc Sort */
100  proc sort data=casuser.testdata out=casuser.dedup nodupkey;
101    by _all_;
102  run;
NOTE: The data set CASUSER.DEDUP has 100000 observations and 5 variables.
NOTE: Processing was performed in CAS.

 

Quentin
Super User

Thanks for all those examples @Patrick !  Quick follow-up question. 

 

I'm confused how these steps, which read un-sorted data, can work with the BY statement (which of course on SAS 9 would require data to be sorted by the BY variables).

 

/* single threaded */
data casuser.dedup /single=yes;
  set casuser.testdata;
  by &sort_list;
  if first.&last_var;
run;

/* multi threaded */
data casuser.dedup;
  set casuser.testdata;
  by &sort_list;
  if first.&last_var;
run;

Is CAS just sorting in the background when you use a BY statement, or is something more interesting going on?

BASUG is hosting free webinars Next up: Don Henderson presenting on using hash functions (not hash tables!) to segment data on June 12. Register now at the Boston Area SAS Users Group event page: https://www.basug.org/events.
Mazi
Pyrite | Level 9

When executing in CAS data belonging to the same by group are automatically sent to the same node, therefore sorting is not required. If you have more nested levels, that is sorted within that node.  

Quentin
Super User

Thanks @Mazi , I was wondering what would happen when there are more by-groups than nodes.  That's cool that each node would wait to get all of its data and then sort its data when needed, before iterating through it.

BASUG is hosting free webinars Next up: Don Henderson presenting on using hash functions (not hash tables!) to segment data on June 12. Register now at the Boston Area SAS Users Group event page: https://www.basug.org/events.
Patrick
Opal | Level 21
It’s per thread (cpu) and not per node.
If there are more by groups than cpu’s then they have to wait until a cpu becomes available. That’s what the controller node manages afaik
Quentin
Super User

So if you write a data step with a BY statement and the input dataset is an unsorted CAS table, then the controller node needs to look at the full set of BY values and sort (or index) it into BY groups before sending each BY group off to a CPU, is that it?  So there is basically an implicit sort or indexing happening.  Where the controller node needs to find all of the rows in a non-contiguous by group and then send them off to a single CPU?

 

It's almost like in CAS, the DATA step BY statement is able to do grouping similar to what the CLASS statement can do in a SAS 9 PROC.

 

 

 

 

BASUG is hosting free webinars Next up: Don Henderson presenting on using hash functions (not hash tables!) to segment data on June 12. Register now at the Boston Area SAS Users Group event page: https://www.basug.org/events.
Patrick
Opal | Level 21

@Quentin There are things I know and understand and the rest is filling gaps with theories which is why doing all this testing with coding options is so valuable for me.

To not further hijack the OPs question I suggest you ask a new question where we can have further discussions and testing.

Mazi
Pyrite | Level 9
Apologies, I got those mixed up.
Mazi
Pyrite | Level 9

@alisio_meneses , Have you tried the deduplicate action in PROC CAS ?

data example;
	set sashelp.class;
	output;
	output;
run;

proc casutil;
	droptable casdata="example" incaslib="casuser" quiet;
	load data=work.example
		  outcaslib="casuser"
		  casout="example"
		  replace;
run;

proc cas;
	deduplication.deduplicate / table = {caslib="casuser", name="example", groupBy={"name"}}
								noDuplicateKeys=true
								casOut={name="deduplicated", caslib="casuser", replace=true};
quit;
alisio_meneses
Quartz | Level 8
Hi there @Mazi, thank you for responding.

I tried using proc cas and and also proc sort (nodupkey) with similar results. they both worked fine. thank you!

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 14 replies
  • 1124 views
  • 11 likes
  • 4 in conversation