CAS answers to 4 common data manipulation tasks – Part 3 – DE-DUPLICATE

6 Likes

This is Part 3 in a series of articles about common data manipulation tasks. Part 1 focused on appending data, and Part 2 focused on sorting data. Now we’ll focus on de-duplicating data.

De-duplicating records

CAS offers multiple options to de-duplicate records. The first recommendation is to use one of those operations instead of pulling the data on the client and using PROC SORT to de-duplicate.

What are they?

The following examples correspond to de-duplicating full records (all the variables are part of the key).

FedSQL SELECT DISTINCT

proc fedsql sessref=mysession _method ;
   create table dm.bigprdsale_dedup_fed_distinct{options replication=0 replace=true} as
   select distinct actual, predict, country, region, division, prodtype, product, quarter, year, month 
   from dm.bigprdsale ;
quit ;

FedSQL SELECT … GROUP BY

proc fedsql sessref=mysession _method ;
   create table dm.bigprdsale_dedup_fed_grpby{options replication=0 replace=true} as
   select * from dm.bigprdsale 
   group by actual, predict, country, region, division, prodtype, product, quarter, year, month ;
quit ;

DATA Step BY … if first./last.

data casdm.bigprdsale_dedup_ds(copies=0) ;
   set casdm.bigprdsale ;
   by _all_ ;
   if first.month ;
run ;

GroupBy CAS action

proc cas;
   simple.groupBy result=r status=rc /
      inputs={"actual","predict","country","region","division","prodtype","product","quarter","year","month"}
      table={name="bigprdsale" caslib="dm"}
      casout={caslib="dm" name="bigprdsale_dedup_action_gb" replace=true replication=0} ;
run ;
quit ;

Tip: omit the inputs parameter to de-duplicate on the whole record (all variables) without having to list all the variables.

GroupByInfo CAS action

proc cas ;
   simple.groupByInfo /
      table={caslib="dm",name="bigprdsale",
         groupBy={"actual","predict","country","region","division","prodtype","product","quarter","year","month"}}
      casOut={caslib="dm",name="bigprdsale_dedup_action_gbinfo",replace=true,replication=0}
      includeDuplicates=false
      groupbylimit=20000
      details=true ;
run ;
quit ;

They all provide the same results with very varying run times. In my case on my data (14 million rows, 14,000 distinct rows), groupBy performed extremely well compared to the others. But the other options might run faster in other conditions, depending on the data size, the number of duplicate records, the architecture, the memory, etc.

Technique	Time to run
FedSQL SELECT DISTINCT	43.22 seconds
GroupByInfo CAS action	42.51 seconds
DATA Step BY … if first./last.	33.56 seconds
FedSQL SELECT … GROUP BY	23.51 seconds
GroupBy CAS action	1.48 seconds

Some techniques are known to be slow. For example, in Viya 3.4, FedSQL SELECT DISTINCT is single-threaded, runs on only one CAS node and requires the data to be moved on this CAS node. Also, groupByInfo is known to perform better than FedSQL SELECT… GROUP BY when there are less than around 10,000 groups and worse when there are more. So, you need to know your data before choosing the right technique.

But, R&D is working hard on improving things. FedSQL SELECT DISTINCT will be multi-threaded in the next Viya version (3.5). GroupByInfo has also been rewritten in Viya 3.5 using a significantly different implementation. It will probably run as fast as groupBy in the future but with many more features than groupBy.

So, things are moving, and they are moving fast. Tests are needed in order to find the right operation for the right scenario. Don’t hesitate to experiment with different techniques to see which one fits your situation, your architecture, your data size, etc.

If you need to de-duplicate data on fewer variables but want to keep all variables in the output (typical data quality scenario where you want to de-duplicate data on common matchcodes and keep other original variables), then you have fewer options. Only the DATA Step BY … if first./last., the groupByInfo CAS action, and possibly FedSQL (using aggregation statistics on category variables) will help in that case.

Takeaways

There are multiple ways to de-duplicate data in CAS
Depending on your data and your CAS architecture, they might run in very different times
Best practice: experiment with the groupBy, groupByInfo CAS actions, DATA Step or FedSQL GROUP BY operations to find the right technique that fits your case
Mistakes to avoid:
- Purposely pull the CAS table on the client (SPRE or SAS) and use PROC SORT to de-duplicate
- Use FedSQL SELECT DISTINCT in Viya 3.4 (single-threaded)

Thanks for reading and stay tuned for the last article on aggregating data.

jlongauer · ‎01-02-2020

Hi Nicolas,

...I learned something today, thank you for that. I keep thinking of how could you achieve 2 seconds for de-dup using CAS action it seems to being way too far from the rest...I played today with my testing data set which has 300 variables, 132K observations, and ~5 GB size. And the times are as follows:

FED SQL Distinct: 47 sec // This is a single-threaded operation by design

DS Multi-thread: 38 sec // I believe this is functionally incorrect as de-dup is applied within partitions and not on a whole data set

DS Single-thread: 2 min 39 sec. // Much slower but functionally correct result as de-dup applied on the consolidated data set

CASL: 45 sec // Need to check whether this is functionally the same as FED SQL or DS Multi-threaded?

So my takeaway is the CASL is not that far from the rest as in your case, wondering how you achieved such a result...

Thank you,

Juraj

MarkEscauriaga · ‎02-13-2020

Hi Nicolas,

Great blog!

Just want to clarify that the 3rd option: DATA Step BY … the last variable of the table should be used for the first./last. option to have the same result as the select distinct.

Ex:

VarX VarY

1 2

1 1

2 3

2 4

1 2

data public.test_dedup_ds(copies=0) ;

set public.test ;

by _all_ ;

if last.varY;

run ;

Results:

(a) DATA Step BY … if first./last.

VarX VarY

1 1

1 2

2 3

2 4

(b) Select distinct *

VarX VarY

1 2

1 1

2 3

2 4

For Viya 3.5, the PROC SORT NODUPKEY is now supported with source and target being CAS tables.

https://go.documentation.sas.com/?activeCdc=pgmsascdc&cdcId=sasstudiocdc&cdcVersion=5.2&docsetId=pro...

NicolasRobert · ‎02-14-2020

Hi Mark,

Yes, that's right, in my example, month is the last variable of my data set.

And, yes, there's a new CAS action to perform deduplication and this is used behind the scenes by the PROC SORT on CAS. Thanks for bringing this to the discussion.

Nicolas.

sarakatz · ‎10-02-2020

I have a question about the simple.groupby. This has helped me out quite a bit recently but was wondering if there is anyway to drop the _f columns in the output. I tried to find something in the documentation but have been unsucessful.

jlongauer · ‎10-06-2020

@sarakatz : I checked with Tech Support on this and they say _f columns will always be there hence you need to do the clean up of _f columns after running casl groupby. I am either using Data Step or PROC SORT (Viya 3.5).

sarakatz · ‎10-06-2020

@jlongauer: Thank you for the response. I also posted this question internally on Yammer where someone suggested that an alterTable action could be used right afterward to drop the _f columns. Just thought I would share.

Patrick · ‎04-04-2024

As an update: Running a de-dup on a current Viya version (SYSVLONG4 V.04.00M0P01162024, MPP 4 nodes with 192 cpu's in total) it looks like one can now just use Proc Sort Nodupkey which uses then cas action deduplication.deduplicate.

80   proc sort data=casuser.testdata out=casuser.dedup nodupkey;
81     by _all_;
82   run;
NOTE: The data set CASUSER.DEDUP has 100000 observations and 5 variables.
NOTE: Processing was performed in CAS.
NOTE: PROCEDURE SORT used (Total process time):
      real time           2.00 seconds
      cpu time            0.04 seconds

Interestingly simple.groupBy still outperformed deduplication.deduplicate even though it creates these extra (and unwanted) variables.

80   proc sql noprint;
81     select cats('{name="',name,'"}') into :var_list separated by ','
82     from dictionary.columns
83     where libname='CASUSER' and memname='TESTDATA'
84     ;
85   quit;
NOTE: PROCEDURE SQL used (Total process time):
      real time           0.11 seconds
      cpu time            0.00 seconds
      
86   %put &=var_list;
VAR_LIST={name="var1"},{name="var2"},{name="var3"},{name="var4"},{name="var5"}
87   
88   proc cas;
89     deduplication.deduplicate /
90       table={ caslib="casuser", name="testdata",
91               groupBy={&var_list}
92             },
93       casOut={caslib="casuser", name="dedup" replace=true},
94       noDuplicateKeys=true
95       ;
96   
97   quit;
NOTE: Active Session now MYSESS.
NOTE: There were 10000000 rows read from the table TESTDATA.
NOTE: The table dedup has 100000 rows and 5 columns.
NOTE: PROCEDURE CAS used (Total process time):
      real time           1.99 seconds
      cpu time            0.04 seconds
      
98   
99   proc cas;
100     simple.groupBy result=r status=rc/
101        table={caslib="casuser" name="testdata"},
102        casout={caslib="casuser" name="dedup" replace=true} ;
103    run ;
NOTE: Active Session now MYSESS.
104  quit;
NOTE: PROCEDURE CAS used (Total process time):
      real time           0.94 seconds
      cpu time            0.02 seconds

Patrick · ‎04-08-2024

And as another update:

When running the same script against source data with the same number of rows and variables but less duplicates then the elapsed run-time for deduplication.deduplicate remains the same but the run-time for simple.groupBy deteriorates significantly.