This is Part 3 in a series of articles about common data manipulation tasks. Part 1 focused on appending data, and Part 2 focused on sorting data. Now we’ll focus on de-duplicating data.
CAS offers multiple options to de-duplicate records. The first recommendation is to use one of those operations instead of pulling the data on the client and using PROC SORT to de-duplicate.
What are they?
The following examples correspond to de-duplicating full records (all the variables are part of the key).
proc fedsql sessref=mysession _method ;
create table dm.bigprdsale_dedup_fed_distinct{options replication=0 replace=true} as
select distinct actual, predict, country, region, division, prodtype, product, quarter, year, month
from dm.bigprdsale ;
quit ;
proc fedsql sessref=mysession _method ;
create table dm.bigprdsale_dedup_fed_grpby{options replication=0 replace=true} as
select * from dm.bigprdsale
group by actual, predict, country, region, division, prodtype, product, quarter, year, month ;
quit ;
data casdm.bigprdsale_dedup_ds(copies=0) ;
set casdm.bigprdsale ;
by _all_ ;
if first.month ;
run ;
proc cas;
simple.groupBy result=r status=rc /
inputs={"actual","predict","country","region","division","prodtype","product","quarter","year","month"}
table={name="bigprdsale" caslib="dm"}
casout={caslib="dm" name="bigprdsale_dedup_action_gb" replace=true replication=0} ;
run ;
quit ;
Tip: omit the inputs parameter to de-duplicate on the whole record (all variables) without having to list all the variables.
proc cas ;
simple.groupByInfo /
table={caslib="dm",name="bigprdsale",
groupBy={"actual","predict","country","region","division","prodtype","product","quarter","year","month"}}
casOut={caslib="dm",name="bigprdsale_dedup_action_gbinfo",replace=true,replication=0}
includeDuplicates=false
groupbylimit=20000
details=true ;
run ;
quit ;
They all provide the same results with very varying run times. In my case on my data (14 million rows, 14,000 distinct rows), groupBy performed extremely well compared to the others. But the other options might run faster in other conditions, depending on the data size, the number of duplicate records, the architecture, the memory, etc.
Technique | Time to run |
FedSQL SELECT DISTINCT | 43.22 seconds |
GroupByInfo CAS action | 42.51 seconds |
DATA Step BY … if first./last. | 33.56 seconds |
FedSQL SELECT … GROUP BY | 23.51 seconds |
GroupBy CAS action | 1.48 seconds |
Some techniques are known to be slow. For example, in Viya 3.4, FedSQL SELECT DISTINCT is single-threaded, runs on only one CAS node and requires the data to be moved on this CAS node. Also, groupByInfo is known to perform better than FedSQL SELECT… GROUP BY when there are less than around 10,000 groups and worse when there are more. So, you need to know your data before choosing the right technique.
But, R&D is working hard on improving things. FedSQL SELECT DISTINCT will be multi-threaded in the next Viya version (3.5). GroupByInfo has also been rewritten in Viya 3.5 using a significantly different implementation. It will probably run as fast as groupBy in the future but with many more features than groupBy.
So, things are moving, and they are moving fast. Tests are needed in order to find the right operation for the right scenario. Don’t hesitate to experiment with different techniques to see which one fits your situation, your architecture, your data size, etc.
If you need to de-duplicate data on fewer variables but want to keep all variables in the output (typical data quality scenario where you want to de-duplicate data on common matchcodes and keep other original variables), then you have fewer options. Only the DATA Step BY … if first./last., the groupByInfo CAS action, and possibly FedSQL (using aggregation statistics on category variables) will help in that case.
Thanks for reading and stay tuned for the last article on aggregating data.
Hi Nicolas,
...I learned something today, thank you for that. I keep thinking of how could you achieve 2 seconds for de-dup using CAS action it seems to being way too far from the rest...I played today with my testing data set which has 300 variables, 132K observations, and ~5 GB size. And the times are as follows:
FED SQL Distinct: 47 sec // This is a single-threaded operation by design
DS Multi-thread: 38 sec // I believe this is functionally incorrect as de-dup is applied within partitions and not on a whole data set
DS Single-thread: 2 min 39 sec. // Much slower but functionally correct result as de-dup applied on the consolidated data set
CASL: 45 sec // Need to check whether this is functionally the same as FED SQL or DS Multi-threaded?
So my takeaway is the CASL is not that far from the rest as in your case, wondering how you achieved such a result...
Thank you,
Juraj
Hi Nicolas,
Great blog!
Just want to clarify that the 3rd option: DATA Step BY … the last variable of the table should be used for the first./last. option to have the same result as the select distinct.
Ex:
VarX VarY
1 2
1 1
2 3
2 4
1 2
data public.test_dedup_ds(copies=0) ;
set public.test ;
by _all_ ;
if last.varY;
run ;
Results:
(a) DATA Step BY … if first./last.
VarX VarY
1 1
1 2
2 3
2 4
(b) Select distinct *
VarX VarY
1 2
1 1
2 3
2 4
For Viya 3.5, the PROC SORT NODUPKEY is now supported with source and target being CAS tables.
Hi Mark,
Yes, that's right, in my example, month is the last variable of my data set.
And, yes, there's a new CAS action to perform deduplication and this is used behind the scenes by the PROC SORT on CAS. Thanks for bringing this to the discussion.
Nicolas.
I have a question about the simple.groupby. This has helped me out quite a bit recently but was wondering if there is anyway to drop the _f columns in the output. I tried to find something in the documentation but have been unsucessful.
@sarakatz : I checked with Tech Support on this and they say _f columns will always be there hence you need to do the clean up of _f columns after running casl groupby. I am either using Data Step or PROC SORT (Viya 3.5).
@jlongauer: Thank you for the response. I also posted this question internally on Yammer where someone suggested that an alterTable action could be used right afterward to drop the _f columns. Just thought I would share.
As an update: Running a de-dup on a current Viya version (SYSVLONG4 V.04.00M0P01162024, MPP 4 nodes with 192 cpu's in total) it looks like one can now just use Proc Sort Nodupkey which uses then cas action deduplication.deduplicate.
80 proc sort data=casuser.testdata out=casuser.dedup nodupkey; 81 by _all_; 82 run; NOTE: The data set CASUSER.DEDUP has 100000 observations and 5 variables. NOTE: Processing was performed in CAS. NOTE: PROCEDURE SORT used (Total process time): real time 2.00 seconds cpu time 0.04 seconds
Interestingly simple.groupBy still outperformed deduplication.deduplicate even though it creates these extra (and unwanted) variables.
80 proc sql noprint; 81 select cats('{name="',name,'"}') into :var_list separated by ',' 82 from dictionary.columns 83 where libname='CASUSER' and memname='TESTDATA' 84 ; 85 quit; NOTE: PROCEDURE SQL used (Total process time): real time 0.11 seconds cpu time 0.00 seconds 86 %put &=var_list; VAR_LIST={name="var1"},{name="var2"},{name="var3"},{name="var4"},{name="var5"} 87 88 proc cas; 89 deduplication.deduplicate / 90 table={ caslib="casuser", name="testdata", 91 groupBy={&var_list} 92 }, 93 casOut={caslib="casuser", name="dedup" replace=true}, 94 noDuplicateKeys=true 95 ; 96 97 quit; NOTE: Active Session now MYSESS. NOTE: There were 10000000 rows read from the table TESTDATA. NOTE: The table dedup has 100000 rows and 5 columns. NOTE: PROCEDURE CAS used (Total process time): real time 1.99 seconds cpu time 0.04 seconds 98 99 proc cas; 100 simple.groupBy result=r status=rc/ 101 table={caslib="casuser" name="testdata"}, 102 casout={caslib="casuser" name="dedup" replace=true} ; 103 run ; NOTE: Active Session now MYSESS. 104 quit; NOTE: PROCEDURE CAS used (Total process time): real time 0.94 seconds cpu time 0.02 seconds
And as another update:
When running the same script against source data with the same number of rows and variables but less duplicates then the elapsed run-time for deduplication.deduplicate remains the same but the run-time for simple.groupBy deteriorates significantly.
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Data Literacy is for all, even absolute beginners. Jump on board with this free e-learning and boost your career prospects.