Hello,
I have a SAS data set with 15 columns and 20 million rows.
I want to use sort+nodupkey in order to delete duplicate rows.
I run this short query and it is running long time (More than 20 minutes) and still didn't finish.
Any idea what to do in order to speed up the query?
proc sort data=r_r.My_tbl (Where=(Event_Id ne .)) out=Want nodupkey dupout=x;
by Event_Id;
run;
Hello,
BR, Koen
Where is this being run - on a SAS server or on a PC? Try runs with and without NODUPKEY. What are the times for each run?
For single processes like sorts, the easiest way to speed processing a lot is to run more than one sort at the same time on different sets of rows. Running two parallel sorts could halve processing time, but you have to join the two datasets together after sorting.
@SASKiwi wrote:
Where is this being run - on a SAS server or on a PC? Try runs with and without NODUPKEY. What are the times for each run?
For single processes like sorts, the easiest way to speed processing a lot is to run more than one sort at the same time on different sets of rows. Running two parallel sorts could halve processing time, but you have to join the two datasets together after sorting.
Wouldn't it suffice to ensure that the sort executes threaded to achieve similar parallelism with much less coding?
@Patrick - Good question. Have you done any testing with the THREADS option on PROC SORT - if so what has been the performance improvement?
Can you please show code exmaple?
Maxim 1: Read the Documentation.
PROC SORT Statement provides the relevant options.
@Ronein - Threaded sorting may already be switched on by default along with how many CPUs this is spread across. You can check if it is by running this:
proc options option = (threads cpucount);
run;
If your results are similar then increasing the CPUCOUNT may not improve performance and should not be set higher than the number of CPUs in your SAS server.
Maxim 3: Know Your Data.
What are the attributes of the 15 columns?
Do you have long character variables, mostly filled with blanks?
Is the dataset stored with a COMPRESS option?
You need to provide more information.
You can sort a 20 million observation dataset with 15 variables in very little time:
304 proc sort nodupkey; 305 by event_id; 306 run; NOTE: There were 20000000 observations read from the data set WORK.HAVE. NOTE: 46659 observations with duplicate key values were deleted. NOTE: The data set WORK.HAVE has 19953341 observations and 15 variables. NOTE: PROCEDURE SORT used (Total process time): real time 13.99 seconds cpu time 15.29 seconds
/*Add these two option in PROC SORT*/
proc sort data=sashelp.class out=want nodupkey dupout=x sortsize=60G noequals ;
by age;
run;
Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9. Sign up by March 14 for just $795.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.