- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hello,
I have a SAS data set with 15 columns and 20 million rows.
I want to use sort+nodupkey in order to delete duplicate rows.
I run this short query and it is running long time (More than 20 minutes) and still didn't finish.
Any idea what to do in order to speed up the query?
proc sort data=r_r.My_tbl (Where=(Event_Id ne .)) out=Want nodupkey dupout=x;
by Event_Id;
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hello,
- Try to remove duplicates (or better ... rows with duplicate keys) first without sorting.
See approach 3 (final approach) in this paper :
SAS Global Forum 2017 proceedings -- Paper 188-2017
Removing Duplicates Using SAS®
Kirk Paul Lafler, Software Intelligence Corporation, Spring Valley, California
https://support.sas.com/resources/papers/proceedings17/0188-2017.pdf
"A final approach to removing duplicates using PROC SUMMARY and the CLASS statement was illustrated as a more efficient alternative to PROC SORT and PROC SQL, because it eliminates the need for sorting in advance". - Then sort (if that's still needed).
BR, Koen
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Where is this being run - on a SAS server or on a PC? Try runs with and without NODUPKEY. What are the times for each run?
For single processes like sorts, the easiest way to speed processing a lot is to run more than one sort at the same time on different sets of rows. Running two parallel sorts could halve processing time, but you have to join the two datasets together after sorting.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
@SASKiwi wrote:
Where is this being run - on a SAS server or on a PC? Try runs with and without NODUPKEY. What are the times for each run?
For single processes like sorts, the easiest way to speed processing a lot is to run more than one sort at the same time on different sets of rows. Running two parallel sorts could halve processing time, but you have to join the two datasets together after sorting.
Wouldn't it suffice to ensure that the sort executes threaded to achieve similar parallelism with much less coding?
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
@Patrick - Good question. Have you done any testing with the THREADS option on PROC SORT - if so what has been the performance improvement?
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Can you please show code exmaple?
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Maxim 1: Read the Documentation.
PROC SORT Statement provides the relevant options.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
@Ronein - Threaded sorting may already be switched on by default along with how many CPUs this is spread across. You can check if it is by running this:
proc options option = (threads cpucount);
run;
If your results are similar then increasing the CPUCOUNT may not improve performance and should not be set higher than the number of CPUs in your SAS server.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Maxim 3: Know Your Data.
What are the attributes of the 15 columns?
Do you have long character variables, mostly filled with blanks?
Is the dataset stored with a COMPRESS option?
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
You need to provide more information.
You can sort a 20 million observation dataset with 15 variables in very little time:
304 proc sort nodupkey; 305 by event_id; 306 run; NOTE: There were 20000000 observations read from the data set WORK.HAVE. NOTE: 46659 observations with duplicate key values were deleted. NOTE: The data set WORK.HAVE has 19953341 observations and 15 variables. NOTE: PROCEDURE SORT used (Total process time): real time 13.99 seconds cpu time 15.29 seconds
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
/*Add these two option in PROC SORT*/
proc sort data=sashelp.class out=want nodupkey dupout=x sortsize=60G noequals ;
by age;
run;