BookmarkSubscribeRSS Feed
Ronein
Meteorite | Level 14

Hello,

I have a SAS data set with 15 columns and 20 million rows.

I want to use sort+nodupkey in order to delete duplicate rows.

I run this short query and it is running  long time (More than 20 minutes) and still didn't finish.

Any idea what to do in order to speed up the query?

proc sort data=r_r.My_tbl (Where=(Event_Id ne .))  out=Want nodupkey dupout=x;
by Event_Id;
run;
10 REPLIES 10
sbxkoenk
SAS Super FREQ

Hello,

  • Try to remove duplicates (or better ... rows with duplicate keys) first without sorting.
    See approach 3 (final approach) in this paper :
    SAS Global Forum 2017 proceedings -- Paper 188-2017
    Removing Duplicates Using SAS®
    Kirk Paul Lafler, Software Intelligence Corporation, Spring Valley, California
    https://support.sas.com/resources/papers/proceedings17/0188-2017.pdf
    "A final approach to removing duplicates using PROC SUMMARY and the CLASS statement was illustrated as a more efficient alternative to PROC SORT and PROC SQL, because it eliminates the need for sorting in advance".
  • Then sort (if that's still needed).

BR, Koen

SASKiwi
PROC Star

Where is this being run - on a SAS server or on a PC? Try runs with and without NODUPKEY. What are the times for each run?

 

For single processes like sorts, the easiest way to speed processing a lot is to run more than one sort at the same time on different sets of rows. Running two parallel sorts could halve processing time, but you have to join the two datasets together after sorting. 

Patrick
Opal | Level 21

@SASKiwi wrote:

Where is this being run - on a SAS server or on a PC? Try runs with and without NODUPKEY. What are the times for each run?

 

For single processes like sorts, the easiest way to speed processing a lot is to run more than one sort at the same time on different sets of rows. Running two parallel sorts could halve processing time, but you have to join the two datasets together after sorting. 


Wouldn't it suffice to ensure that the sort executes threaded to achieve similar parallelism with much less coding?

SASKiwi
PROC Star

@Patrick - Good question. Have you done any testing with the THREADS option on PROC SORT - if so what has been the performance improvement? 

Ronein
Meteorite | Level 14

Can you please show code exmaple?

SASKiwi
PROC Star

@Ronein - Threaded sorting may already be switched on by default along with how many CPUs this is spread across. You can check if it is by running this:

proc options option = (threads cpucount);
run;

SASKiwi_1-1700513314613.png

If your results are similar then increasing the CPUCOUNT may not improve performance and should not be set higher than the number of CPUs in your SAS server. 

 

Kurt_Bremser
Super User

Maxim 3: Know Your Data.

What are the attributes of the 15 columns?

Do you have long character variables, mostly filled with blanks?

Is the dataset stored with a COMPRESS option?

Tom
Super User Tom
Super User

You need to provide more information.

You can sort a 20 million observation dataset with 15 variables in very little time:

 

304  proc sort nodupkey;
305   by event_id;
306  run;

NOTE: There were 20000000 observations read from the data set WORK.HAVE.
NOTE: 46659 observations with duplicate key values were deleted.
NOTE: The data set WORK.HAVE has 19953341 observations and 15 variables.
NOTE: PROCEDURE SORT used (Total process time):
      real time           13.99 seconds
      cpu time            15.29 seconds
Ksharp
Super User
/*Add these two option in PROC SORT*/
proc sort data=sashelp.class out=want nodupkey dupout=x  sortsize=60G  noequals  ;
by age;
run;

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 10 replies
  • 1260 views
  • 4 likes
  • 7 in conversation