SAS Programming

Ronein · Posted 11-15-2023 03:52 PM

Hello,

I have a SAS data set with 15 columns and 20 million rows.

I want to use sort+nodupkey in order to delete duplicate rows.

I run this short query and it is running long time (More than 20 minutes) and still didn't finish.

Any idea what to do in order to speed up the query?

proc sort data=r_r.My_tbl (Where=(Event_Id ne .))  out=Want nodupkey dupout=x;
by Event_Id;
run;

sbxkoenk · Posted 11-15-2023 04:05 PM

Hello,

Try to remove duplicates (or better ... rows with duplicate keys) first without sorting.
See approach 3 (final approach) in this paper :
SAS Global Forum 2017 proceedings -- Paper 188-2017
Removing Duplicates Using SAS®
Kirk Paul Lafler, Software Intelligence Corporation, Spring Valley, California
https://support.sas.com/resources/papers/proceedings17/0188-2017.pdf
"A final approach to removing duplicates using PROC SUMMARY and the CLASS statement was illustrated as a more efficient alternative to PROC SORT and PROC SQL, because it eliminates the need for sorting in advance".
Then sort (if that's still needed).

BR, Koen

SASKiwi · Posted 11-15-2023 04:27 PM

Where is this being run - on a SAS server or on a PC? Try runs with and without NODUPKEY. What are the times for each run?

For single processes like sorts, the easiest way to speed processing a lot is to run more than one sort at the same time on different sets of rows. Running two parallel sorts could halve processing time, but you have to join the two datasets together after sorting.

Patrick · Posted 11-15-2023 08:28 PM

@SASKiwi wrote:

Where is this being run - on a SAS server or on a PC? Try runs with and without NODUPKEY. What are the times for each run?

For single processes like sorts, the easiest way to speed processing a lot is to run more than one sort at the same time on different sets of rows. Running two parallel sorts could halve processing time, but you have to join the two datasets together after sorting.

Wouldn't it suffice to ensure that the sort executes threaded to achieve similar parallelism with much less coding?

SASKiwi · Posted 11-15-2023 08:53 PM

@Patrick - Good question. Have you done any testing with the THREADS option on PROC SORT - if so what has been the performance improvement?

Ronein · Posted 11-16-2023 02:16 AM

Can you please show code exmaple?

Kurt_Bremser · Posted 11-16-2023 05:27 AM

Maxim 1: Read the Documentation.

PROC SORT Statement provides the relevant options.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

SASKiwi · Posted 11-20-2023 03:53 PM

@Ronein - Threaded sorting may already be switched on by default along with how many CPUs this is spread across. You can check if it is by running this:

proc options option = (threads cpucount);
run;

If your results are similar then increasing the CPUCOUNT may not improve performance and should not be set higher than the number of CPUs in your SAS server.

Kurt_Bremser · Posted 11-15-2023 04:43 PM

Maxim 3: Know Your Data.

What are the attributes of the 15 columns?

Do you have long character variables, mostly filled with blanks?

Is the dataset stored with a COMPRESS option?

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Tom · Posted 11-15-2023 09:17 PM

You need to provide more information.

You can sort a 20 million observation dataset with 15 variables in very little time:

304  proc sort nodupkey;
305   by event_id;
306  run;

NOTE: There were 20000000 observations read from the data set WORK.HAVE.
NOTE: 46659 observations with duplicate key values were deleted.
NOTE: The data set WORK.HAVE has 19953341 observations and 15 variables.
NOTE: PROCEDURE SORT used (Total process time):
      real time           13.99 seconds
      cpu time            15.29 seconds

Ksharp · Posted 11-15-2023 09:36 PM

/*Add these two option in PROC SORT*/
proc sort data=sashelp.class out=want nodupkey dupout=x  sortsize=60G  noequals  ;
by age;
run;

SAS Programming

Proc sort+Nodupkey Long time run

Re: Proc sort+Nodupkey Long time run

Re: Proc sort+Nodupkey Long time run

Re: Proc sort+Nodupkey Long time run

Re: Proc sort+Nodupkey Long time run

Re: Proc sort+Nodupkey Long time run

Re: Proc sort+Nodupkey Long time run

Re: Proc sort+Nodupkey Long time run

Re: Proc sort+Nodupkey Long time run

Re: Proc sort+Nodupkey Long time run

Re: Proc sort+Nodupkey Long time run

Proc sort nodupkey

PROC SORT with NODUPKEY

Long running proc sql query.

From long format to short data format

Calculating the run time of running multiple programs at once

Follow Us

What is...

SAS Programming

Our biggest data and AI event of the year.

SAS Training: Just a Click Away

Follow Us

What is...