DATA Step, Macro, Functions and more

Sorting large datasets on multiple variables

Reply
Occasional Contributor
Posts: 17

Sorting large datasets on multiple variables

Hello everyone,

 

I have been trying to sort a large dataset on 4 variables. The dataset is 100GB and it is using up the entire work space in the background and is unable to complete the process. Is there any efficient way other than the traditional proc sort or proc sql. I have tried both.

 

PROC SORT DATA = xyz;

BY a b c d;

RUN;

 

PROC SQL;

CREATE TABLE xyz1

SELECT *

FROM XYZ

ORDER BY a,b,c,d;

QUIT;

 

Thank you for your time

Super User
Posts: 3,102

Re: Sorting large datasets on multiple variables

Try compressing your table first then sorting with compression on:

options compress = binary;

DATA xyz;
  set xyz;
run;

proc sort data = xyz;
etc....

 

Occasional Contributor
Posts: 17

Re: Sorting large datasets on multiple variables

Hello,

Thanks for the suggestion. The dataset which I am using is compressed
already.



##- Please type your reply above this line. Simple formatting, no
attachments. -##
Super User
Posts: 3,102

Re: Sorting large datasets on multiple variables

[ Edited ]

Ok, then sort the dataset in chunks and interleave the chunks into one final sorted version:

 

proc sort data = xyz (firstobs = 1 obs = 100000)
               out = chunk1
              ;
  by a b c d;
run;

proc sort data = xyz (firstobs = 100001 obs = 200000)
               out = chunk2
              ;
  by a b c d;
run;

data final
  set chunk1
      chunk2;
  by a b c d;
run;
       
Super User
Posts: 17,785

Re: Sorting large datasets on multiple variables

@SASKiwi That doesn't guarantee that the chunks will be in the overall correct sort order, does it?

Does it find the 'smallest' records required and then put them to the output data set?

Super User
Posts: 3,102

Re: Sorting large datasets on multiple variables

@Reeza I would have thought that SET plus BY will interleave the chunks to maintain the correct sorted order. Happy to be put right if that is not correct.

Super User
Posts: 6,933

Re: Sorting large datasets on multiple variables


Reeza wrote:

@SASKiwi That doesn't guarantee that the chunks will be in the overall correct sort order, does it?

Does it find the 'smallest' records required and then put them to the output data set?


It has been my experience (up to now) that using equally sorted datasets in one set statement, followed by the respective by, will keep the sort order in the output.

Like

%macro test_sort (num);
%do i = 1 %to #
proc sort data=sashelp.class out=class&i;
by weight height age sex name;
run;
%end;

data want;
set
%do i = 1 %to #
  class&i
%end;
;
by weight height age sex name;
run;
%mend;
%test_sort(5);
proc print data=want;run;
---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
Frequent Contributor
Posts: 117

Re: Sorting large datasets on multiple variables

@KurtBremser The macro code you provided creates 5x19 observations. Its not subsetting the data. It is just repeating 5 times.

Super User
Posts: 6,933

Re: Sorting large datasets on multiple variables


SuryaKiran wrote:

@KurtBremser The macro code you provided creates 5x19 observations. Its not subsetting the data. It is just repeating 5 times.


You completely missed what I was pointing at. I just wanted to illustrate that using a set and by statement to combine separately sorted datasets will preserve the correct sort order in the output dataset.

Please observe that this discussion thread deals with sorting, not subsetting.

---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
Super User
Posts: 6,933

Re: Sorting large datasets on multiple variables


SASKiwi wrote:

Try compressing your table first then sorting with compression on:

options compress = binary;

DATA xyz;
  set xyz;
run;

proc sort data = xyz;
etc....

 


I just ran a test here with SAS 9.2 on AIX.

The compress system option does not affect the utility file that is created during the sort, so sorting a compressed dataset of 100 GB might easily eat 1 TB of WORK (although the final output will once again be only 100 GB).

Sorting large compressed datasets is done best with the tagsort option.

---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
Super User
Posts: 17,785

Re: Sorting large datasets on multiple variables

100GB on anything is going to be a bit of a nightmare. 

 

What are are you trying to achieve? 

Respected Advisor
Posts: 4,644

Re: Sorting large datasets on multiple variables

Have you considered using the TAGSORT option?

PG
Occasional Contributor
Posts: 17

Re: Sorting large datasets on multiple variables

I have not used it. Does it reduce the background memory usage ?

Respected Advisor
Posts: 4,644

Re: Sorting large datasets on multiple variables

That's what the doc says

 

"The TAGSORT option is useful in single-threaded situations where there might not be enough disk space to sort a large SAS data set. The TAGSORT option is not supported for multi-threaded sorts.
When you specify the TAGSORT option, only sort keys (that is, the variables specified in the BY statement) and the observation number for each observation are stored in the temporary files. The sort keys, together with the observation number, are referred to as tags. At the completion of the sorting process, the tags are used to retrieve the records from the input data set in sorted order. Thus, in cases where the total number of bytes of the sort keys is small compared with the length of the record, temporary disk use is reduced considerably. However, you should have enough disk space to hold another copy of the data (the output data set) or two copies of the tags, whichever is greater. Note that although using the TAGSORT option can reduce temporary disk use, the processing time might be much higher."
PG
Respected Advisor
Posts: 3,887

Re: Sorting large datasets on multiple variables

I consider it as a sub-optimal approach if you have to partially hand code sorting only because of a lack of disk space. May be have a chat with your SAS Admin if there is something that could be done.

You want a fast disk for UTILLOC but it doesn't have to be the same than WORK and it can also be pointed to multiple disks.

 

If extending disk space is not an option then I'd go for TAGSORT as already suggested. Creation of a sorted data set will likely take much longer but you shouldn't run out of disk space anymore.

Ask a Question
Discussion stats
  • 19 replies
  • 669 views
  • 4 likes
  • 11 in conversation