BookmarkSubscribeRSS Feed
div44
Calcite | Level 5

Hello everyone,

 

I have been trying to sort a large dataset on 4 variables. The dataset is 100GB and it is using up the entire work space in the background and is unable to complete the process. Is there any efficient way other than the traditional proc sort or proc sql. I have tried both.

 

PROC SORT DATA = xyz;

BY a b c d;

RUN;

 

PROC SQL;

CREATE TABLE xyz1

SELECT *

FROM XYZ

ORDER BY a,b,c,d;

QUIT;

 

Thank you for your time

42 REPLIES 42
SASKiwi
PROC Star

Try compressing your table first then sorting with compression on:

options compress = binary;

DATA xyz;
  set xyz;
run;

proc sort data = xyz;
etc....

 

div44
Calcite | Level 5
Hello,

Thanks for the suggestion. The dataset which I am using is compressed
already.



##- Please type your reply above this line. Simple formatting, no
attachments. -##
SASKiwi
PROC Star

Ok, then sort the dataset in chunks and interleave the chunks into one final sorted version:

 

proc sort data = xyz (firstobs = 1 obs = 100000)
               out = chunk1
              ;
  by a b c d;
run;

proc sort data = xyz (firstobs = 100001 obs = 200000)
               out = chunk2
              ;
  by a b c d;
run;

data final
  set chunk1
      chunk2;
  by a b c d;
run;
       
Reeza
Super User

@SASKiwi That doesn't guarantee that the chunks will be in the overall correct sort order, does it?

Does it find the 'smallest' records required and then put them to the output data set?

SASKiwi
PROC Star

@Reeza I would have thought that SET plus BY will interleave the chunks to maintain the correct sorted order. Happy to be put right if that is not correct.

Kurt_Bremser
Super User

@Reeza wrote:

@SASKiwi That doesn't guarantee that the chunks will be in the overall correct sort order, does it?

Does it find the 'smallest' records required and then put them to the output data set?


It has been my experience (up to now) that using equally sorted datasets in one set statement, followed by the respective by, will keep the sort order in the output.

Like

%macro test_sort (num);
%do i = 1 %to #
proc sort data=sashelp.class out=class&i;
by weight height age sex name;
run;
%end;

data want;
set
%do i = 1 %to #
  class&i
%end;
;
by weight height age sex name;
run;
%mend;
%test_sort(5);
proc print data=want;run;
SuryaKiran
Meteorite | Level 14

@Kurt_Bremser The macro code you provided creates 5x19 observations. Its not subsetting the data. It is just repeating 5 times.

Thanks,
Suryakiran
Kurt_Bremser
Super User

@SuryaKiran wrote:

@Kurt_Bremser The macro code you provided creates 5x19 observations. Its not subsetting the data. It is just repeating 5 times.


You completely missed what I was pointing at. I just wanted to illustrate that using a set and by statement to combine separately sorted datasets will preserve the correct sort order in the output dataset.

Please observe that this discussion thread deals with sorting, not subsetting.

Kurt_Bremser
Super User

@SASKiwi wrote:

Try compressing your table first then sorting with compression on:

options compress = binary;

DATA xyz;
  set xyz;
run;

proc sort data = xyz;
etc....

 


I just ran a test here with SAS 9.2 on AIX.

The compress system option does not affect the utility file that is created during the sort, so sorting a compressed dataset of 100 GB might easily eat 1 TB of WORK (although the final output will once again be only 100 GB).

Sorting large compressed datasets is done best with the tagsort option.

Reeza
Super User

100GB on anything is going to be a bit of a nightmare. 

 

What are are you trying to achieve? 

PGStats
Opal | Level 21

Have you considered using the TAGSORT option?

PG
div44
Calcite | Level 5

I have not used it. Does it reduce the background memory usage ?

PGStats
Opal | Level 21

That's what the doc says

 

"The TAGSORT option is useful in single-threaded situations where there might not be enough disk space to sort a large SAS data set. The TAGSORT option is not supported for multi-threaded sorts.
When you specify the TAGSORT option, only sort keys (that is, the variables specified in the BY statement) and the observation number for each observation are stored in the temporary files. The sort keys, together with the observation number, are referred to as tags. At the completion of the sorting process, the tags are used to retrieve the records from the input data set in sorted order. Thus, in cases where the total number of bytes of the sort keys is small compared with the length of the record, temporary disk use is reduced considerably. However, you should have enough disk space to hold another copy of the data (the output data set) or two copies of the tags, whichever is greater. Note that although using the TAGSORT option can reduce temporary disk use, the processing time might be much higher."
PG
Patrick
Opal | Level 21

I consider it as a sub-optimal approach if you have to partially hand code sorting only because of a lack of disk space. May be have a chat with your SAS Admin if there is something that could be done.

You want a fast disk for UTILLOC but it doesn't have to be the same than WORK and it can also be pointed to multiple disks.

 

If extending disk space is not an option then I'd go for TAGSORT as already suggested. Creation of a sorted data set will likely take much longer but you shouldn't run out of disk space anymore.

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 42 replies
  • 5645 views
  • 5 likes
  • 14 in conversation