Issue in sorting a large dataset

Contributor
Posts: 31

Hi,

I am trying to sort a large dataset with 367538640 rows. The sort fails because it runs out of disk space. I tried OPTIONS COMPRESS=YES and the TAGSORT option, but TAGSORT takes a very long time. Please suggest any alternatives.

Thanks,


Super Contributor
Posts: 287

Re: Issue in sorting a large dataset

The likely problem is that SAS creates a copy of the dataset, which overwrites the original data when the sort procedure finishes. So if your system has more space on the drive where permanent data is stored, you can tell SAS to use that drive as the WORK directory. Remember to change it back afterwards.

If you use Windows, add

-work "d:\path_to_temporary_workfolder"

to the command line that starts SAS.

See the documentation here: http://support.sas.com/documentation/cdl/en/lesysoptsref/66899/HTML/default/viewer.htm#p1er6tm8fay8u...
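To confirm that the -work setting took effect after restarting SAS, something like the following can be run in the new session. This is only a small check, not part of the original suggestion; it prints the current WORK path to the log.

```sas
/* Write the physical path of the WORK library to the log */
%put NOTE: WORK is at %sysfunc(pathname(work));

/* Show the current value of the WORK system option */
proc options option=work;
run;
```

If the path shown is still the old location, the option was probably not picked up from the command line used to start SAS.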

Valued Guide
Posts: 3,208

Re: Issue in sorting a large dataset

See also: https://communities.sas.com/message/209847#209847

TAGSORT and compressing the final dataset will not help you much.

Sorting requires approximately 3 times the size of the original dataset as intermediate work space.

Overwriting the original dataset adds the need for one more copy. You can redirect the intermediate work files to another location using the UTILLOC system option.

I am assuming you are using a server of some kind with a limited setup. About 367M records is a big number; what is the size of the dataset? If the record size is 100 bytes, it would be about 37 GB.
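The size estimate can be reproduced in SAS itself. The sketch below assumes the 100-byte record length mentioned above (an assumption, not a measured value) and uses X1.XXX purely as a placeholder library and table name for the real dataset. NOBS times OBSLEN gives the uncompressed size, so a compressed file will be smaller on disk.

```sas
/* Back-of-the-envelope estimate: rows times assumed record length */
data _null_;
  nobs   = 367538640;       /* row count from the original post */
  reclen = 100;             /* ASSUMED record length in bytes */
  gb     = nobs * reclen / 1e9;
  work3x = 3 * gb;          /* the "about 3 times" rule of thumb above */
  put gb= 8.2 work3x= 8.2;
run;

/* Actual row count and record length from the dictionary tables.
   X1 and XXX are placeholders; librefs and member names are stored in uppercase. */
proc sql;
  select nobs,
         obslen,
         nobs * obslen / 1e9 as approx_gb format=8.2
    from dictionary.tables
   where libname = 'X1' and memname = 'XXX';
quit;
```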

Unless your logic absolutely requires the sort, there may be better solutions to your original problem.

If you really do need this sort, you could split the big dataset into several smaller ones, sort each of them, and merge the sorted pieces in a dedicated step.

---->-- ja karman --<-----
Super User
Posts: 6,963

Re: Issue in sorting a large dataset

UTILLOC in the configuration file lets you specify a location other than WORK for the temporary sort file. This reduces the space requirement for the file being sorted to 2x.
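As a rough illustration: UTILLOC has to be set when SAS starts (in the configuration file or on the command line), so it cannot be changed with an OPTIONS statement in a running session. The path in the comment below is hypothetical; PROC OPTIONS shows what the current session is actually using.

```sas
/* In the configuration file or on the SAS command line (hypothetical path):
     -utilloc "e:\hypothetical_sas_util"
   The option cannot be changed once the session has started. */

/* Show where utility files, including sort work files, are written */
proc options option=utilloc;
run;
```

If the log shows UTILLOC=WORK, utility files are still going to the WORK location.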

If you do

proc sort data=x1.xxx out=x2.xxx; by sortcrit; run;

where x1 and x2 are libraries on different file systems, this may also help prevent an out-of-space condition, because you "only" need free space equal to the size of xxx once in the UTILLOC location and once in the x2 location.

Beyond that, I recommend what Jaap suggested: split the file, sort each part on its own, and then do:

data want;
  set
    have1
    have2
    ...
    haven
  ;
  by sortcrit;
run;

This is called interleaving; the sort order is preserved.
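The whole split, sort and interleave approach could be sketched roughly as follows. This is an illustration under assumptions, not a tested recipe: X1.XXX, X2 and SORTCRIT are the placeholder names used in this thread, the macro name SPLIT_SORT is made up, and the choice of 4 chunks of 100 million rows is arbitrary (it only needs to cover all 367538640 rows).

```sas
/* Sketch only: X1.XXX, X2 and SORTCRIT are placeholders from this thread;
   the macro name, chunk count and chunk size are assumptions. */
%macro split_sort(n=4, chunk=100000000);
  %local i first last;
  %do i = 1 %to &n;
    %let first = %eval((&i - 1) * &chunk + 1);
    %let last  = %eval(&i * &chunk);
    /* Sort one slice of the input, so the utility files scale with
       the slice rather than with the whole table */
    proc sort data=x1.xxx(firstobs=&first obs=&last) out=x2.part&i;
      by sortcrit;
    run;
  %end;
%mend split_sort;

%split_sort(n=4, chunk=100000000)

/* Interleave the sorted parts; the BY statement keeps the result in sort order */
data x2.want;
  set x2.part1 x2.part2 x2.part3 x2.part4;
  by sortcrit;
run;
```

An OBS= value past the end of the dataset simply reads to the last row, so the final chunk does not need an exact upper bound. The parts and the final dataset are all written to X2 here for brevity; if space is tight, they can go to different libraries.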
