Solved: Problem with PROC SORT in a large dataset - Page 2

Phil_NZ · Posted 03-21-2021 05:46 PM

What a small world!! A greeting to Massey alumni! Thank you for your help and suggestion so far!

Warmest regards!

Phil.

Massey Business School

Thank you for your help, have a fabulous and productive day! I am a novice today, but someday when I accumulate enough knowledge, I can help others in my capacity.

FreelanceReinh · Posted 03-19-2021 06:11 AM

Hi @Phil_NZ,

I'm confident that using the TAGSORT option, as suggested by ballardw, will help because the error message suggests that a utility file got too big.

proc sort data=bigds tagsort;
by byvar1 ... byvarn;
run;

If even this reduction of the space requirements was not sufficient, I would check if more disk space could be made available by changing the value of the UTILLOC= system option (i.e., specify a different or an additional location).

If all else fails, you can still consider changes to the sort process (e.g., split the large dataset, ...).

ChrisNZ · Posted 03-21-2021 09:49 PM

> the error message suggests that a utility file got too big.

Another option is to store the utility files in a compressed folder.

High-Performance SAS Coding - Third Edition

mkeintz · Posted 03-21-2021 03:50 PM

You are using Compustat data, and you apparently created an identifier variable GVKEYIID from the company id (GVKEY) and the stock issue (IID).

If your data originally came from WRDS it was originally sorted by GVKEY/DATADATE (for company data) or by GVKEY/IID/DATADATE (for stock issue data).

Of course, you may have re-ordered it in your intermediate work. But I wouldn't be surprised if your data currently is sorted by GVKEY, if not already by GVKEY/IID/DATADATE (or maybe GVKEY/DATADATE/IID).

If so you can use space-saving and time-saving techniques to convert a dataset already sorted by group to a dataset sorted by detail within group.

In particular, if is is sorted by GVKEY, you don't need a proc sort. Instead you can (a) read in all details records for one GVKEY, (b) sort the detail records in memory by GVKEYIID/DATADATE and (c) output, (d) clear the in-memory data container. Then go to the next GVKEY and repeat.

This approach would take almost NO temporary disk space (and would be much faster). Here's how the code looks.

data want (drop=_:);
  set have;
  by gvkey;
  if _n_=1 then do;
    declare hash h (dataset:'have (obs=0)',ordered:'A',multidata:'Y');
     h.definekey('gvkeyiid','datadate');
     h.definedata(all:'Y');
   declare hiter hi ('h');
  end;
  h.add();
  if last.gvkey;
  do _rc=hi.first() by 0 until(_rc^=0);
    output;
    _rc=hi.next();
  end;
  _rc=h.clear();
run;

To check in advance whether the data are already ordered by GVKEY, just run:

data _null_;
  set have;
  by gvkey;
run;

If there are no error messages, then the order by gvkey is true. Obviously you could even do this by your specific variables to see if the desired final order already is in place.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Phil_NZ · Posted 03-21-2021 04:10 PM

Hi @mkeintz

Thank you for your specific suggestion. It seems that it is time for me to learn about the HASH objects because I also saw @Kurt_Bremser give me a lot of suggestions relating HASH OBJECT recently.

Warmest regards.

Phil

Thank you for your help, have a fabulous and productive day! I am a novice today, but someday when I accumulate enough knowledge, I can help others in my capacity.

ChrisNZ · Posted 03-21-2021 09:53 PM

It seems that you really should start considering storing your data in binary-compressed SPDE files. Not only will they occupy (much) less space on disk, but SPDE can also avoid using proc sort at all, as it can sort its data on the fly faster than proc sort can. So you can use a BY statement in a step without having to sort first. Of course if you do that several times, sorting becomes the best option.

High-Performance SAS Coding - Third Edition

Re: Problem with PROC SORT in a large dataset

Re: Problem with PROC SORT in a large dataset

Re: Problem with PROC SORT in a large dataset

Re: Problem with PROC SORT in a large dataset

Re: Problem with PROC SORT in a large dataset

Re: Problem with PROC SORT in a large dataset

Re: Problem with PROC SORT in a large dataset

Re: Problem with PROC SORT in a large dataset

Re: Problem with PROC SORT in a large dataset

Re: Problem with PROC SORT in a large dataset

Re: Problem with PROC SORT in a large dataset

Re: Problem with PROC SORT in a large dataset

Register Today!

SAS Training: Just a Click Away