LisaXu
Fluorite | Level 6

I have been trying to sort a dataset and then dedup it, but I keep getting the ERROR message below, even after I broke the dataset down to monthly chunks, which reduced it from 15 million to 1.1 million records. Can anyone give me some suggestions? I also tried a Linux server and it failed with the same error message.

 

proc sort data=zz.Out2019_01 out=zz.Out2019_01a (compress=yes);
  by cust_account_number dt_outage descending event_duration;
run;

ERROR: No disk space is available for the write operation. Filename =
C:\Users\wxu\AppData\Local\Temp\SAS Temporary
Files\SAS_util0001000033CC_D-6HPY9Z2\ut33CC000008.utl.
ERROR: Failure while attempting to write page 1851 of sorted run 124.
ERROR: Failure while attempting to write page 647355 to utility file 1.
ERROR: Failure encountered while creating initial set of sorted runs.
ERROR: Failure encountered during external sort.
ERROR: Sort execution failure.

1 ACCEPTED SOLUTION

Accepted Solutions
Kurt_Bremser
Super User

Run this for a quick shot:

proc sql;
select
  max(length(a)) as max_a,
  max(length(b)) as max_b,
  /* and so on */
  max(length(x)) as max_x /* no comma here! */
from your_dataset;
quit;

Then do this:

data your_data_improved;
length /* set the lengths here according to what the SQL revealed */;
set your_data;
run;

You will get a WARNING about truncated data, but you can safely ignore that, as (because of the SQL) you know what you are doing.
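
For illustration, if the SQL step had reported max_a=8, max_b=25, and max_x=40 (hypothetical values), the corrective step could look like this:

data your_data_improved;
  /* lengths taken from the PROC SQL maximums; names and values are examples only */
  length a $8 b $25 x $40;
  set your_data;
run;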

 

Alternatively, you can set appropriate lengths in the step that reads your data from the external source, if you do that with a data step. If you used PROC IMPORT for a text file (CSV or similar), you can extract the generated data step from the log and modify it to your needs.
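
As an illustration of that alternative, a hand-written replacement for a PROC IMPORT of a CSV could look like the sketch below; the file name, the comment variable, the informats, and all lengths are assumptions made up for the example:

data zz.Out2019_01;
  /* explicit lengths replace the ones guessed by PROC IMPORT;
     path, informats, and lengths are assumptions for illustration */
  infile 'C:\data\out2019_01.csv' dsd firstobs=2 truncover;
  length cust_account_number $12 comment $40;
  informat dt_outage yymmdd10.;
  format dt_outage yymmdd10.;
  input cust_account_number $ dt_outage event_duration comment $;
run;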


15 REPLIES
PeterClemmensen
Tourmaline | Level 20

What is your goal here? Just deduping?

LisaXu
Fluorite | Level 6

I need to sort it first so that, when there are multiple records for the same day, the one with the longest event_duration is listed first. Then I'll keep the longest event for that day for that account. Thanks!!

ChrisNZ
Tourmaline | Level 20

> I'll keep the longest event for that day for that account

To dedupe, there is no need to use PROC SORT again; a data step will never fail and will probably be faster, as sketched below.
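
For instance, given the sort order from the question (descending event_duration within account and day), a BY-group data step keeps only the longest event per account and day. A minimal sketch; the output dataset name is a placeholder:

data zz.Out2019_01_dedup;
  set zz.Out2019_01a;
  by cust_account_number dt_outage;
  /* the first record in each account/day group is the longest event,
     because event_duration was sorted descending */
  if first.dt_outage;
run;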

mkeintz
PROC Star

@Kurt_Bremser 

 

I agree that TAGSORT is a likely solution for the reported problem. But why would the dataset's compress status make TAGSORT any more advisable than if it were not compressed?

 

But if the same disk is used for the output dataset (libname zz) and the intermediate files (in C:\Users\wxu\AppData\Local\Temp\), then I see the benefit of a careful choice of the compress option on the output.

 

Kurt_Bremser
Super User

Without TAGSORT, the utility file contains all the data from the source, but because of the way it is built, overall I/O is kept to a bare minimum, so the sort is remarkably fast.

But the utility file is uncompressed, so for a compressed dataset with a high compression rate it will be much larger than the source; TAGSORT can then be the only way to make the sort work at all, albeit with a performance penalty.

That's why I would not recommend TAGSORT for uncompressed (or moderately compressed) datasets.
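
Applied to the failing step from the question, the TAGSORT variant is a one-option change:

/* TAGSORT keeps only the BY variables plus a record tag in the utility file,
   trading extra passes over the source for far less temporary disk space */
proc sort data=zz.Out2019_01 out=zz.Out2019_01a (compress=yes) tagsort;
  by cust_account_number dt_outage descending event_duration;
run;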

mkeintz
PROC Star

Can you confirm that none of your 42 variables are needlessly long (e.g., 10 variables each defined as 200 characters, even though their values never need more than 10)? If so, you might be able to substantially reduce the size of your dataset (and therefore avoid exceeding the available disk space for intermediate sort files) without any loss of information.

LisaXu
Fluorite | Level 6
Thanks to you, I looked into the lengths of the variables. Appreciate it!
ChrisNZ
Tourmaline | Level 20

It's odd that the Windows server and the Unix server both run out of disk space. 

Are the disks full on both?

 

TAGSORT is an option.

Another thing you may try, to free up space, is enabling compression on the utility folder C:\Users\wxu\AppData\Local\Temp\SAS Temporary Files, since the files stored there are not compressed.
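
On Windows, one way to do that is NTFS folder compression, for example via the COMPACT command. A sketch from within SAS, assuming the XCMD system option is enabled (so operating-system commands are allowed):

/* enable NTFS compression on the SAS utility folder and its contents;
   assumes an NTFS drive and XCMD; the path is taken from the error log */
x 'compact /c /s:"C:\Users\wxu\AppData\Local\Temp\SAS Temporary Files"';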

LisaXu
Fluorite | Level 6

Sorry that I missed so many responses; I was in a training. Thanks so much for taking the time to respond to my question. To answer the questions:

I thought I didn't have any super-long variables, but when I read the posts and checked, I found that quite a few imported variables have a $32767 length. Maybe this is why the sort takes so much space? I watched the temp folder run out of space, from 430G free down to the final "out of space" message, and wondered how sorting 1G of data could take that much space. Maybe those variables are the reason? Any comments?

 

I reassigned the WORK library location and the sorting completed.

I also created two indexes on two variables (a sketch follows). I'll see whether I can run the whole year's data now with the lengths modified.
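
For reference, indexes can be added to an existing dataset with PROC DATASETS. The thread does not say which two variables were indexed, so the names below are guesses based on the BY statement in the question:

proc datasets lib=zz nolist;
  modify Out2019_01;
  index create cust_account_number; /* assumed index variable */
  index create dt_outage;           /* assumed index variable */
quit;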

 

Thanks so much to you all!

Kurt_Bremser
Super User

First of all, before you do anything else, take a good look at the content of those variables, and fix or amend your import process. Depending on the source type (Excel or text), you either have to add a correcting data step that reduces the length to what is actually needed, or you can set the correct length in the data step that reads your data. PROC IMPORT is good for a "first shot", but not for a really good (and consistent!) result.

Some of these variables might even be missing for the whole dataset, so you should simply drop them.
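
A quick check for such all-blank variables, in the same spirit as the PROC SQL length check from the accepted solution (the variable name is a placeholder):

proc sql;
  /* for a character variable, a maximum length of 1 together with zero
     non-missing values means the column is blank throughout and can be dropped */
  select max(length(some_char_var)) as max_len,
         count(some_char_var)       as n_nonmissing
  from zz.Out2019_01;
quit;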

 

From your description, I am now very sure that what you are experiencing is a consequence of the dataset being stored with the COMPRESS=YES option, combined with lots of overlong and sparsely populated variables. Compression rates of 99% and better can easily result from this, and the uncompressed utility file (from either a sort or an SQL step) blows up your WORK/UTIL location unless you use TAGSORT (which stores only the BY variable(s) and a record index in the utility file).

 

So, once again, get to know your data, improve/fix your import process, and if you still end up with large character variables and a good compression rate, use TAGSORT.
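
As a starting point for getting to know the data, PROC CONTENTS lists each variable's defined length and shows whether the dataset is stored compressed:

/* the Compressed attribute and the variable listing reveal both the
   COMPRESS=YES status and any overlong character variables */
proc contents data=zz.Out2019_01;
run;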

 

LisaXu
Fluorite | Level 6

Yes, I learned my lesson. I normally check a dataset before I work on it. This one I read from the data lake; it looked normal when I viewed it in SAS, and I was in a rush to report some summaries to the team, so it didn't register that the variables were so long. I'm working on a macro to adjust the lengths now and am also adding TAGSORT (I did try TAGSORT before I posted the question). This is the first time I've seen variables defined at this length. I do appreciate your feedback, Kurt.

LisaXu
Fluorite | Level 6

All the character variables from the data lake are defined as $32767. Is there a quick way to find out the true length of each of those 18 variables and then reassign the lengths? I used trim(), but that didn't change the defined length.

 

Thanks again!

Lisa

Kurt_Bremser
Super User

See the accepted solution at the top of this thread: run the PROC SQL length check, then set the corrected lengths in a data step (or directly in the step that reads the data from the external source).

