abeck19
Calcite | Level 5

I was running a PROC SORT on a very large set of data (was running for about 2 weeks) when my SAS server rebooted and all jobs running were stopped. My sort was almost finished (of course!) and I don't have time to wait another 2 weeks for this job to re-run.

The LCK file from this sort is still in my work folder. Is there any way to pick up the job where it left off before the reboot using the LCK? Any suggestions/ideas will be very much appreciated!

18 REPLIES
ballardw
Super User

It may be worth describing your data and the sort options used as that is a very long time for a sort to run.

I suspect you may not be able to do much with the dataset as long as that LCK file exists.

abeck19
Calcite | Level 5

The data is information on impressions of some internet banner ads. Without going into too much detail, included in the data are the user's ID, ad ID, time, website, and some other identifiers. I am dealing with billions of impressions and thus billions of observations - hence the long sort time. I use the tagsort option and am sorting my data on user ID and time so that I can then "roll up" my data by user ID.

Having an admin delete the LCK and/or copying the dataset to a new name would not be an issue - I just don't really have the time to restart the entire sort process. Ideally there would be some way to resume the sort from where it left off using the LCK file but it seems that this is not possible.

Peter_C
Rhodochrosite | Level 12

if you are doing any kind of roll-up as you handle your data, I would suggest:

  1. having your internet-search-trawl create a view rather than data
  2. roll-up that "view" with proc summary

then you only host summary data.

It is impractical to wait 2 weeks for a sort!

If my suggestion is not relevant because you only want to summarize the results of that internet search, try performing your routine in blocks of fifty million rows, interleave the results (SET with a BY statement), then summarize that collection.

Depending on your data it might take only a day (or less)

good luck

abeck19
Calcite | Level 5

Performing the sort in blocks may work - even if it takes a day or two, that's better than another 2 weeks. Any chance you could provide an example/sample code? I would certainly call myself a SAS newbie and am not sure of how I would execute your suggestion.

SASKiwi
PROC Star

Here is a simple example with two sort blocks:

proc sort data = data1;
  by sort_vars;
run;

proc sort data = data2;
  by sort_vars;
run;

data combined;
  set data1 data2;
  by sort_vars;
run;

The combined dataset maintains the sort order of the data1 and data2 datasets - this is interleaving as described by Peter C. Try it yourself on some test data.
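For anyone who wants to try this on test data, here is a minimal sketch. The dataset names (block1, block2, combined), the variables (user_id, time) and the random seed are all made up for illustration; they are not from the original poster's data.

```sas
/* Build two small unsorted test blocks (hypothetical names and variables) */
data block1 block2;
  call streaminit(12345);
  do i = 1 to 10;
    user_id = ceil(rand('uniform') * 5);
    time    = ceil(rand('uniform') * 100);
    if i <= 5 then output block1;
    else output block2;
  end;
  drop i;
run;

/* Sort each block separately on the same keys */
proc sort data=block1; by user_id time; run;
proc sort data=block2; by user_id time; run;

/* Interleave: SET with BY reads the sorted blocks in key order */
data combined;
  set block1 block2;
  by user_id time;
run;

/* Check: a BY statement raises an error if combined is out of order */
data _null_;
  set combined;
  by user_id time;
run;
```

If the last step runs without a "BY variables are not properly sorted" error in the log, the interleaved dataset is in key order.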

Peter_C
Rhodochrosite | Level 12

Thanks for demonstrating the interleave when I was off-line for so long....

I should add a little more:

Because the starting data set is so large, it would be great if it is already in a base SAS dataset. Then we can use dataset options to read only a part of the set

proc sort data=your.data( firstobs=10000001 obs=20000000 )
          out=set10e6 ;
  by sort_vars ;  /* replace sort_vars with your BY variables */
run;

That should sort the 10 million rows from row 10,000,001 to row 20,000,000.

The macro language or even call execute in a datastep could automate this selection of blocks of data "by row number".
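As a rough sketch of that macro-language automation (an illustration only, not tested at this scale): the macro name sort_blocks, its parameters, the work.block1, work.block2, ... output names and the numbers in the call are all hypothetical. It assumes the input is a base SAS dataset, that you know the total row count, and that there is enough disk space for the sorted blocks plus the combined result.

```sas
%macro sort_blocks(in=, by=, total=, size=);
  %local i first last nblocks;

  /* Number of blocks, rounded up so the final partial block is included */
  %let nblocks = %sysevalf(&total / &size, ceil);

  /* Sort each block of rows into its own dataset */
  %do i = 1 %to &nblocks;
    %let first = %eval((&i - 1) * &size + 1);
    %let last  = %eval(&i * &size);
    proc sort data=&in(firstobs=&first obs=&last) out=work.block&i;
      by &by;
    run;
  %end;

  /* Interleave all sorted blocks in one SET ... BY step */
  data work.combined;
    set
    %do i = 1 %to &nblocks;
      work.block&i
    %end;
    ;
    by &by;
  run;
%mend sort_blocks;

/* Hypothetical call: 50 million rows in blocks of 10 million */
%sort_blocks(in=your.data, by=user_id time, total=50000000, size=10000000)
```

With billions of rows this generates hundreds of blocks, so it may be worth trying different block sizes; interleaving that many datasets in one SET statement opens them all at once, which has its own memory cost.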

If you knew all the values of one of those sort keys, even better performance would be available.

PeterC

gergely_batho
SAS Employee

Hi abeck19,

My solution is similar to the previous ones, but it is applicable only if you are lucky enough that your dataset is already sorted by date/time (not by id and then by time, as you require). Since you have web log data, that is quite likely... In the following example it is enough if the data is grouped by day.

And I also assume, you don't need a raw sort, but some kind of aggregation.

proc summary data=in nway;
  var amount;
  class id;
  by datetime;
  format datetime datetime7.;
  output out=out sum=;
run;

Now you have data grouped by day and id. Hopefully it is smaller, and you can process it. You can control the aggregation level with the format statement.

Everyone: Is this really faster: Dividing a dataset, sorting separately, then interleaving? This is what almost all sort algorithms are supposed to do internally (called "merge sort").

Peter_C
Rhodochrosite | Level 12

The big problem in SAS sorting is the sortwork areas. These demand at least twice the original filesize and perhaps 3 or 4 times. And that's before it completes in a reasonable time.

proc summary uses memory arrays to collect its stats - providing a different constraint. Despite that, it provides what I think is the simplest approach.

As I (weakly) demonstrated earlier, proc summary can be executed in hadoop-style blocks. (I should have extended the demo to show consolidation / summarization / rollup of the blocks.)
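A possible sketch of that consolidation step, under some assumptions: the variable user_id, the output names (work.sum1, work.sum2, work.all_sums, work.rollup) and the block boundaries are hypothetical, and the roll-up here is just a count of impressions per user. Counts and sums can simply be summed again across blocks; a mean would need to be rebuilt from its sum and count.

```sas
/* Step 1: summarize one block of rows at a time (counts per user) */
proc summary data=your.data(firstobs=1 obs=50000000) nway;
  class user_id;
  output out=work.sum1(drop=_type_ rename=(_freq_=n_impressions));
run;

proc summary data=your.data(firstobs=50000001 obs=100000000) nway;
  class user_id;
  output out=work.sum2(drop=_type_ rename=(_freq_=n_impressions));
run;

/* Step 2: stack the block summaries through a view (no extra copy) */
data work.all_sums / view=work.all_sums;
  set work.sum1 work.sum2;
run;

/* Step 3: roll up the block-level counts into one row per user */
proc summary data=work.all_sums nway;
  class user_id;
  var n_impressions;
  output out=work.rollup(drop=_type_ _freq_) sum=;
run;
```

Each block summary only has to hold the users that occur in that block, which is what keeps the memory use of the CLASS statement in check; the block size would need tuning against available memory.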

gergely_batho
SAS Employee

Hi All,

I know you know it, but still want to clarify: The example above also executes in blocks. Because of the by statement.

If daily data does not fit into memory, just change the format. Flexible.

jakarman
Barite | Level 11

The .lck file is the temporary name SAS writes under while the step runs; when the step is ready it is renamed to the definitive name, after first deleting the original dataset if that is being overwritten.

As the sort process was in the merge stage there must be some #utl files in a saswork directory.

You cannot restart this process because PROC SORT has no restart facility. Simple and possibly frustrating as it is, you have to start again from your last known correct state.

Possibly you could use checkpoint/restart in your code (see SAS(R) 9.3 Language Reference: Concepts, Second Edition). The restart actions can then be handed to the operations guys.

Still, it is not possible to restart somewhere in the middle of an executing proc.

---->-- ja karman --<-----
jakarman
Barite | Level 11

Going further into the details: I would avoid using TAGSORT (see SAS(R) 9.4 Companion for UNIX Environments, Third Edition, SORT Procedure: UNIX). It is single-threaded and normally the last resort when you have disk-space problems. For a one-off run this will do, but when you need to run it more often there are smarter approaches.

Better tuning of your I/O system (ask your admin) will help, as will splitting a too-big dataset into smaller portions and then doing the merge (as described by Peter C / SASKiwi).

If you still have problems and there is budget, then you could investigate the host sort mentioned there, SyncSort.

---->-- ja karman --<-----
Ksharp
Super User

Using the following code would be better for sorting a large table:

data F M;
  set sashelp.class;
  if sex="F" then output F;
  else if sex="M" then output M;
run;

After that :

proc append ....

Xia Keshan

jakarman
Barite | Level 11

@Xia, nice approach, it makes the data processing easier. With some G (10**9) input records, and possibly stream processing since the source is internet banners (click analysis), I would have expected you to go for the hash-object approach. That can be an interesting way to solve some kinds of problems. I think abeck19 is not that far along, though; we will lose him.

---->-- ja karman --<-----
Ksharp
Super User

Thanks. Actually I took this idea from someone (I don't remember who).

About the hash table you mentioned: it is easy, since the hash table already has a built-in way to sort data, but it is not suitable for this situation (a large table), because no one has that much memory.

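data _null_;
  if 0 then set sashelp.class;  /* define the variables without reading any rows */
  declare hash h(dataset:'sashelp.class', ordered:'a', multidata:'y');
  h.definekey('sex');
  h.definedata(all:'y');
  h.definedone();
  h.output(dataset:'want');     /* write the rows out in ascending key order */
  stop;
run;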

Xia Keshan
