05-05-2014 11:00 AM
I was running a PROC SORT on a very large set of data (was running for about 2 weeks) when my SAS server rebooted and all jobs running were stopped. My sort was almost finished (of course!) and I don't have time to wait another 2 weeks for this job to re-run.
The LCK file from this sort is still in my work folder. Is there any way to pick up the job where it left off before the reboot using the LCK? Any suggestions/ideas will be very much appreciated!
05-05-2014 11:55 AM
It may be worth describing your data and the sort options used as that is a very long time for a sort to run.
I suspect you may not be able to do much with the dataset as long as that LCK file exists.
05-05-2014 12:07 PM
The data is information on impressions of some internet banner ads. Without going into too much detail, included in the data are the user's ID, ad ID, time, website, and some other identifiers. I am dealing with billions of impressions and thus billions of observations - hence the long sort time. I use the tagsort option and am sorting my data on user ID and time so that I can then "roll up" my data by user ID.
Having an admin delete the LCK and/or copying the dataset to a new name would not be an issue - I just don't really have the time to restart the entire sort process. Ideally there would be some way to resume the sort from where it left off using the LCK file but it seems that this is not possible.
05-05-2014 12:17 PM
if you are doing any kind of roll-up as you handle your data, I would suggest summarizing as you collect or load the data; then you only host summary data.
It is impractical to wait 2 weeks for a sort!
If my suggestion is not relevant because you only want to summarize the results of that internet data, try performing your routine in blocks of fifty million rows, interleaving the results (SET with a BY statement), and then summarizing that collection.
Depending on your data it might take only a day (or less).
05-05-2014 12:24 PM
Performing the sort in blocks may work - even if it takes a day or two, that's better than another 2 weeks. Any chance you could provide an example/sample code? I would certainly call myself a SAS newbie and am not sure of how I would execute your suggestion.
05-05-2014 08:49 PM
Here is a simple example with two sort blocks:
proc sort data=data1; by userid time; run;
proc sort data=data2; by userid time; run;
data combined; set data1 data2; by userid time; run;
The combined dataset maintains the sort order of the data1 and data2 datasets - this is interleaving as described by Peter C. Try it yourself on some test data.
05-06-2014 02:52 AM
I should add a little more:
Because the starting data set is so large, it would be great if it is already a base SAS dataset. Then we can use dataset options to read only a part of it:
proc sort data=your.data( obs=20000000 firstobs=10000001 )
     out=set10e6;
  by userid time;  /* your sort keys */
run;
That should sort the 10 million rows from row 10,000,001 to row 20,000,000 (note that obs= names the last row to read, not a row count).
The macro language, or even CALL EXECUTE in a data step, could automate this selection of blocks of data "by row number".
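A sketch of that automation (untested; the macro name, block size, and block count are assumptions - in practice &nblocks would be derived from the table's row count, e.g. via %sysfunc(attrn)):

%macro blocksort(lib=your, ds=data, blocksize=50000000, nblocks=20);
  %local i first last;
  %do i = 1 %to &nblocks;
    %let first = %eval((&i - 1) * &blocksize + 1);
    %let last  = %eval(&i * &blocksize);
    /* obs= is the last row number to read, not a count */
    proc sort data=&lib..&ds( firstobs=&first obs=&last ) out=blk&i;
      by userid time;
    run;
  %end;
  /* interleave the sorted blocks */
  data &lib..sorted;
    set %do i = 1 %to &nblocks; blk&i %end; ;
    by userid time;
  run;
%mend blocksort;
%blocksort()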
If you knew all the values of one of those sort keys, even better performance would be available.
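For instance (a sketch; the key ranges are hypothetical), the blocks could be cut on user ID with WHERE= dataset options instead of row numbers. Unlike row-number blocks, key-range blocks are disjoint in sort order, so the sorted pieces can simply be concatenated in range order, with no interleaving step:

proc sort data=your.data( where=(userid <= 50000000) ) out=blk1;
  by userid time;
run;
proc sort data=your.data( where=(50000000 < userid) ) out=blk2;
  by userid time;
run;

data your.sorted;
  set blk1 blk2;   /* ranges are disjoint and in order: no BY needed */
run;

Each WHERE= pass still reads the whole table, but every individual sort handles a much smaller piece.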
05-12-2014 12:07 PM
My solution is similar to the previous ones, but it is applicable only if you are lucky enough that your dataset is sorted by date/time (not by id and then by time, as you require). Since you have web-log data, that is quite likely... In the following example it is enough if the data is grouped by day.
I also assume you don't need a raw sort, but some kind of aggregation.
proc summary data=in nway;
  by datetime;   /* input assumed sorted (or grouped) by datetime */
  class id;
  format datetime datetime7.;
  output out=out sum=;
run;
Now you have data grouped by day and id. Hopefully it is smaller, and you can process it. You can control the aggregation level with the format statement.
Everyone: Is this really faster - dividing a dataset, sorting the pieces separately, then interleaving? This is what almost all sort algorithms are supposed to do internally anyway (it is called "merge sort").
05-12-2014 01:16 PM
The big problem in SAS sorting is the sortwork area. It demands at least twice the original file size, and perhaps 3 or 4 times, and that is before considering whether the sort completes in a reasonable time.
proc summary uses memory arrays to collect its stats - providing a different constraint. Despite that, it provides what I think is the simplest approach.
As I (weakly) demonstrated earlier, proc summary can be executed in hadoop-style blocks. (I should have extended the demo to show consolidation / summarization / rollup of the blocks.)
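For completeness, a sketch of that consolidation step (assuming the day/id dataset out from the earlier proc summary; this is valid for additive statistics such as SUM, while a MEAN would have to be recomputed from summed numerators and counts):

/* second pass: roll the day/id summaries up to one row per id */
proc summary data=out(drop=_type_ _freq_) nway;
  class id;
  output out=rollup sum=;
run;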
05-13-2014 05:51 AM
I know you know it, but still want to clarify: the example above also executes in blocks, because of the BY statement.
If daily data does not fit into memory, just change the format. Flexible.
05-05-2014 12:02 PM
The .lck file is the temporary file SAS writes under a working name; only when the step finishes does SAS delete the original dataset and rename the .lck file to the definitive name when overwriting.
As the sort process was in the merge stage, there must also be some #utl utility files in a saswork directory.
You cannot restart this process, because no restart is defined for PROC SORT. As simple, and possibly as frustrating, as it is: start again from your last known correct state.
Possibly you could use checkpoint/restart in your code: SAS(R) 9.3 Language Reference: Concepts, Second Edition. The restart actions can be handed to the operations guys.
Still, it is not possible to restart somewhere in the middle of an executing proc.
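For reference, a sketch of how checkpoint mode is switched on at SAS invocation (per the 9.3 Language Reference cited above; the program name and libref are hypothetical, and the exact options should be verified against that documentation):

sas bigjob.sas -stepchkpt -stepchkptlib chkpt

After a failure, rerun with restart mode so steps that already completed are skipped:

sas bigjob.sas -stepchkpt -stepchkptlib chkpt -steprestart

As noted, this only skips whole completed steps; it cannot resume inside a partially finished PROC SORT.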
05-06-2014 02:30 AM
Going further into details: I would avoid using TAGSORT. SAS(R) 9.4 Companion for UNIX Environments, Third Edition (SORT Procedure: UNIX). It is single-threaded, normally a last resort when you have disk-space problems. In a one-off run it will do, but when you need to run the job more often there are smarter approaches.
Better tuning of your IO system (ask your admin) will help, as will splitting the too-big dataset into smaller portions and doing the merge (as described by Peter C / SASkiwi).
If you still have problems and there is also budget, you could do some investigation of the mentioned host sort SyncSort.
05-06-2014 07:26 AM
Using the following code would be better for sorting a large table:
data F M;
  set have;   /* your large table */
  if sex="F" then output F;
  else if sex="M" then output M;
run;
After that :
proc append ....
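Completing that pattern (a sketch; the BY variables are taken from the earlier posts, and sex as split variable is only illustrative): sort each subset on its own, then append the pieces in key order. Because sex is constant within each subset, the appended table comes out ordered by the split variable first, then by the BY variables:

proc sort data=F; by userid time; run;
proc sort data=M; by userid time; run;

proc append base=sorted data=F; run;
proc append base=sorted data=M; run;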
05-06-2014 08:01 AM
@Xia, nice approach - it makes the data processing easier. With some G's (10**9) of input records, and possibly stream processing since the source is internet banners (click analysis), I would have expected you to go to the hash-object approach. That can be an interesting way to solve some kinds of problems. I think Abeck19 is not that far along yet; we risk losing him.
05-06-2014 08:21 AM
Thanks. Actually I took this idea from someone (I don't remember who).
About the hash table you mentioned: it is easy, and the hash object has a built-in way to sort data, but it is not suitable for this situation (a large table) - you know, no one has that much memory.
data _null_;
  if 0 then set sashelp.class;
  declare hash h(dataset:'sashelp.class', ordered:'a', multidata:'y');
  h.definekey('sex');
  h.definedata(all:'y');
  h.definedone();
  h.output(dataset:'want');
  stop;
run;