JJP1
Pyrite | Level 9

Hi,

Would you please help with how we can automate code for the steps below?

1. Get the number of records from a SAS dataset.

2. Split the dataset into 8 jobs based on the number of records (approximately 1 million each), sort the split datasets by the key, and run these 8 jobs in parallel.

3. Create a step that waits and checks whether these split datasets were sorted successfully, so that we can proceed to the next step.

1 ACCEPTED SOLUTION

Accepted Solutions
Patrick
Opal | Level 21

@JJP1 

If you read what @Kurt_Bremser writes in this thread, you'll understand that the manual approach you want to take is only suitable for rare, exceptional cases. What's your reason for splitting up the source table?

 

If space restrictions in WORK/UTILLOC are the issue and the sort keys fit into memory, then one coding alternative is a hash table.

In my environment, I get comparable run times for all three sort steps with the sample data below.

options fullstimer;
data have(compress=yes);
  array bigvars{30} $100. (30*'A bbb cccc');
  do key1=8000 to 1 by -1;
    key2=rand('integer', 100);
    do key3=1000 to 1 by -1;
      output;
    end;
  end;
  stop;
run;

data want_hashsort(sortedby=key1 key2 key3 compress=yes);
  if _n_=1 then
    do;
      /* load only the sort keys plus each observation's row
         number into an ordered hash object */
      length _rownum 8;
      dcl hash h1(multidata:'y', ordered:'y');
      dcl hiter hh1('h1');
      h1.defineKey('key1','key2','key3');
      h1.defineData('_rownum');
      h1.defineDone();

      do while(not _last);
        set have(keep=key1 key2 key3) end=_last;
        _rownum+1;
        _rc = h1.add();
      end;
    end;

  /* iterate through the hash in key order and fetch the full
     observations via direct (POINT=) access */
  _rc = hh1.first();
  do while (_rc = 0);
    set have point=_rownum;
    output;
    _rc = hh1.next();
  end;
  stop;
run;

proc datasets lib=work nolist nowarn;
  delete want_:;
  run;
quit;

proc sort data=have out=want_procsort1(compress=yes);
  by key1 key2 key3;
run;

proc datasets lib=work nolist nowarn;
  delete want_:;
  run;
quit;

proc sort data=have out=want_procsort2(compress=yes) tagsort;
  by key1 key2 key3;
run;

proc datasets lib=work nolist nowarn;
  delete have want_:;
  run;
quit;

Real times:

  • Hash sort:           1:04
  • PROC SORT:           1:16
  • PROC SORT (TAGSORT): 0:53

 

From what I understand, TAGSORT actually does something pretty similar to what the hash approach above uses, so TAGSORT may be the way to go if work space is the issue, even though the sort is then no longer threaded.


34 REPLIES
ChrisNZ
Tourmaline | Level 20

Unless the 8 jobs can do their sort in memory or use their own disks, this could be much slower than a single sort (or even successive sorts), as the jobs will be competing for the same disk resources.

JJP1
Pyrite | Level 9

Thanks. Would you please help with the coding part, where we split the dataset (which has 800 million+ records)

into subsets of 1 million records each (8 subsets) and sort them simultaneously, so that we can reduce the time taken?

How can we approach this, please? I was actually using the split1 macro, which I found in a SAS paper, to split the dataset into a number of subsets.

But we need to sort each split dataset simultaneously, as soon as it gets created; this is the part where I am stuck. I also need to create a wait step that waits until these datasets are complete.

Patrick
Opal | Level 21

@JJP1 

Why don't you just go for a multi-threaded sort? That will likely perform better than anything you can manually code.

https://go.documentation.sas.com/?docsetId=proc&docsetTarget=p0guut2xk8yz2yn17ibn9nwcyx8v.htm&docset... 
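For reference, a minimal sketch of what enabling the threaded sort looks like (THREADS, CPUCOUNT and SORTSIZE are standard system options; the dataset names and the 4G value are placeholders to tune for your server):

/* enable threading and give PROC SORT generous memory */
options threads cpucount=actual sortsize=4G;

proc sort data=mylib.big_table out=mylib.big_table_sorted;  /* names are placeholders */
  by key1 key2 key3;
run;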

 

If that's still not good enough, then investigate how UTILLOC is assigned and possibly spread it out over multiple disks (for better disk I/O).

 

And last but not least: consider creating the source and sorted target tables using the SPDE engine, and also make sure that the data chunks and SPDEUTILLOC get spread out over multiple disks.
https://support.sas.com/documentation/cdl/en/engspde/69752/HTML/default/viewer.htm#p14iytucvmskdtn1q... 
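As an illustration of that setup (all paths, the library name, and the PARTSIZE value are hypothetical), an SPDE library spread over several disks could look like this:

/* SPDE library with data partitions striped over three disks
   (paths and names hypothetical) */
libname speedy spde '/disk0/meta'
  datapath=('/disk1/data' '/disk2/data' '/disk3/data')
  partsize=2G;

/* SPDEUTILLOC (a start-up/config option) can point the engine's
   utility files at yet another disk, e.g. -spdeutilloc /disk4/util */

proc sort data=speedy.have out=speedy.want;
  by key1 key2 key3;
run;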

 

Kurt_Bremser
Super User

You need to fit your computing and storage power to the size of your data. SAS sorting tools (proc sort, sorting in SQL, implicit sorting in SPDE) are already multi-threaded, and can use proper infrastructure if it's there. If the infrastructure is not there, you can't magically create better performance. Your split/sort/merge will most probably be worse than what you have now.

 

Separate WORK and UTILLOC. Use striped, high-quality SSDs for these volumes. Have enough cores.
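To illustrate (both are SAS invocation options; the paths here are purely hypothetical), a sasv9.cfg excerpt separating the two might look like:

/* WORK and UTILLOC on separate striped SSD volumes (paths illustrative) */
-work    /ssd_vol1/saswork
-utilloc /ssd_vol2/sasutil

With UTILLOC on its own volume, sort utility files no longer compete with WORK datasets for the same disk I/O.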

 

If you want to split, do not run the sorts in parallel, run them one after another, to avoid contention on the disks (especially if you still use spinning metal). Write that first without macro code, see if it helps at all, and once that is confirmed, make it dynamic.
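A minimal non-macro sketch of that advice (dataset names and the cutoff value are hypothetical): split on ranges of the leading sort key, sort the slices one after another, and concatenate; because the ranges are disjoint and in order, no interleave is needed.

/* split on the leading sort key (names and cutoff hypothetical) */
data slice1 slice2;
  set big.have;
  if key1 <= 4000 then output slice1;
  else output slice2;
run;

/* sort the slices sequentially, not in parallel */
proc sort data=slice1; by key1 key2 key3; run;
proc sort data=slice2; by key1 key2 key3; run;

/* ranges are disjoint and ordered, so plain concatenation suffices */
data big.want;
  set slice1 slice2;
run;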

JJP1
Pyrite | Level 9

OK, thank you. Would you please help with how to create a wait step that waits until the sort operation completes?

 

JJP1
Pyrite | Level 9

Yes, I have decided to run it manually, without any macros. I cannot run it right now, as we are facing a slowness issue on our SAS server,

so I am just preparing to do this whenever the environment is back and stable.

I am aware of the split-and-sort process (I mean writing the code),

but I don't know how to write code that waits until the sorted datasets are complete.

 

Also, sorry to ask again, but I am not clear on the statement below; please help:

 

"Separate WORK and UTILLOC. Use striped, high-quality SSDs for these volumes. Have enough cores."

 

How can we follow that approach?

1. Kindly confirm: will the approach you are suggesting be better than what I posted in this thread (split, sort, and merge in parallel)?

Kurt_Bremser
Super User

When you run the sorts in succession (e.g. in a macro %do loop), you can interleave/concatenate the resulting datasets as soon as the loop is finished; no need for a special wait.
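A sketch of that pattern (the slice datasets and their count are hypothetical placeholders): because the %do loop generates the sorts strictly one after another, the final interleave step cannot start before every sort has finished, so no explicit wait is needed.

%macro sort_slices(n=8);
  %local i;
  /* sort each slice in succession (slice datasets are placeholders) */
  %do i = 1 %to &n;
    proc sort data=work.slice&i;
      by key1 key2 key3;
    run;
  %end;

  /* runs only after all &n sorts above have completed */
  data work.want;
    set %do i = 1 %to &n; work.slice&i %end;;
    by key1 key2 key3;  /* interleave the sorted slices */
  run;
%mend sort_slices;

%sort_slices(n=8)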

 

Having infrastructure that lets you use simple code without any artificial optimizations is always to be preferred. Especially if the problem occurs repeatedly.

 

In my ~1000 SAS batch jobs, I have exactly one where I split a large dataset into subsets, and that was not because of performance, but because the required sort would use so much of UTILLOC that other, concurrent jobs would not have enough left. Mind that the overall performance is considerably worse than if I ran it the usual way; but then I would only be able to test it through the main batch production user, which has no quotas defined. Something you do not want, period.

So I do this

  • dynamically split along the contents of the first original "by" variable
  • sort individually, and run proc means against each subset, using only the second and following "by" variables
  • stack the result subsets in sorted order, thereby emulating a sort on the top-level "by" variable

IIRC, I do not even use macro code. I create a control dataset with the distinct values of the top-level by variable, and use call execute repeatedly to work through the data one value at a time.
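A sketch of that CALL EXECUTE pattern (table and variable names are hypothetical, topvar is assumed to be character, and the proc means part is omitted for brevity):

/* control dataset: distinct values of the top-level by variable */
proc sql;
  create table control as
  select distinct topvar
  from big.have
  order by topvar;
quit;

/* generate one sort per value; the generated steps run after this
   data step ends, one after another, in topvar order */
data _null_;
  set control;
  call execute(cats(
    'proc sort data=big.have(where=(topvar=', quote(strip(topvar)),
    ')) out=work.part', put(_n_, z4.), '; by key2 key3; run;'));
run;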

ChrisNZ
Tourmaline | Level 20
Kurt, for that massive sort, have you tried the TAGSORT option instead? Not saying it will be better, just wondering.
Kurt_Bremser
Super User

@ChrisNZ wrote:
Kurt, for that massive sort, have you tried the TAGSORT option instead? Not saying it will be better, just wondering.

I often use TAGSORT for datasets that produce over-large utility files (because of the compress option), but in this case the "big chunk" was so large that I already had to "dance" across multiple workspaces just to create it. And when a new hierarchy level with considerable cardinality was added that prevented me from using class in proc means, the sort would be either too resource-consuming (usual sort) or too time-consuming (TAGSORT), so I opted for the split, where I could delete the master dataset after splitting, and the individual steps ran in a decent timeframe. Mind that I don't create the big dataset any longer; I now create the split datasets in the first place.

 

If this happened to me more than once, I would surely decide to increase my UTILLOC and workspaces.

 

As the resulting code is far from "elegant" and therefore violates several of my Maxims, it is a clear exception to the rules. And it was one of the things I went through in detail when I started introducing my successor on the job, so he won't have a bad surprise the first time he has to do maintenance on it.

ChrisNZ
Tourmaline | Level 20
Thanks for the insight. And I presume you split with a where clause on the sort key rather than by obs, so the sorted data sets are ready to use?
Kurt_Bremser
Super User

@ChrisNZ wrote:
Thanks for the insight. And I presume you split with a where clause on the sort key rather than by obs, so the sorted data sets are ready to use?

Yes. That way the final step after the summaries is a very simple data step that concatenates everything, without even a by for interleaving needed.

ChrisNZ
Tourmaline | Level 20
Yes. SET BY really kills speed compared to SET. Proc append rather than a data step might cut a bit of time too. Thanks again for sharing.
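For what it's worth, the PROC APPEND variant could be as simple as this sketch (dataset names hypothetical); APPEND only writes the incoming rows instead of rereading the base data:

proc append base=work.final data=work.part0001; run;
proc append base=work.final data=work.part0002; run;
/* ...one APPEND per subset, issued in sorted key order */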