JJP1
Pyrite | Level 9

Hi,

Would you please help with how we can automate code for the steps below?

1. Get the number of records from a SAS dataset.

2. Split the dataset into 8 jobs based on the number of records (approximately 1 million each), sort the split datasets by the key, and run these 8 jobs in parallel.

3. Create a step that waits and checks whether these split datasets were sorted successfully, so that we can proceed to the next step.

1 ACCEPTED SOLUTION

Accepted Solutions
Patrick
Opal | Level 21

@JJP1 

If you read what @Kurt_Bremser writes in this thread, you'll understand that the manual approach you want to take is only suitable for rare, exceptional cases. What's your reason for splitting up the source table?

 

If space restrictions in WORK/UTILLOC are the issue and the sort keys fit into memory, then one coding alternative is a hash table.

In my environment, I get comparable run times for all three sort steps with the sample data below.

options fullstimer;
data have(compress=yes);
  array bigvars{30} $100. (30*'A bbb cccc');
  do key1=8000 to 1 by -1;
    key2=rand('integer', 100);
    do key3=1000 to 1 by -1;
      output;
    end;
  end;
  stop;
run;

data want_hashsort(sortedby=key1 key2 key3 compress=yes);
  if _n_=1 then
    do;
      /* load only the sort keys plus each observation's row
         number into an ordered hash object */
      length _rownum 8;
      dcl hash h1(multidata:'y', ordered:'y');
      dcl hiter hh1('h1');
      h1.defineKey('key1','key2','key3');
      h1.defineData('_rownum');
      h1.defineDone();

      do while(not _last);
        set have(keep=key1 key2 key3) end=_last;
        _rownum+1;
        _rc = h1.add();
      end;
    end;

  /* iterate through the hash in key order and fetch the full
     observations via direct (POINT=) access */
  _rc = hh1.first();
  do while (_rc = 0);
    set have point=_rownum;
    output;
    _rc = hh1.next();
  end;
  stop;
run;

proc datasets lib=work nolist nowarn;
  delete want_:;
  run;
quit;

proc sort data=have out=want_procsort1(compress=yes);
  by key1 key2 key3;
run;

proc datasets lib=work nolist nowarn;
  delete want_:;
  run;
quit;

proc sort data=have out=want_procsort2(compress=yes) tagsort;
  by key1 key2 key3;
run;

proc datasets lib=work nolist nowarn;
  delete have want_:;
  run;
quit;

Real times:

  • Hash sort:           1:04
  • PROC SORT:           1:16
  • PROC SORT (TAGSORT): 0:53

 

From what I understand, TAGSORT actually does something pretty similar to what the hash approach above uses, so TAGSORT may be the way to go if work space is the issue, even though the sort is then no longer threaded.


34 REPLIES
ChrisNZ
Tourmaline | Level 20

Unless the 8 jobs can do their sort in memory or use their own disks, this could be much slower than a single sort (or even successive sorts), as the jobs will be competing for the same disk resources.

JJP1
Pyrite | Level 9

Thanks. Would you please help with the coding part, where we split the dataset (which has 800 million+ records)

into subsets of 1 million records each (8 subsets) and sort them simultaneously, so that we can reduce the time taken?

How can we approach this, please? I was actually using the split1 macro, which I found in a SAS paper, to split the dataset into a number of subsets.

But we need to sort each split dataset simultaneously, as soon as it gets created; this is the part where I am stuck. I also need to create a wait step that waits until these datasets are complete.

Patrick
Opal | Level 21

@JJP1 

Why don't you just go for a multi-threaded sort? That will likely perform better than anything you can manually code.

https://go.documentation.sas.com/?docsetId=proc&docsetTarget=p0guut2xk8yz2yn17ibn9nwcyx8v.htm&docset... 
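For reference, a minimal sketch of what enabling the threaded sort looks like (THREADS, CPUCOUNT and SORTSIZE are standard system options; the dataset names and the 4G value are placeholders to tune for your server):

/* enable threading and give PROC SORT generous memory */
options threads cpucount=actual sortsize=4G;

proc sort data=mylib.big_table out=mylib.big_table_sorted;  /* names are placeholders */
  by key1 key2 key3;
run;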

 

If that's still not good enough, then investigate how UTILLOC is assigned and possibly spread it out over multiple disks (for better disk I/O).

 

And last but not least: consider creating the source and sorted target tables using the SPDE engine, and also make sure that the data chunks and SPDEUTILLOC get spread out over multiple disks.
https://support.sas.com/documentation/cdl/en/engspde/69752/HTML/default/viewer.htm#p14iytucvmskdtn1q... 
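As an illustration of that setup (all paths, the library name, and the PARTSIZE value are hypothetical), an SPDE library spread over several disks could look like this:

/* SPDE library with data partitions striped over three disks
   (paths and names hypothetical) */
libname speedy spde '/disk0/meta'
  datapath=('/disk1/data' '/disk2/data' '/disk3/data')
  partsize=2G;

/* SPDEUTILLOC (a start-up/config option) can point the engine's
   utility files at yet another disk, e.g. -spdeutilloc /disk4/util */

proc sort data=speedy.have out=speedy.want;
  by key1 key2 key3;
run;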

 

Kurt_Bremser
Super User

You need to fit your computing and storage power to the size of your data. SAS sorting tools (proc sort, sorting in SQL, implicit sorting in SPDE) are already multi-threaded, and can use proper infrastructure if it's there. If the infrastructure is not there, you can't magically create better performance. Your split/sort/merge will most probably be worse than what you have now.

 

Separate WORK and UTILLOC. Use striped, high-quality SSDs for these volumes. Have enough cores.
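To illustrate (both are SAS invocation options; the paths here are purely hypothetical), a sasv9.cfg excerpt separating the two might look like:

/* WORK and UTILLOC on separate striped SSD volumes (paths illustrative) */
-work    /ssd_vol1/saswork
-utilloc /ssd_vol2/sasutil

With UTILLOC on its own volume, sort utility files no longer compete with WORK datasets for the same disk I/O.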

 

If you want to split, do not run the sorts in parallel, run them one after another, to avoid contention on the disks (especially if you still use spinning metal). Write that first without macro code, see if it helps at all, and once that is confirmed, make it dynamic.
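A minimal non-macro sketch of that advice (dataset names and the cutoff value are hypothetical): split on ranges of the leading sort key, sort the slices one after another, and concatenate; because the ranges are disjoint and in order, no interleave is needed.

/* split on the leading sort key (names and cutoff hypothetical) */
data slice1 slice2;
  set big.have;
  if key1 <= 4000 then output slice1;
  else output slice2;
run;

/* sort the slices sequentially, not in parallel */
proc sort data=slice1; by key1 key2 key3; run;
proc sort data=slice2; by key1 key2 key3; run;

/* ranges are disjoint and ordered, so plain concatenation suffices */
data big.want;
  set slice1 slice2;
run;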

JJP1
Pyrite | Level 9

OK, thank you. Would you please help with how to create a wait step that waits until the sort operation completes?

 

JJP1
Pyrite | Level 9

Yes, I have decided to run it manually, without any macros. I cannot run it right now, as we are facing a slowness issue on our SAS server,

so I am just preparing to do this whenever the environment is back and stable.

I am aware of the split-and-sort process (I mean writing the code),

but I don't know how to write code that waits until the sorted datasets are complete.

 

Also, sorry to ask again, but I am not clear on the statement below; please help:

 

"Separate WORK and UTILLOC. Use striped, high-quality SSDs for these volumes. Have enough cores."

 

How can we follow that approach?

1. Kindly confirm: will the approach you are suggesting be better than what I posted in this thread (split, sort, and merge in parallel)?

Kurt_Bremser
Super User

When you run the sorts in succession (e.g. in a macro %do loop), you can interleave/concatenate the resulting datasets as soon as the loop is finished; no need for a special wait.
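A sketch of that pattern (the slice datasets and their count are hypothetical placeholders): because the %do loop generates the sorts strictly one after another, the final interleave step cannot start before every sort has finished, so no explicit wait is needed.

%macro sort_slices(n=8);
  %local i;
  /* sort each slice in succession (slice datasets are placeholders) */
  %do i = 1 %to &n;
    proc sort data=work.slice&i;
      by key1 key2 key3;
    run;
  %end;

  /* runs only after all &n sorts above have completed */
  data work.want;
    set %do i = 1 %to &n; work.slice&i %end;;
    by key1 key2 key3;  /* interleave the sorted slices */
  run;
%mend sort_slices;

%sort_slices(n=8)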

 

Having infrastructure that lets you use simple code without any artificial optimizations is always to be preferred. Especially if the problem occurs repeatedly.

 

In my ~1000 SAS batch jobs, I have exactly one where I split a large dataset into subsets, and that was not because of performance, but because the required sort would use so much of UTILLOC that other, concurrent jobs would not have enough left. Mind that the overall performance is considerably worse than if I ran it the usual way; but then I would only be able to test it through the main batch production user, which has no quotas defined. Something you do not want, period.

So I do this

  • dynamically split along the contents of the first original "by" variable
  • sort individually, and run proc means against each subset, using only the second and following "by" variables
  • stack the result subsets in sorted order, thereby emulating a sort on the top-level "by" variable

IIRC, I do not even use macro code. I create a control dataset with the distinct values of the top-level by variable, and use call execute repeatedly to work through the data one value at a time.
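A sketch of that CALL EXECUTE pattern (table and variable names are hypothetical, topvar is assumed to be character, and the proc means part is omitted for brevity):

/* control dataset: distinct values of the top-level by variable */
proc sql;
  create table control as
  select distinct topvar
  from big.have
  order by topvar;
quit;

/* generate one sort per value; the generated steps run after this
   data step ends, one after another, in topvar order */
data _null_;
  set control;
  call execute(cats(
    'proc sort data=big.have(where=(topvar=', quote(strip(topvar)),
    ')) out=work.part', put(_n_, z4.), '; by key2 key3; run;'));
run;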

ChrisNZ
Tourmaline | Level 20
Kurt, for that massive sort, have you tried the TAGSORT option instead? Not saying it will be better, just wondering.
Kurt_Bremser
Super User

@ChrisNZ wrote:
Kurt, for that massive sort, have you tried the TAGSORT option instead? Not saying it will be better, just wondering.

I often use TAGSORT for datasets that produce over-large utility files (because of the compress option), but in this case the "big chunk" was so large that I already had to "dance" across multiple workspaces just to create it. And when a new hierarchy level with considerable cardinality was added that prevented me from using class in proc means, the sort would be either too resource-consuming (usual sort) or too time-consuming (TAGSORT), so I opted for the split, where I could delete the master dataset after splitting, and the individual steps ran in a decent timeframe. Mind that I don't create the big dataset any longer; I now create the split datasets in the first place.

 

If this happened to me more than once, I would surely decide to increase my UTILLOC and workspaces.

 

As the resulting code is far from "elegant" and therefore violates several of my Maxims, it is a clear exception to the rules. And it was one of the things I went through in detail when I started introducing my successor on the job, so he won't have a bad surprise the first time he has to do maintenance on it.

ChrisNZ
Tourmaline | Level 20
Thanks for the insight. And I presume you split with a where clause on the sort key rather than by obs, so the sorted data sets are ready to use?
Kurt_Bremser
Super User

@ChrisNZ wrote:
Thanks for the insight. And I presume you split with a where clause on the sort key rather than by obs, so the sorted data sets are ready to use?

Yes. That way the final step after the summaries is a very simple data step that concatenates everything, without even a by for interleaving needed.

ChrisNZ
Tourmaline | Level 20
Yes. SET BY really kills speed compared to SET. Proc append rather than a data step might cut a bit of time too. Thanks again for sharing.
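For what it's worth, the PROC APPEND variant could be as simple as this sketch (dataset names hypothetical); APPEND only writes the incoming rows instead of rereading the base data:

proc append base=work.final data=work.part0001; run;
proc append base=work.final data=work.part0002; run;
/* ...one APPEND per subset, issued in sorted key order */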