BookmarkSubscribeRSS Feed

CAS answers to 4 common data manipulation tasks – Part 2 – SORT

Started ‎07-09-2019 by
Modified ‎07-09-2019 by
Views 8,906

This is part 2 of my previous article on appending data. Now we’ll focus on sorting data.

Sorting data in CAS

PROC SORT is probably one of the utility procedures the most used by SAS programmers. Who has not complained 😊 about having to sort the data prior to run a task that requires a BY statement? I don’t need a poll to find out.

 

Sorting data in the SAS world (and using PROC SORT) is helpful and sometimes mandatory in the following circumstances (list might not be complete):

  • Mandatory when you need to process the data with a BY statement
  • Helpful to detect top/bottom values
  • Helpful to de-duplicate records or by-groups

What about CAS?

 

2 main messages.

Storing a CAS table in a sorted way IS NOT PERMITTED.

The order of rows in a CAS table is not predictable. Thus, you cannot create a CAS table in a sorted way. But you can partition it (distribute the data across the CAS workers according to by variables using the GROUPBY option). Then, within partitions, you can sort the data according to additional variables (using the ORDERBY option) which could help in some situations.

 

If you try to sort a CAS table and store it in CAS, you get the following error message:

 

 73         proc sort data=casdm.bigprdsale out=casdm.bigprdsale_sorted ;
 74            by actual ;
 75         run ;
 
 ERROR: Operation is not supported when both the source and destination include CAS libname references.
 ERROR: An error has occurred.

 

Similar message when you use the FedSQL CREATE TABLE … ORDER BY statement.

 

Does that mean that CAS is not able to sort data? No. Of course CAS can sort data. If you output the result of your sort procedure in SAS, then you have:

 

 73         proc sort data=casdm.bigprdsale out=work.bigprdsale_sorted ;
 74            by actual ;
 75         run ;
 
 NOTE: Sorting was performed by the data source.
 NOTE: There were 1440 observations read from the data set CASDM.BIGPRDSALE.
 NOTE: The data set WORK.BIGPRDSALE_SORTED has 1440 observations and 10 variables.
 NOTE: PROCEDURE SORT used (Total process time):
       real time           0.13 seconds
       cpu time            0.02 seconds

 

And you can see that CAS sorted the data.

 

But do you really need a sorted table?

Sorting data in CAS IS NOT REQUIRED before running a CAS operation that uses by-groups.

With Viya and CAS, you no longer need to sort input tables before running a task with a BY statement. CAS will process the by-groups automatically on the fly for you.

 

So, you can directly write and run your magical DATA Step MERGE BY without having to care about sorting. No more PROC SORT before your DATA Step. Behind the scenes, some partitioning will occur on the fly to organize the data appropriately in order to handle the by-groups. One can anticipate this by pre-partitioning the data before.

 

What if you were using PROC SORT to detect top/bottom values?

 

You can use the fetch CAS action with the sortedBy and the to options to identify top N (or bottom N by reversing the order) records according to the order of one or several variables. Unfortunately, the result is not a CAS table (fetch actually means to pull the result on the client) so use this with caution and reasonable result sets.

 

The following example identifies the top 3 records with the highest actual value, fetches them on the client before moving them back to CAS using the saveresult CASL statement. This is not ideal but that should work fine in most cases.

 

proc cas ;
   table.fetch result=r / table={caslib="dm" name="bigprdsale"} to=3 sortby={{name="actual" order="descending"}} ;
   sessionProp.setSessOpt / caslib="dm" ;
   saveresult r casout="top3" ;
run ;
quit ;

 

What if you were using PROC SORT to de-duplicate records?

 

There are other alternatives that we’ll discuss in the next part of the blog.

Takeaways

  • Storing a CAS table in a sorted way is not permitted
  • Sorting data in CAS is not required before running a CAS operation that uses by-groups
  • PROC SORT cannot sort a CAS table and output the result to CAS
  • CAS can sort the data if the result is output to a non-CAS destination
  • Best practice: use dedicated CAS actions to achieve top/bottom values detection or de-duplication of records
  • Mistake to avoid: purposely pull the CAS table on the client (SPRE or SAS) to sort it to achieve special processing

Thanks for reading and stay tuned for my next article on de-duplicating data.

Comments

What can I do about the following error if I want to data step merge without sorting?

 

ERROR: Input data sets cannot be combined because they have different
collating sequences (SORTSEQ).

Hello.

Do you face this problem with data stored in CAS? 

 No, one was in hadoop and one was in cas.  That was the problem. Thanks. 

I am trying to rewrite this code to run in cas, but I don't know what to do about the sort descending.  Could you help?

 

proc sort data=mydata;
by a b c descending d e f descending g h;

quit;


data mydata2;
set mydata;
by a b c;
if b ne c then do;
if first.c ne 1 then delete;
end;
run;

Hello,

 

DESCENDING is now supported in SAS Viya 3.5 with some caveats (https://go.documentation.sas.com/?cdcId=pgmsascdc&cdcVersion=9.4_3.5&docsetId=lestmtsref&docsetTarge...).

 

You can also check Steven Sober's blogs on this topic:

https://blogs.sas.com/content/sgf/2018/04/18/how-to-simulate-descending-by-variables-in-data-step-co...

https://blogs.sas.com/content/sgf/2019/10/10/how-to-emulate-data-step-descending-by-statements-in-sa...

 

One of the ideas is to create a view with the opposite value and sort ascending on it.

 

Nicolas.

Version history
Last update:
‎07-09-2019 05:18 AM
Updated by:
Contributors

SAS INNOVATE 2024

Innovate_SAS_Blue.png

 

Registration is open! SAS is returning to Vegas for an AI and analytics experience like no other! Whether you're an executive, manager, end user or SAS partner, SAS Innovate is designed for everyone on your team. Register for just $495 by 12/31/2023.

If you are interested in speaking, there is still time to submit a session idea. More details are posted on the website. 

Register now!

Free course: Data Literacy Essentials

Data Literacy is for all, even absolute beginners. Jump on board with this free e-learning  and boost your career prospects.

Get Started

Article Tags