DS2 is a proprietary object-oriented programming language that is appropriate for use with advanced data manipulation and preparation, data modeling applications, and computationally complex code. DS2 is a procedural language containing variables and scope, methods, packages, and parallel programming statements. Methods and packages allow for modularity and data encapsulation similar to classes and functions in other languages such as Python.
Included with Base SAS, DS2 intersects with traditional DATA Step. While traditional DATA Step supports only numeric and character data types, DS2 provides support for a wide range of additional data types. DS2 also provides support for ANSI SQL types, programming structure elements, and user-defined methods and packages. DS2 enables you to embed FedSQL on the SET statement, thus blending DATA Step and PROC SQL into one powerful block of data manipulation code. The DS2 language shares many core features and language elements with traditional DATA Step including statements, functions, and formats (SAS and user-defined).
With the SAS® In-Database Code Accelerator for Hadoop, you can publish and execute a DS2 thread and data program in parallel inside a Hadoop cluster using the SAS® Embedded Process. Thread programs can include data transformations, computationally complex code, BY group processing, and many other language elements. DS2 threads are stored as SAS data sets in libraries. Those stored in permanent libraries can be reused.
One of the primary benefits of using the SAS® In-Database Code Accelerator for Hadoop is it eliminates data movement. Instead of extracting data from your Hadoop cluster and executing code on your SAS server or a local machine, the DS2 thread and data programs are pushed down inside Hadoop for execution. As long as the source and target data reside in Hadoop, no data movement will occur.
While the capabilities of DS2 extend beyond those of traditional DATA Step, there are some features and functionalities that are not currently supported. I’ll focus on two of these functions, LAG and DIF, in this article. These functions are widely used in the banking and financial services sectors, particularly when analyzing transactional data. Although not currently supported by DS2, I’ll demonstrate how to emulate them and achieve the same results using a simple technique. Follow the community for future articles on other DS2 considerations.
LAG and DIF in traditional DATA Step
The following example demonstrates the use of LAG and DIF in traditional DATA Step.
data work.class_output;
set work.class;
agelag = lag(age);
agedif = dif(age);
heightlag = lag(height);
heightdif = dif(height);
run;
Since the LAG and DIF functions are not supported in DS2, the same results can be achieved using the RETAIN and OUTPUT statements. The RETAIN statement requires a global declaration and therefore must be placed prior to the RUN() method. Place the DIF variables and calculations prior to the output statement and the LAG variables and calculations after the output statement.
proc ds2;
data work.class_output (overwrite = yes);
dcl double agelag agedif heightlag heightdif;
retain agelag agedif heightlag heightdif;
method run();
set work.class;
agedif = age - agelag;
heightdif = height - heightlag;
output;
agelag = age;
heightlag = height;
end;
enddata;
run;
quit;
Incorporating DS2 with the SAS® In-Database Code Accelerator for Hadoop
This same technique can be applied in a DS2 thread for use with the SAS® In-Database Code Accelerator for Hadoop. The thread is then called from a DS2 DATA program. In this example, the source and destination data are read from and written to a Hive location referenced by the libref MYHIVE. This is required for the code to execute inside the Hadoop cluster. The thread can be stored in a temporary library (WORK) or a permanent library (MYHIVE or other location) for reuse.
proc ds2;
thread work.lagdif / overwrite=yes;
dcl double agelag agedif heightlag heightdif;
retain agelag agedif heightlag heightdif;
method run();
set myhive.class;
agedif = age - agelag;
heightdif = height - heightlag;
output;
agelag = age;
heightlag = height;
end;
endthread;
data myhive.class_output (overwrite=yes);
dcl thread work.lagdif ld;
method run();
set from ld;
end;
enddata;
run;
quit;
For further information regarding DS2, refer to the SAS® 9.4 DS2 Language Reference documentation. For further information regarding the SAS® In-Database Code Accelerator for Hadoop, refer to the SAS® 9.4 In-Database Products User's Guide.
And don’t forget to follow the Data Management section of the SAS Communities Library (Click Subscribe in the pink-shaded bar of the section) for more articles on DS2.
Genuine question: While interesting, isn't this of very limited use considering proc ds2's two main selling points are to use RDBMS data as is (and therefore process SQL data , which comes in random order) and to run multiple threads (which also brings data in unpredictable order)?
I have seen PROC DS2 presentations simulating lags within BY groups. Record order is preserved within group, but between group order would be unpredictable. In our environment (times series of thousands of days for thousand of stocks), that's a perfectly acceptable result, when using stock id as the BY variable.
So my question is: does this mean by groups are assigned to threads?
Registration is open! SAS is returning to Vegas for an AI and analytics experience like no other! Whether you're an executive, manager, end user or SAS partner, SAS Innovate is designed for everyone on your team. Register for just $495 by 12/31/2023.
If you are interested in speaking, there is still time to submit a session idea. More details are posted on the website.
Data Literacy is for all, even absolute beginners. Jump on board with this free e-learning and boost your career prospects.