DS2 is a proprietary object-oriented programming language that is appropriate for use with advanced data manipulation and preparation, data modeling applications, and computationally complex code. It is a procedural language that contains variables and scope, methods, packages, and parallel programming statements. Methods and packages allow for modularity and data encapsulation similar to classes and functions in other languages such as Python. Included with Base SAS, DS2 intersects with traditional DATA Step. While traditional DATA Step supports only numeric and character data types, DS2 provides support for a wide range of additional data types. DS2 also provides support for ANSI SQL types, programming structure elements, and user-defined methods and packages. DS2 enables you to embed FedSQL on the SET statement, thus blending DATA Step and PROC SQL into one powerful block of data manipulation code. The DS2 language shares many core features and language elements with traditional DATA Step including statements, functions, and formats (SAS and user-defined).
With the SAS® In-Database Code Accelerator for Hadoop, you can publish and execute a DS2 thread and data program in parallel inside a Hadoop cluster using the SAS® Embedded Process. Thread programs can include data transformations, computationally complex code, BY group processing, and many other language elements. DS2 threads are stored as SAS data sets in libraries. Those stored in permanent libraries can be reused. One of the primary benefits of using the SAS® In-Database Code Accelerator for Hadoop is the elimination of data movement. Instead of extracting data from your Hadoop cluster and executing code on your SAS server or a local machine, the DS2 thread and data programs are pushed down inside Hadoop for execution. As long as the source and target data reside in Hadoop, no data movement will occur.
While the capabilities of DS2 extend beyond those of traditional DATA Step, there are some features and functionalities that are not currently supported. In my previous article, I demonstrated how to simulate the LAG and DIF functions in DS2. DS2 supports subsetting IF statements, IF-THEN/ELSE, and IF-THEN/DO statements, but it does not explicitly support the IF-THEN/DELETE statement. In this article, I’ll provide a couple of simple techniques to simulate IF-THEN/DELETE logic. Follow the community for future articles on other DS2 considerations.
IF-THEN/DELETE in traditional DATA Step
The following example demonstrates the use of the IF-THEN/DELETE statement in traditional DATA Step.
data work.cars_out; set work.cars; if origin = 'USA' then delete; run;
IF-THEN/DELETE in Base DS2
The first technique to simulate the IF-THEN/DELETE statement uses logic comparable to a subsetting IF statement, namely an IF-THEN/RETURN statement. In the following example, the desired output in the WORK.CARS_OUT data set is all observations where the variable ‘origin’ has a value other than ‘USA’. To accomplish this, place appropriate logic in the IF statement that will evaluate to a nonzero value and cause the expression to be true. The statement if origin = 'USA' is true since the input data set contains observations where origin = 'USA'. As a result, the subsequent RETURN statement will return processing back to the RUN method to process the next observation. The implicit output at the end of the RUN method isn’t processed and no data is written to WORK.CARS_OUT. Once the IF statement evaluates to a zero value causing the expression to be false (if origin ≠ 'USA'), the RETURN statement will be ignored, the remainder of the code will execute including the implicit output at the end of the RUN method, and therefore output will be written to WORK.CARS_OUT. This entire process will occur across all observations in the input data set since the RUN method has the same automatic, implicit looping feature as traditional DATA Step.
proc ds2; data work.cars_out / overwrite=yes; method run(); set work.cars; if origin = 'USA' then return; end; enddata; run; quit;
An alternative technique uses a nested IF statement where the nested IF statement evaluates to a zero value causing the expression to be false. In the following example, if (0=1) is false since (0 ≠ 1). Consequently, if origin = 'USA' is also false and WORK.CARS_OUT will contain all observations where the variable ‘origin’ has a value other than ‘USA’.
proc ds2; data work.cars_out / overwrite=yes; method run(); set work.cars; if origin = 'USA' then if (0=1); end; enddata; run; quit;
Incorporating DS2 with the SAS® In-Database Code Accelerator for Hadoop
Either of these techniques can be placed in a DS2 thread for use with the SAS® In-Database Code Accelerator for Hadoop. The thread is then called from a DS2 DATA program. In this example, the source and destination data are read from and written to a Hive location referenced by the libref MYHIVE. This is required for the code to execute inside the Hadoop cluster. The thread can be stored in a temporary library (WORK) or a permanent library (MYHIVE or other location) for reuse.
proc ds2; thread work.delete / overwrite=yes; method run(); set myhive.cars; if origin = 'USA' then return; end; endthread; data myhive.cars_out (overwrite=yes); dcl thread work.delete del; method run(); set from del; end; enddata; run; quit;
proc ds2; thread work.delete / overwrite=yes; method run(); set myhive.cars; if origin = 'USA' then if (0=1); end; endthread; data myhive.cars_out (overwrite=yes); dcl thread work.delete del; method run(); set from del; end; enddata; run; quit;
For further information regarding DS2, refer to the SAS® 9.4 DS2 Language Reference documentation. For further information regarding the SAS® In-Database Code Accelerator for Hadoop, refer to the SAS® 9.4 In-Database Products User's Guide.
And don’t forget to follow the Data Management section of the SAS Communities Library (Click Subscribe in the pink-shaded bar of the section) for more articles on DS2.