DS2 is a proprietary object-oriented programming language that is appropriate for use with advanced data manipulation and preparation, data modeling applications, and computationally complex code. It is a procedural language that contains variables and scope, methods, packages, and parallel programming statements. Methods and packages allow for modularity and data encapsulation similar to classes and functions in other languages such as Python. Included with Base SAS, DS2 intersects with traditional DATA Step. While traditional DATA Step supports only numeric and character data types, DS2 provides support for a wide range of additional data types. DS2 also provides support for ANSI SQL types, programming structure elements, and user-defined methods and packages. DS2 enables you to embed FedSQL on the SET statement, thus blending DATA Step and PROC SQL into one powerful block of data manipulation code. The DS2 language shares many core features and language elements with traditional DATA Step including statements, functions, and formats (SAS and user-defined).
With the SAS® In-Database Code Accelerator for Hadoop, you can publish and execute a DS2 thread and data program in parallel inside a Hadoop cluster using the SAS® Embedded Process. Thread programs can include data transformations, computationally complex code, BY group processing, and many other language elements. DS2 threads are stored as SAS data sets in libraries. Those stored in permanent libraries can be reused. One of the primary benefits of using the SAS® In-Database Code Accelerator for Hadoop is the elimination of data movement. Instead of extracting data from your Hadoop cluster and executing code on your SAS server or a local machine, the DS2 thread and data programs are pushed down inside Hadoop for execution. As long as the source and target data reside in Hadoop, no data movement will occur.
While the capabilities of DS2 extend beyond those of traditional DATA Step, there are some features and functionalities that are not currently supported. In my previous article, I demonstrated how to emulate the LAG and DIF functions in DS2. DS2 supports subsetting IF statements, IF-THEN/ELSE, and IF-THEN/DO statements, but it does not explicitly support the IF-THEN/DELETE statement. In this article, I’ll provide a couple of simple techniques to emulate IF-THEN/DELETE logic. Follow the community for future articles on other DS2 considerations.
IF-THEN/DELETE in traditional DATA Step
The following example demonstrates the use of the IF-THEN/DELETE statement in traditional DATA Step.
data work.cars_out;
set work.cars;
if origin = 'USA' then delete;
run;
IF-THEN/DELETE in Base DS2
The first technique to emulate the IF-THEN/DELETE statement uses logic comparable to a subsetting IF statement, namely an IF-THEN/RETURN statement. In the following example, the desired output in the WORK.CARS_OUT data set is all observations where the variable ‘origin’ has a value other than ‘USA’. To accomplish this, place appropriate logic in the IF statement that will evaluate to a nonzero value and cause the expression to be true. The statement if origin = 'USA' is true since the input data set contains observations where origin = 'USA'. As a result, the subsequent RETURN statement will return processing back to the RUN method to process the next observation. The implicit output at the end of the RUN method isn’t processed and no data is written to WORK.CARS_OUT. Once the IF statement evaluates to a zero value causing the expression to be false (if origin ≠ 'USA'), the RETURN statement will be ignored, the remainder of the code will execute including the implicit output at the end of the RUN method, and therefore output will be written to WORK.CARS_OUT. This entire process will occur across all observations in the input data set since the RUN method has the same automatic, implicit looping feature as traditional DATA Step.
proc ds2;
data work.cars_out / overwrite=yes;
method run();
set work.cars;
if origin = 'USA' then return;
end;
enddata;
run;
quit;
An alternative technique uses a nested IF statement where the nested IF statement evaluates to a zero value causing the expression to be false. In the following example, if (0=1) is false since (0 ≠ 1). Consequently, if origin = 'USA' is also false and WORK.CARS_OUT will contain all observations where the variable ‘origin’ has a value other than ‘USA’.
proc ds2;
data work.cars_out / overwrite=yes;
method run();
set work.cars;
if origin = 'USA' then
if (0=1);
end;
enddata;
run;
quit;
Incorporating DS2 with the SAS® In-Database Code Accelerator for Hadoop
Either of these techniques can be placed in a DS2 thread for use with the SAS® In-Database Code Accelerator for Hadoop. The thread is then called from a DS2 DATA program. In this example, the source and destination data are read from and written to a Hive location referenced by the libref MYHIVE. This is required for the code to execute inside the Hadoop cluster. The thread can be stored in a temporary library (WORK) or a permanent library (MYHIVE or other location) for reuse.
proc ds2;
thread work.delete / overwrite=yes;
method run();
set myhive.cars;
if origin = 'USA' then return;
end;
endthread;
data myhive.cars_out (overwrite=yes);
dcl thread work.delete del;
method run();
set from del;
end;
enddata;
run;
quit;
OR
proc ds2;
thread work.delete / overwrite=yes;
method run();
set myhive.cars;
if origin = 'USA' then
if (0=1);
end;
endthread;
data myhive.cars_out (overwrite=yes);
dcl thread work.delete del;
method run();
set from del;
end;
enddata;
run;
quit;
For further information regarding DS2, refer to the SAS® 9.4 DS2 Language Reference documentation. For further information regarding the SAS® In-Database Code Accelerator for Hadoop, refer to the SAS® 9.4 In-Database Products User's Guide.
And don’t forget to follow the Data Management section of the SAS Communities Library (Click Subscribe in the pink-shaded bar of the section) for more articles on DS2.
Am I missing something? If DS2 supports
IF (0=1);
Then wouldn't be easier to just convert
if origin = 'USA' then delete;
to
if (origin NE 'USA');
Tom,
Thank you for your comments. In this simple example, yes
if (origin NE 'USA');
would work as a viable substitute for the nested IF statement
IF (0=1);
and can be used if preferred. In working with our customers, we have often found their code to contain complex and multi-leveled IF-THEN/DELETE statements for which the nested
IF (0=1);
approach is better suited for their needs. Feel free to use the method that best suits your needs.
Personally using the RETURN statement to replace the DELETE statement makes much more sense to me than adding another IF into the program. Especially since the DELETE statement is really just doing the same thing as the RETURN statement does. It stops the current iteration of the data step and returns to start the next iteration.
You could be slightly more succinct by simply using
if 0;
as in
data cars;
set sashelp.cars;
run;
proc ds2;
data cars_out / overwrite=yes;
method run();
set cars;
if origin = 'USA' then
if 0;
end;
enddata;
run;
quit;
Since 0 is always false
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Data Literacy is for all, even absolute beginners. Jump on board with this free e-learning and boost your career prospects.