We’re smarter together. Learn from this collection of community knowledge and add your expertise.

How to Simulate the IF-THEN/DELETE Statement in DS2 and SAS® In-Database Code Accelerator for Hadoop

by SAS Employee brian_kinnebrew_sas on ‎06-09-2016 02:55 PM - edited on ‎06-10-2016 09:36 AM by Community Manager (2,078 Views)

DS2 is a proprietary object-oriented programming language that is appropriate for use with advanced data manipulation and preparation, data modeling applications, and computationally complex code.  It is a procedural language that contains variables and scope, methods, packages, and parallel programming statements. Methods and packages allow for modularity and data encapsulation similar to classes and functions in other languages such as Python. Included with Base SAS, DS2 intersects with traditional DATA Step. While traditional DATA Step supports only numeric and character data types, DS2 provides support for a wide range of additional data types. DS2 also provides support for ANSI SQL types, programming structure elements, and user-defined methods and packages. DS2 enables you to embed FedSQL on the SET statement, thus blending DATA Step and PROC SQL into one powerful block of data manipulation code. The DS2 language shares many core features and language elements with traditional DATA Step including statements, functions, and formats (SAS and user-defined).

 

With the SAS® In-Database Code Accelerator for Hadoop, you can publish and execute a DS2 thread and data program in parallel inside a Hadoop cluster using the SAS® Embedded Process. Thread programs can include data transformations, computationally complex code, BY group processing, and many other language elements. DS2 threads are stored as SAS data sets in libraries. Those stored in permanent libraries can be reused.  One of the primary benefits of using the SAS® In-Database Code Accelerator for Hadoop is the elimination of data movement. Instead of extracting data from your Hadoop cluster and executing code on your SAS server or a local machine, the DS2 thread and data programs are pushed down inside Hadoop for execution. As long as the source and target data reside in Hadoop, no data movement will occur.

 

While the capabilities of DS2 extend beyond those of traditional DATA Step, there are some features and functionalities that are not currently supported. In my previous article, I demonstrated how to simulate the LAG and DIF functions in DS2. DS2 supports subsetting IF statements, IF-THEN/ELSE, and IF-THEN/DO statements, but it does not explicitly support the IF-THEN/DELETE statement. In this article, I’ll provide a couple of simple techniques to simulate IF-THEN/DELETE logic. Follow the community for future articles on other DS2 considerations.

 

IF-THEN/DELETE in traditional DATA Step

The following example demonstrates the use of the IF-THEN/DELETE statement in traditional DATA Step.

  

data work.cars_out;
set work.cars;
if origin = 'USA' then delete;
run;

 

Picture1.png

 IF-THEN/DELETE in Base DS2

The first technique to simulate the IF-THEN/DELETE statement uses logic comparable to a subsetting IF statement, namely an IF-THEN/RETURN statement. In the following example, the desired output in the WORK.CARS_OUT data set is all observations where the variable ‘origin’ has a value other than ‘USA’.  To accomplish this, place appropriate logic in the IF statement that will evaluate to a nonzero value and cause the expression to be true. The statement if origin = 'USA' is true since the input data set contains observations where origin = 'USA'. As a result, the subsequent RETURN statement will return processing back to the RUN method to process the next observation. The implicit output at the end of the RUN method isn’t processed and no data is written to WORK.CARS_OUT. Once the IF statement evaluates to a zero value causing the expression to be false (if origin ≠ 'USA'), the RETURN statement will be ignored, the remainder of the code will execute including the implicit output at the end of the RUN method, and therefore output will be written to WORK.CARS_OUT. This entire process will occur across all observations in the input data set since the RUN method has the same automatic, implicit looping feature as traditional DATA Step.

  

proc ds2;
data work.cars_out / overwrite=yes;
  method run();
    set work.cars;
    if origin = 'USA' then return;
  end;
enddata;
run;
quit;

 

Picture2.png

An alternative technique uses a nested IF statement where the nested IF statement evaluates to a zero value causing the expression to be false.  In the following example, if (0=1) is false since (0 ≠ 1). Consequently, if origin = 'USA' is also false and WORK.CARS_OUT will contain all observations where the variable ‘origin’ has a value other than ‘USA’.

  

proc ds2; 
data work.cars_out / overwrite=yes;
  method run();
    set work.cars;
    if origin = 'USA' then
      if (0=1);
  end;
enddata;
run;
quit;

 

Incorporating DS2 with the SAS® In-Database Code Accelerator for Hadoop

Either of these techniques can be placed in a DS2 thread for use with the SAS® In-Database Code Accelerator for Hadoop. The thread is then called from a DS2 DATA program. In this example, the source and destination data are read from and written to a Hive location referenced by the libref MYHIVE. This is required for the code to execute inside the Hadoop cluster. The thread can be stored in a temporary library (WORK) or a permanent library (MYHIVE or other location) for reuse.

  

proc ds2;
thread work.delete / overwrite=yes;
  method run();
    set myhive.cars;
    if origin = 'USA' then return;
  end;
endthread;

data myhive.cars_out (overwrite=yes);
dcl thread work.delete del;
    method run();
    set from del;
    end;
enddata;
run;
quit;

OR

  

proc ds2;
thread work.delete / overwrite=yes;
  method run();
    set myhive.cars;
    if origin = 'USA' then
     if (0=1);
  end;
endthread;

data myhive.cars_out (overwrite=yes);
dcl thread work.delete del;
    method run();
    set from del;
    end;
enddata;
run;
quit;

 

For further information regarding DS2, refer to the SAS® 9.4 DS2 Language Reference documentation.  For further information regarding the SAS® In-Database Code Accelerator for Hadoop, refer to the SAS® 9.4 In-Database Products User's Guide.

And don’t forget to follow the Data Management section of the SAS Communities Library (Click Subscribe in the pink-shaded bar of the section) for more articles on DS2.

Comments
by Super User
on ‎06-09-2016 04:57 PM

Am I missing something?  If DS2 supports

IF (0=1);

Then wouldn't be easier to just convert 

if origin = 'USA' then delete;

to

if  (origin NE 'USA');

 

by SAS Employee brian_kinnebrew_sas
on ‎06-10-2016 09:59 AM

Tom,

 

Thank you for your comments.  In this simple example, yes 

if  (origin NE 'USA'); 

would work as a viable substitute for the nested IF statement

IF (0=1);

and can be used if preferred.  In working with our customers, we have often found their code to contain complex and multi-leveled IF-THEN/DELETE statements for which the nested

IF (0=1);

approach is better suited for their needs.  Feel free to use the method that best suits your needs.

by Super User
on ‎06-10-2016 10:29 AM

Personally using the RETURN statement to replace the DELETE statement makes much more sense to me than adding another IF into the program. Especially since the DELETE statement is really just doing the same thing as the RETURN statement does.  It stops the current iteration of the data step and returns to start the next iteration.

Your turn
Sign In!

Want to write an article? Sign in with your profile.


Looking for the Ask the Expert series? Find it in its new home: communities.sas.com/askexpert.