How to emulate LAG and DIF functions in DS2 and the SAS® In-Database Code Accelerator for Hadoop

3 Likes

DS2 is a proprietary object-oriented programming language that is appropriate for use with advanced data manipulation and preparation, data modeling applications, and computationally complex code. DS2 is a procedural language containing variables and scope, methods, packages, and parallel programming statements. Methods and packages allow for modularity and data encapsulation similar to classes and functions in other languages such as Python.

Included with Base SAS, DS2 intersects with traditional DATA Step. While traditional DATA Step supports only numeric and character data types, DS2 provides support for a wide range of additional data types. DS2 also provides support for ANSI SQL types, programming structure elements, and user-defined methods and packages. DS2 enables you to embed FedSQL on the SET statement, thus blending DATA Step and PROC SQL into one powerful block of data manipulation code. The DS2 language shares many core features and language elements with traditional DATA Step including statements, functions, and formats (SAS and user-defined).

With the SAS® In-Database Code Accelerator for Hadoop, you can publish and execute a DS2 thread and data program in parallel inside a Hadoop cluster using the SAS® Embedded Process. Thread programs can include data transformations, computationally complex code, BY group processing, and many other language elements. DS2 threads are stored as SAS data sets in libraries. Those stored in permanent libraries can be reused.

One of the primary benefits of using the SAS® In-Database Code Accelerator for Hadoop is it eliminates data movement. Instead of extracting data from your Hadoop cluster and executing code on your SAS server or a local machine, the DS2 thread and data programs are pushed down inside Hadoop for execution. As long as the source and target data reside in Hadoop, no data movement will occur.

While the capabilities of DS2 extend beyond those of traditional DATA Step, there are some features and functionalities that are not currently supported. I’ll focus on two of these functions, LAG and DIF, in this article. These functions are widely used in the banking and financial services sectors, particularly when analyzing transactional data. Although not currently supported by DS2, I’ll demonstrate how to emulate them and achieve the same results using a simple technique. Follow the community for future articles on other DS2 considerations.

LAG and DIF in traditional DATA Step

The following example demonstrates the use of LAG and DIF in traditional DATA Step.

data work.class_output;
set work.class;
  agelag = lag(age);
  agedif = dif(age);
  heightlag = lag(height);
  heightdif = dif(height);
run;

Since the LAG and DIF functions are not supported in DS2, the same results can be achieved using the RETAIN and OUTPUT statements. The RETAIN statement requires a global declaration and therefore must be placed prior to the RUN() method. Place the DIF variables and calculations prior to the output statement and the LAG variables and calculations after the output statement.

proc ds2;
data work.class_output (overwrite = yes);
dcl double agelag agedif heightlag heightdif;
retain agelag agedif heightlag heightdif;
  method run();
    set work.class;
	agedif = age - agelag;
	heightdif = height - heightlag;
	output;
	agelag = age;
	heightlag = height;
  end;
enddata;
run;
quit;

Incorporating DS2 with the SAS® In-Database Code Accelerator for Hadoop

This same technique can be applied in a DS2 thread for use with the SAS® In-Database Code Accelerator for Hadoop. The thread is then called from a DS2 DATA program. In this example, the source and destination data are read from and written to a Hive location referenced by the libref MYHIVE. This is required for the code to execute inside the Hadoop cluster. The thread can be stored in a temporary library (WORK) or a permanent library (MYHIVE or other location) for reuse.

proc ds2;
thread work.lagdif / overwrite=yes;
dcl double agelag agedif heightlag heightdif; 
retain agelag agedif heightlag heightdif;
    method run();
      set myhive.class;
	  agedif = age - agelag;
	  heightdif = height - heightlag;
	  output;
	  agelag = age;
	  heightlag = height;
	end;
endthread;

data myhive.class_output (overwrite=yes);
dcl thread work.lagdif ld;
    method run();
      set from ld;
    end;
enddata;
run;
quit;

For further information regarding DS2, refer to the SAS® 9.4 DS2 Language Reference documentation. For further information regarding the SAS® In-Database Code Accelerator for Hadoop, refer to the SAS® 9.4 In-Database Products User's Guide.

And don’t forget to follow the Data Management section of the SAS Communities Library (Click Subscribe in the pink-shaded bar of the section) for more articles on DS2.

ChrisNZ · ‎06-15-2016

Genuine question: While interesting, isn't this of very limited use considering proc ds2's two main selling points are to use RDBMS data as is (and therefore process SQL data , which comes in random order) and to run multiple threads (which also brings data in unpredictable order)?

mkeintz · ‎11-26-2016

I have seen PROC DS2 presentations simulating lags within BY groups. Record order is preserved within group, but between group order would be unpredictable. In our environment (times series of thousands of days for thousand of stocks), that's a perfectly acceptable result, when using stock id as the BY variable.

So my question is: does this mean by groups are assigned to threads?

briankinnebrew · ‎11-28-2016

DS2 BY-group processing groups the rows from input tables and orders the rows by values of one or more columns in the BY statement.

With in-database processing, data is distributed on different data partitions. Each DS2 thread running inside the database has access to one data partition. Each DS2 thread can group and order only the rows in the same data partition. Consequently, the data partition might have only part of the entire group of data. You must do a final aggregation in the main data program.

But, in some instances, it is necessary for each thread to process the entire group of data. The SAS In-Database Code Accelerator provides a way to redistribute the input table to the thread program with a BY statement so that the entire group of data resides on the same data partition.

The PROC DS2 statement BYPARTITION argument controls whether the input data is re-partitioned. By default, the input data for the DS2 program is automatically re-partitioned by the first BY variable. All of the BY groups are in the same data partition and processed by the same thread. Each thread does the BY processing for the entire group of data. You might not need to do the final aggregation in the main data program.

For more information, see “Interleaving” in SAS DS2 Language Reference<>, and the DS2 procedure in Base SAS Procedures Guide.

How to emulate LAG and DIF functions in DS2 and the SAS® In-Database Code Accelerator for Hadoop

SAS Innovate 2025: Call for Content

Free course: Data Literacy Essentials

Get Started