Perform computations across multiple events in SAS ESP

Computing a new variable based on values coming from different rows of data is something very usual in traditional SQL-based data management. And SAS provides powerful and easy-to-use techniques to do that, among them RETAIN, LAG, etc. In real-time contexts, this need is probably less frequent but I found at least one use case where I needed to calculate a geographical distance between 2 GPS positions coming from 2 different events. And I will show you 2 methods to achieve this in the SAS Event Stream Processing engine. There are probably additional methods.

Analyzing an iPhone streaming data with SAS ESP, I wanted to go further and add some location data to the collected metrics. SensorLog (the iPhone app used to stream your iPhone real-time data to a socket or a file) allows you to do that. You can then run that app in your car while driving and simulate some “connected car” location data.

Well, let me get to the point. I have streaming data coming from my iPhone at a 30Hz speed. Each event contains the location (latitude and longitude) coordinates of the iPhone position. And of course I want to compute the distance between 2 consecutive points (meaning 2 consecutive events) as well as accumulate those distances into another field to derive the total distance travelled by my iPhone.

Here is an extract of the stream:

15JUN2016:18:08:00.040,...,iPhone_ABC,...,47.214771439,-1.537893405,... 15JUN2016:18:08:00.094,...,iPhone_ABC,...,47.21475258,-1.537869601,...

Here is my continuous query and the 2 main approaches I’ve used:

First method

The first thing to do is to gather the 2 GPS coordinates coming from 2 different events into 1 event. To do that we will use special ESP aggregation functions:

ESP_aLast(locationLatitude) to get the “locationLatitude” value of the last event that affected the group (the device id is the group by variable) = the “locationLatitude” value of the currently processed event in the group
ESP_aLag(locationLatitude,1) to get the “locationLatitude” value of the previous event (lag value is 1) that affected the group = the “locationLatitude” value of the previously processed event in the group

Here is the Aggregate window XML code:

<window-aggregate name="getPreviousCoordinatesPerDevice" insert-only="false" output-insert-only="false" collapse-updates="true">

   <schema>

      <fields>

         <field name="deviceID" type="string" key="true"/>

         <field name="loggingSample" type="int32"/>

         <field name="loggingTime" type="stamp"/>

         <field name="locationLatitude" type="double"/>

         <field name="locationLongitude" type="double"/>

         <field name="locationSpeed" type="double"/>

         <field name="previousLatitude" type="double"/>

         <field name="previousLongitude" type="double"/>

      </fields>

   </schema>

   <output>

      <field-expr><![CDATA[ESP_aLast(loggingSample)]]></field-expr>

      <field-expr><![CDATA[ESP_aLast(loggingTime)]]></field-expr>

      <field-expr><![CDATA[ESP_aLast(locationLatitude)]]></field-expr>

      <field-expr><![CDATA[ESP_aLast(locationLongitude)]]></field-expr>

      <field-expr><![CDATA[ESP_aLast(locationSpeed)]]></field-expr>

      <field-expr><![CDATA[ESP_aLag(locationLatitude,1)]]></field-expr>

      <field-expr><![CDATA[ESP_aLag(locationLongitude,1)]]></field-expr>

   </output>

</window-aggregate>

As a result, I will get locationLatitude, locationLongitude, previousLatitude, previousLongitude on the same event.

I can then do simple computations using the Compute window:

<field-expr>

   <![CDATA[geodistance_haversine(previousLatitude,previousLongitude,locationLatitude,

                                  locationLongitude)*1.609344]]>

</field-expr>

NB: geodistance_haversine is a new DataFlux Expression Language 2.7 function. It computes the distance between two geographical points. This formula provides greater accuracy, particularly for shorter distances. 1.609344 is used to convert miles to kilometers.

You can also use User Defined Function and accumulate distances in a separate variable.

<udf name="getDistance" type="double">

   <![CDATA[

      private real d ;

      d=geodistance_haversine(previousLatitude,previousLongitude,          

                              locationLatitude,locationLongitude)*1.609344 ;

      if isnull(d) then cum=cum else cum=cum+d ;

      return cum;

]]>

</udf>

You have to initialize some of your variables in an “Expression Engine Initialization Block”.

Second method

The second method is more familiar to SAS developers, because it uses ESP capabilities to run SAS DS2 code. Thus, you can rely on standard SAS capabilities to retain values from previous events.

DS2 code can be used in an ESP Procedural window using a DS2 Code Handler. DS2 is fully integrated into ESP, meaning that DS2 code is executed by the ESP engine. Always prefer using DS2 code instead of traditional Data Step code in a Procedural window. Traditional Data Step has to be executed in a SAS session outside of ESP, with a severe performance impact.

The following DS2 code enables us to:

Specify the default output stream with the reserved word esp.out
Declare some global variables
Retain and initialize global variables
Set retained variables to previous values before reading a new event
Read the input stream using the esp.in reserved word in a run() method (the run() method is equivalent to the SAS Data Step natural loop ; it is run each time an event is read)
Do some computations, especially the distance calculation using the geodist function (SAS DS2 function)

data esp.out ;

   declare double prevLatitude prevLongitude dist cumDist cumTrafficJamDist ;

   retain locationLatitude locationLongitude ;

   retain cumDist cumTrafficJamDist 0 ;

   method run() ;

      prevLatitude=locationLatitude ;

      prevLongitude=locationLongitude ;

      set esp.in ;

      dist=geodist(locationLatitude,locationLongitude,prevLatitude,prevLongitude) ;

      cumDist=sum(cumDist,dist) ;

      if locationSpeed < 3 then cumTrafficJamDist=sum(cumTrafficJamDist,dist) ;

   end ;

enddata ;

Conclusion

Both methods are different not only in the design but also in the results. Computations results are different because of using different distance calculation functions (DataFlux vs SAS). But also in the number of output events. With the Aggregate window, we only maintain 1 event per device, which is continuously updated. With the Procedural window, we keep track of all the incoming events, assuming here that they are coming from only 1 device.

Perform computations across multiple events in SAS ESP

Register Today!

Free course: Data Literacy Essentials

Get Started