Feature Engineering #3 – Describing the Trend over Time

2 Likes

(This article is part of a growing collection of articles on feature engineering for data science. Find more links at the end of this article.)

Feature Engineering is an important tool for (supervised) machine learning. Model accuracy and interpretability benefits from variables that precisely describe the behavior of the analysis subjects. In many cases these features are derived from transactional data which are recorded, e.g. over time. In some cases, simple descriptive measures like the mean or the sum provide a very good picture of the analysis subjects. Often it is important to dig deeper to adequately describe the behavior. Analytic methods help to calculate KPIs that measure the trend over time per customer, the accordance with predefined pattern or the correlation of individual customer with the average customer.

Transposing the data is not enough! - Describe the behavior

Transposing longitudinal data to "WIDE" format also results in a table which has one record per analysis subset. Of course, SAS offers a large and powerful toolset to change between different data structures, e.g. by using the TRANSPOSE procedure, using the DATASTEP. Many SAS users have also programmed and published different macros for that purpose.

For powerful analytical models it is however not enough to "replicate" the original data just in another format. You want to describe the behavior of your analysis subjects. You can to this by calculating different analytic measures.

This article focuses on the "Describing the Trend over Time” and shows the rationale, the coding and the interpretation of the results.

Example Data "Service Usage"

Consider the following usage data for 10 customers in a wide format. WIDE refers here to the fact that the data are not in transactional/longitudinal form (LONG), but are already arranged for one row per customer. Variables M1-M6 contain the usage for the months 1-6.

data usage_wide;
 input CustID M1 M2 M3 M4 M5 M6 8.;
 Cards;
1 92 94 98 87 78 72
2 22 24 30 28 31 30
3 100 120 125 128 120 115
4 43 43 43 . 42 41
5 20 27 30 35 31 35
6 16  24 18 25 30 24
7 80 70 60 50 57 63
8 90 95 80 100 100 90
9 47 47 47 47 47 47
10 50 52 0 50 0 52
;
run;

Note that in the attached example program this data is available as a dataset with 10 customers. If you want to run the example on larger data, you can use the attached USAGE_LONG data directly in the REG procedure below.

Investigating the data graphically

Step 1 - Transpose from a WIDE to a LONG structure

In order to plot this data you have to transpose it to a long format. You can use the TRANSPOSE procedure for this purpose:

PROC TRANSPOSE DATA = usage_wide OUT = usage_long;
 BY custId;
RUN;

In addition you need a SAS datastep to extract the month value (1-6) from the former variable names M1-M6.

DATA usage_LONG;
 SET usage_LONG;
 FORMAT Month 8.;
 RENAME col1 = Usage;
 Month = compress(_name_,'M');
 DROP _name_;
RUN;

This leads to a dataset that has the following structure and content (for the first 2 customers):

        Cust
 Obs     ID     Usage       Month

   1      1       92            1
   2      1       94            2
   3      1       98            3
   4      1       87            4
   5      1       78            5
   6      1       72            6
   7      2       22            1
   8      2       24            2
   9      2       30            3
  10      2       28            4
  11      2       31            5
  12      2       30            6

Step 2 - Use the SGPLOT procedure

Next you use the SGPLOT procedure to create a line chart for selected customers (1,3,4,5,7):

proc sgplot data=usage_LONG;
 series x=Month y=Usage / group=CustID lineattrs=(thickness = 3) curvelabel;
 where custid in (1,3,4,5,7);
run;

Note that the CURVELABEL option is used to add the group-id at the end of each line.

Business interpretation of the line curves

You see the following lines:

If you now investigate the example lines for their long term trend (month 1-6) and for their short term trend (month 4-6) you can classify them as follows:

Customer 4: LongTerm Trend FLAT (=), ShortTerm Trend FLAT (=)
Customer 5: LongTerm Trend UP (+), ShortTerm Trend UP (+)
Customer 3: LongTerm Trend UP (+), ShortTerm Trend DOWN (-)
Customer 1: LongTerm Trend DOWN (-), ShortTerm Trend FLAT (-)

Describing the TREND with the REG procedure

In order to perform such a classification automatically you can use the REG procedure as show in the following example:

(Note that in the attached example program this data is available as a dataset with 10 customers. If you want to run the example on larger data, you can use the attached USAGE_LONG data directly in the REG procedure below.)

PROC REG DATA = usage_LONG NOPRINT 
         OUTEST=Trend_LongTerm(KEEP = CustID month
                             RENAME = (month=LongTerm));
 MODEL usage = month;
 BY CustID;
RUN;

You run the analysis for each Customer (BY CustID). Make sure that your data are sorted by CustID.
You model the dependent variable USAGE on the independent variable MONTH.
It is important to use the NOPRINT option to avoid the create of printed output in the output window for each BY group!
You are interested in the SLOPE of the regression line. You can obtain the parameter estimates for example by using the OUTEST= option. The SLOPE value is stored under the same variable name as the independent variable (MONTH).
You use the RENAME option to rename the SLOPE value for the long-term trend to LONGTERM in the dataset TREND_LONGTERM.

This results in a dataset with one row per subject, where the slot for each customer is stored:

Cust
 ID     LongTerm

  1     -4.54286
  3      2.22857
  4     -0.39535
  5      2.62857
  7     -3.82857

Calculate the Short-Term trend

You calculate the short-term trend with similar statements. Here you add a WHERE clause to restrict the analysis to months 4-6.

PROC REG DATA = usage_LONG NOPRINT 
         OUTEST=Trend_ShortTerm(KEEP = CustID month
                                RENAME = (month=ShortTerm));
  MODEL usage = month;
  BY CustID;
  WHERE month in (4 5 6);
RUN;

Combine the long-term and the short-term trend

You now want to combine the original customer data with the coefficients for the longterm and the shortterm trend.

You use a datastep with a MERGE statement as shown below to merge the 3 datasets.

In order to classify the values for SLOPE into "-" (if lower than -1), "+" (if larger than 1) and "=" otherwise, you can use the FORMAT procedure to specify a format for this task.

PROC FORMAT;
 VALUE est LOW -< -1     = '-'
           -1  -  1      = '='
            1   <- HIGH  = '+';
RUN;

You use this format EST. in the datastep to derive a new variable which concatenates the class for the longterm and the shortterm trend.

DATA usage_wide_xt;
 MERGE usage_wide
       Trend_LongTerm
       Trend_ShortTerm;
 BY CustID;
 Format ShortTerm LongTerm 8.1 LongShortInd $2.;
 LongShortInd = CAT(put(LongTerm,est.),put(ShortTerm,est.));
RUN;

The PUT statement with the created EST. format is used to convert the numeric slope value into the classes "+", "=", "-".

You use the CAT function to concatenate these values.

The resulting output in dataset looks as follows:

 Cust                                                             Short      Long
  ID      M1     M2     M3     M4     M5     M6    LongTerm        Term    ShortInd

   1      92     94     98     87     78     72        -4.5        -7.5       --   
   2      22     24     30     28     31     30         1.7         1.0       +=   
   3     100    120    125    128    120    115         2.2        -6.5       +-   
   4      43     43     43      .     42     41        -0.4        -1.0       ==   
   5      20     27     30     35     31     35         2.6         0.0       +=   
   6      16     24     18     25     30     24         1.9        -0.5       +=   
   7      80     70     60     50     57     63        -3.8         6.5       -+   
   8      90     95     80    100    100     90         1.0        -5.0       =-   
   9      47     47     47     47     47     47         0.0         0.0       ==   
  10      50     52      0     50      0     52        -2.7         1.0       -=

You see that you have used the REG procedure to derive the trend feature for both, the longterm and the shortterm trend. These features can be used as input variables for example in supervised machine learning models.

You have also classified these trend values and concatenated them to an easy-to-interpret category which shows how how the long and short term trend looks for each customers.

Optional: Display the classification in the line chart (for selected customers)

Adding the classification back to LONG data structure

If you want to add this category to the LONG data structure to be able to plot it in a line chart, you can use the SQL procedure and a left join.

proc sql;
 create table long_xt
 as select catx("|",a.CustID,b.LongShortInd) as NewID, a.*, b.LongShortInd
    from usage_long as a
    left join usage_wide_xt as b
    on a.custid = b.custid
    order by a.custid, a.month
;
quit;

Running the SGPLOT procedure

Now you can again use the SPLOT procedure to plot the lines as shown above. However you use the newly created NewID variable as GROUP variable and show them at the end of each line using the CURVELABEL option.

proc sgplot data=long_xt;
 series x=Month y=Usage / group=NewID lineattrs=(thickness = 3) curvelabel;*datalabel=NewID;
 where custid in (1,3,4,5,7);
run;

In the visual verification you see that the categorization of the usage courses which have been discussed above are now automatically created for each customer.

Conclusion

The idea of this article is to illustrate how you can derive features from longitudinal data and use them for other data science tasks. The TREND feature which has been illustrated in this example are often a good indicator to describe the change of customer behavior over time. These features have been helpful for example to early indicate whether customers drop their usage and are at risk to cancel their contracts.

Be creative!

The article could also encourages you to to be creative, when it comes to using longitudinal data in your machine learning analyses.

Your data might require other forms of aggregations or calculations - Be creative! The analytic toolbox is very large and you can find the appropriate measure for your use case.

Links and References

There is a SAS tip at Support.sas.com which contains non-commented version of this example: https://support.sas.com/kb/31/582.html

There are more SAS Communities article which show examples for feature engineering:

This example has been taken from my SAS Press book Applying Data Science - Business Case Studies Using SAS see chapter 19 and chapter 20

Further books of the author in SAS Press:

"Ask-the-expert sessions" on Data Preparation for Data Science (in German):
- https://www.sas.com/de_at/webinars/ask-the-expert-serie-data-preparation-fuer-data-science-teil1.htm...
- https://www.sas.com/de_at/webinars/ask-the-expert-serie-data-preparation-fuer-data-science-teil2.htm...
Slide Collection: Slides Deck #121 and #122 at https://github.com/gerhard1050/DataScience-Presentations-By-Gerhard