Feature Engineering #1 - Using Correlation Analysis to Describe Behaviour over Time

1 Like

(This article is part of a growing collection of articles on feature engineering for data science. Find more links at the end of this article.)

0 Feature EngineerCorrelation.png Feature Engineering is an important tool for (supervised) machine learning. Model accuracy and interpretability benefits from variables that precisely describe the behavior of the analysis subjects. In many cases these features are derived from transactional data which are recorded, e.g. over time. In some cases, simple descriptive measures like the mean or the sum provide a very good picture of the analysis subjects. Often it is important to dig deeper to adequately describe the behavior. Analytic methods help to calculate KPIs that measure the trend over time per customer, the accordance with predefined pattern or the correlation of individual customer with the average customer.

Transposing the data is not enough! - Describe the behavior

1 long wide.png

Transposing longitudinal data to "WIDE" format also results in a table which has one record per analysis subset. Of course, SAS offers a large and powerful toolset to change between different data structures, e.g. by using the TRANSPOSE procedure, using the DATASTEP. Many SAS users have also programmed and published different macros for that purpose.

For powerful analytical models it is however not enough to "replicate" the original data just in another format. You want to describe the behavior of your analysis subjects. You can to this by calculating different analytic measures.

This article focuses on “correlation analysis between individual customers and the average customer” and shows the rationale, the coding and the interpretation of the results. There is also a link to a Youtube Recording with a presentation on this topic, which you can find below.

How similar is my customer with the average customer?

2 verlauf.png Let's have a look at a line chart with usage curves for 4 selected customers (orange, green violet, blue) and the average usage curve over all 50000 customers. You can see that some customers have a usage curve which is very similar with the average customer, e.g. the orange one, which other customers show a different pattern over time. You can describe the usage curves for example with the follow three parameters: 1) Average Level, 2) Variability 3) Similarity with the "average customer"

Describing the usage of the customers

The orange customer has a very similar, slightly lower LEVEL as the average customer. His VARIABILITY is over the six time periods is rather large (++). And his SIMILARITY with the average customer is very high (++). When the average of customers is rising their usage, the orange customer is also rising his usage.

You can document the measures for the orange customer as shown in the first line of the following table.

3 table with measures corr.png

When you look at the green customer, you see that his average level is almost the same as the average customer, so you can set a "=" sign. His variability is very high. His similarity with the average customer is not only low, it is negative. When the average goes up, the green customer decreases his usage and vice versa. So you can set "--" here.

Other remarkable behavior which you can see form the chart is the the violet customer has almost no variability, his usage stays at the same level over all periods.

And the blue customer has a remarkable low level with some variability and some similarity with the average customers.

You have now described the different usage of the customers using certain features. You are not looking at the original individual data any more, but on features that allow to infer behavior, study causality between factors or define segments.

Use statistical measures to describe the above mentioned features

In order to represent the features LEVEL, VARIABILITY and SIMILARITY with statistical measures, it is straightforward to use the MEAN for level and the STANDARD DEVIATION for variability. (Pearson) CORRELATION can be used to describe similarity between two vectors of measures. The CORR procedure in SAS not only allows to calculate Pearson or Spearman correlation, it also calculates the mean and the standard deviation and outputs it into a dataset. The following 3 steps show how to code the example with SAS.

Step 1 - Calculate the average usage per time interval ("Average Customer")

In the first step you calculate the average usage per time interval. The code below shows how you can do this with a SQL query. You could also use SAS procedures like the MEANS, or the SUMMARY procedure for aggregation.

proc sql;
 create table monthly_average
 as select month,
           mean(usage) as MonthlyAverage format = 8.2
    from usage
    group by month
    order by month;
quit;

Thus code generates a dataset with one entry per time interval:

           Monthly
 Month     Average
     1       59.01
     2       62.79
     3       55.11
     4       62.68
     5       54.59
     6       58.49

Step 2 - Join the lookup table with the average usage to the original data

In the next step, you join this (lookup) table to the original data. You need to do this because the CORR procedure which will be used to calculate the correlations per CUSTID, needs to have the two variables in columns in the same dataset.

The code uses a SQL query to join the tables. A session describing different methods in SAS to join lookup tables to a master table will be published soon in my youtube webinar playlist.

proc sql;
 create table usage_enh
 as
 select u.custid,
        u.month,
        u.usage,
        m.MonthlyAverage
 from usage as u,
      Monthly_Average as m
 where u.month = m.month;
quit;

The resulting table looks as follows. You see that the 6 monthly average values for the months 1-6 are replicated for each CustID

4 corr analysis table.png

Step 3 - Calculate the correlation between the average customer and the individual values per customer

In the next step you use the CORR procedure to calculate the (Spearman) correlation between the usage values of each customer and the monthly average of all customers.

PROC CORR DATA = usage_enh 
          OUTS = Corr_Usage NOPRINT;
 BY CustID;
 VAR Usage;
 WITH MonthlyAverage;
RUN;

Note the following from the code:

Don't forget to specify a NOPRINT statement, otherwise the printed output with correlation results for each BY group will fill up your output window or output file.
Use the BY statement to calculate the correlation coefficients by CustID. Note that the dataset needs to be sorted BY CUSTID.
The OUTS= option defines the dataset, where the correlation coefficients shall be written to. The "S" in "OUTS" defines that you calculate the Spearman correlation coefficients. Use OUTP to receive Pearson correlations.
The 2 variables USAGE and MONTHLYAVERAGE are specified with the VAR and the WITH statement.

The resulting output dataset looks as follows:

5 corr stat long.png

You see that the relevant statistics (mean, standard deviation and correlation) are available in the output table. You just need to re-arrange the structure to a one-row-per-subject data structure to use it for machine learning purposes.

Step 4 - Rearrange the data to a one-row-per-subject data structure

You can use the TRANSPOSE procedure to arrange the data with one row per subject.

Use a BY statement to run the transpose per customer.
The ID statement specifies with column in the input data contains the variable names in the output data
The VAR statement specifies the variable that is used for the values.

proc transpose data=Corr_Usage out=customer_ABT(drop= _name_);
by custid;
id _type_;
var usage;
run;

Finally you can perform some formatting on the table and rename column names and specify formats.

data customer_ABT;
 set customer_ABT;
 drop n;
 rename mean=Level std=Variability corr=Similarity;
 format mean 8.1 std 8.1 corr 8.2;
run;

The resulting table shows that you have derived 3 features, LEVEL, VARIABILITY and SIMILARITY from the original transactional table. You can use this table now as addtitional input variables in supervised machine learning. Or you can use the variables to segment your customers into groups with similar behavior, e.g. AVERAGE LEVEL/LOW VARIABILITY/HIGH SIMILARITY.

Conclusion

The idea of this article is to illustrate how you can derive features from longitudinal data and use them for other data science tasks. You have seen how easy it is to code the creation of 3 powerful features with SAS code.

Be creative!

The article could also encourage you to to be creative, when it comes to using longitudinal data in your machine learning analyses. In my example the SIMILIRATY feature has been a good indicator whether a customer is close to the average customer or, for whatever reason, is behaving rather in the opposite way. This indicator can be a good predictor for product upselling or cancellation risk analyses.

Your data might require other forms of aggregations or calculations - Be creative! The analytic toolbox is very large and you can find the appropriate measure for your use case.

7 feature2.png

Youtube Webinar

There is a webinar on the same topic in my Data Preparation for Data Science Playlist. It explains the case as a software demo.

Links and References

There are more SAS Communities article which show examples for feature engineering:

This example has been taken from my SAS Press book Data Preparation for Analytics Using SAS see chapter 18.3, page 184pp

Further books of the author in SAS Press:

"Ask-the-expert sessions" on Data Preparation for Data Science (in German):
- https://www.sas.com/de_at/webinars/ask-the-expert-serie-data-preparation-fuer-data-science-teil1.htm...
- https://www.sas.com/de_at/webinars/ask-the-expert-serie-data-preparation-fuer-data-science-teil2.htm...
Slide Collection: Slides Deck #121 and #122 at https://github.com/gerhard1050/DataScience-Presentations-By-Gerhard