(This article is part of a growing collection of articles on feature engineering for data science. Find more links at the end of this article.)
Feature Engineering is an important tool for (supervised) machine learning. Model accuracy and interpretability benefits from variables that precisely describe the behavior of the analysis subjects. In many cases these features are derived from transactional data which are recorded, e.g. over time. In some cases, simple descriptive measures like the mean or the sum provide a very good picture of the analysis subjects. Often it is important to dig deeper to adequately describe the behavior. Analytic methods help to calculate KPIs that measure the trend over time per customer, the accordance with predefined pattern or the correlation of individual customer with the average customer.
Transposing longitudinal data to "WIDE" format also results in a table which has one record per analysis subset. Of course, SAS offers a large and powerful toolset to change between different data structures, e.g. by using the TRANSPOSE procedure, using the DATASTEP. Many SAS users have also programmed and published different macros for that purpose.
For powerful analytical models it is however not enough to "replicate" the original data just in another format. You want to describe the behavior of your analysis subjects. You can to this by calculating different analytic measures.
This article focuses on the "Describing the Trend over Time” and shows the rationale, the coding and the interpretation of the results.
Consider the following usage data for 10 customers in a wide format. WIDE refers here to the fact that the data are not in transactional/longitudinal form (LONG), but are already arranged for one row per customer. Variables M1-M6 contain the usage for the months 1-6.
data usage_wide;
input CustID M1 M2 M3 M4 M5 M6 8.;
Cards;
1 92 94 98 87 78 72
2 22 24 30 28 31 30
3 100 120 125 128 120 115
4 43 43 43 . 42 41
5 20 27 30 35 31 35
6 16 24 18 25 30 24
7 80 70 60 50 57 63
8 90 95 80 100 100 90
9 47 47 47 47 47 47
10 50 52 0 50 0 52
;
run;
Note that in the attached example program this data is available as a dataset with 10 customers. If you want to run the example on larger data, you can use the attached USAGE_LONG data directly in the REG procedure below.
In order to plot this data you have to transpose it to a long format. You can use the TRANSPOSE procedure for this purpose:
PROC TRANSPOSE DATA = usage_wide OUT = usage_long;
BY custId;
RUN;
In addition you need a SAS datastep to extract the month value (1-6) from the former variable names M1-M6.
DATA usage_LONG;
SET usage_LONG;
FORMAT Month 8.;
RENAME col1 = Usage;
Month = compress(_name_,'M');
DROP _name_;
RUN;
This leads to a dataset that has the following structure and content (for the first 2 customers):
Cust
Obs ID Usage Month
1 1 92 1
2 1 94 2
3 1 98 3
4 1 87 4
5 1 78 5
6 1 72 6
7 2 22 1
8 2 24 2
9 2 30 3
10 2 28 4
11 2 31 5
12 2 30 6
Next you use the SGPLOT procedure to create a line chart for selected customers (1,3,4,5,7):
proc sgplot data=usage_LONG;
series x=Month y=Usage / group=CustID lineattrs=(thickness = 3) curvelabel;
where custid in (1,3,4,5,7);
run;
Note that the CURVELABEL option is used to add the group-id at the end of each line.
You see the following lines:
If you now investigate the example lines for their long term trend (month 1-6) and for their short term trend (month 4-6) you can classify them as follows:
In order to perform such a classification automatically you can use the REG procedure as show in the following example:
(Note that in the attached example program this data is available as a dataset with 10 customers. If you want to run the example on larger data, you can use the attached USAGE_LONG data directly in the REG procedure below.)
PROC REG DATA = usage_LONG NOPRINT
OUTEST=Trend_LongTerm(KEEP = CustID month
RENAME = (month=LongTerm));
MODEL usage = month;
BY CustID;
RUN;
This results in a dataset with one row per subject, where the slot for each customer is stored:
Cust
ID LongTerm
1 -4.54286
3 2.22857
4 -0.39535
5 2.62857
7 -3.82857
PROC REG DATA = usage_LONG NOPRINT
OUTEST=Trend_ShortTerm(KEEP = CustID month
RENAME = (month=ShortTerm));
MODEL usage = month;
BY CustID;
WHERE month in (4 5 6);
RUN;
PROC FORMAT;
VALUE est LOW -< -1 = '-'
-1 - 1 = '='
1 <- HIGH = '+';
RUN;
DATA usage_wide_xt;
MERGE usage_wide
Trend_LongTerm
Trend_ShortTerm;
BY CustID;
Format ShortTerm LongTerm 8.1 LongShortInd $2.;
LongShortInd = CAT(put(LongTerm,est.),put(ShortTerm,est.));
RUN;
Cust Short Long
ID M1 M2 M3 M4 M5 M6 LongTerm Term ShortInd
1 92 94 98 87 78 72 -4.5 -7.5 --
2 22 24 30 28 31 30 1.7 1.0 +=
3 100 120 125 128 120 115 2.2 -6.5 +-
4 43 43 43 . 42 41 -0.4 -1.0 ==
5 20 27 30 35 31 35 2.6 0.0 +=
6 16 24 18 25 30 24 1.9 -0.5 +=
7 80 70 60 50 57 63 -3.8 6.5 -+
8 90 95 80 100 100 90 1.0 -5.0 =-
9 47 47 47 47 47 47 0.0 0.0 ==
10 50 52 0 50 0 52 -2.7 1.0 -=
proc sql;
create table long_xt
as select catx("|",a.CustID,b.LongShortInd) as NewID, a.*, b.LongShortInd
from usage_long as a
left join usage_wide_xt as b
on a.custid = b.custid
order by a.custid, a.month
;
quit;
proc sgplot data=long_xt;
series x=Month y=Usage / group=NewID lineattrs=(thickness = 3) curvelabel;*datalabel=NewID;
where custid in (1,3,4,5,7);
run;
The idea of this article is to illustrate how you can derive features from longitudinal data and use them for other data science tasks. The TREND feature which has been illustrated in this example are often a good indicator to describe the change of customer behavior over time. These features have been helpful for example to early indicate whether customers drop their usage and are at risk to cancel their contracts.
The article could also encourages you to to be creative, when it comes to using longitudinal data in your machine learning analyses.
Your data might require other forms of aggregations or calculations - Be creative! The analytic toolbox is very large and you can find the appropriate measure for your use case.
There is a SAS tip at Support.sas.com which contains non-commented version of this example: https://support.sas.com/kb/31/582.html
There are more SAS Communities article which show examples for feature engineering:
This example has been taken from my SAS Press book Applying Data Science - Business Case Studies Using SAS see chapter 19 and chapter 20
Further books of the author in SAS Press:
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Data Literacy is for all, even absolute beginners. Jump on board with this free e-learning and boost your career prospects.