(This article is part of a growing collection of articles on feature engineering for data science. Find more links at the end of this article.)
Feature Engineering is an important tool for (supervised) machine learning. Model accuracy and interpretability benefits from variables that precisely describe the behavior of the analysis subjects. In many cases these features are derived from transactional data which are recorded, e.g. over time. In some cases, simple descriptive measures like the mean or the sum provide a very good picture of the analysis subjects. Often it is important to dig deeper to adequately describe the behavior. Analytic methods help to calculate KPIs that measure the trend over time per customer, the accordance with predefined pattern or the correlation of individual customer with the average customer.
Transposing longitudinal data to "WIDE" format also results in a table which has one record per analysis subset. Of course, SAS offers a large and powerful toolset to change between different data structures, e.g. by using the TRANSPOSE procedure, using the DATASTEP. Many SAS users have also programmed and published different macros for that purpose.
For powerful analytical models it is however not enough to "replicate" the original data just in another format. You want to describe the behavior of your analysis subjects. You can to this by calculating different analytic measures.
This article focuses on the "accordance of an analysis subject with predefined pattern” and shows the rationale, the coding and the interpretation of the results. You will see that it is very easy in SAS to use the FREQ procedure to create a powerful feature that helps to describe the behavior of analysis subjects. This feature can also be used to rank the analysis subjects by the accordance with the predefined pattern and to prioritize analysis subjects, in our case account managers, for further investigation and discussion.
There is also a link to a Youtube Recording with a presentation on this topic, which you can find below.
Assume a company which is active in catering business. The company runs some own restaurants in different (tourism) regions of Austria, and it also performs catering at large events. In order to fulfill the demand, the company also has sub-contrators which provide staff, logistics, services, food, and drinks. The sales force of this company should make pre-contracts with companies to secure the capacities already early, to be able to fulfill the demand throughout the year.
The historic monthly demand can be aggregated in different ways in SAS. Here an example using the FREQ procedure is shown.
proc freq data=sales_month;
table month / nocum out=HistoricDemand(rename=(percent=HistoricPct));
weight Sales_EUR;
run;
The WEIGHT statement is used to aggregate the sales amount in euros. The TABLE statement uses month as a category to have monthly aggregates. The output statistics PERCENT is used to describe the relative distribution per calendar month and is renamed to "HistoricPct".
proc sgplot data=HistoricDemand;
vline Month / response=HistoricPct lineattrs=(thickness=3 pattern=1 color=grey ) ;
yaxis min=0 max=20;
run;
The SGPLOT procedure and the VLINE statement is used to draw a line chart for the historic percentages.
You see a remarkable historic pattern with a first peak in February. This can be due to the business in some skiing regions where February is a very busy month due to the winter school holidays in these months. The summer months from May till August also have a very high value, which might be due to many outdoor events and open air concert caterings. Finally December shows a peak due to Christmas parties and also skiing events in the Christmas holidays and New Years Eve.
If you just plot the individual values against the pre-defined pattern you can hardly see who has a similar pattern. Using analytics measures like the Chi2-statistic allows you to order the account managers by their similarity with the predefined pattern.
You use the FREQ procedure to calculate the Chi2 statistic for each AccountManager.
ods exclude all;
proc freq data=sales_month ;
by AccountManager;
table month / nocum out=Sales_AccMgr chisq(testp=HistoricDemand(rename=(HistoricPct=_testp_)));
weight Sales_EUR;
ods output OneWayChiSq=Chi2_AccMgr(drop=table label cvalue);
run;
ods exclude none;
The output dataset has the following content and structure. Multiple lines are shown per account manager. So you either need to pick one statistic or it needs to be transposed to a one-row-per-subject structure.
Account
Manager Name1 Label1 cValue1 nValue1
Alfred _PCHI_ Chi-Square 1720.9937 1720.993729
Alfred DF_PCHI DF 11 11.000000
Alfred P_PCHI Pr > ChiSq <.0001 0
Alice _PCHI_ Chi-Square 1669.5182 1669.518205
Alice DF_PCHI DF 11 11.000000
Alice P_PCHI Pr > ChiSq <.0001 0
Barbara _PCHI_ Chi-Square 2205.1988 2205.198761
Barbara DF_PCHI DF 11 11.000000
Barbara P_PCHI Pr > ChiSq <.0001 0
You use the TRANSPOSE procedure to transpose it to keep the Chi2 value and the p-value for each account manager.
proc transpose data=Chi2_AccMgr out=Chi2_Results (drop=_name_ df_pchi);
by AccountManager;
var nValue1;
id name1;
run;
In order to improve the display you can sort the table by descending p-value.
proc sort data=Chi2_Results;
by descending _pchi_;
run;
And use a SAS datastep to format the output and the _N_ variable to create a rank variable.
data Chi2_AccMgr_Results;
format Rank 3.;
set Chi2_Results;
Rank = _N_;
rename _pchi_ = Chi2_Value
p_pchi = P_Value;
format p_pchi percent8.3
_pchi_ 8.1;
run;
You see the ranked output table below. As you ordered the table by descending Chi2 statistics (or p-value) you find those account managers at the top of the list which have the largest deviation from the predefined pattern. In order to illustrate this results for two account managers you find the individual line-chart compared with the overall line chart for Joyce (very high deviation) and for Jeffrey (very low deviation) below.
From a business point of view, this table allows you to see which account managers stick to the predefined targets and acquire resources according to the predefined pattern. And you also see which account managers are quite apart from the target pattern. You might want to investigate their individual sales pattern in more details.
In a machine learning context this derived feature can be used to describe the accordance of an analysis subject to a certain pattern. This feature can be used in unsupervised machine learning models to better segment your customer base or the members of your sales force. You can also use the feature in supervised machine learning models as input variable to potentially predict the outcome.
You can use the SQL procedure to join the HistoricDemand as created above to the monthly sales data.
proc sql;
create table AccMgr_XT
as select a.AccountManager,
a.month,
a.Sales_EUR as Sales_EUR_Obs format = 8.,
b.Count/11 as Sales_Eur_Exp format = 8.
from sales_month as a
left join HistoricDemand as b
on a.month = b.month
order by a.AccountManager, a.Month;
quit;
You use the SGPLOT procedure with a line-chart for the (individually) observed amount for a certain account manager and a line chart for the overall demand distribution. Note that PATTERN, COLOR and TRANSPARENCY options are used to finetune the line chart.
proc sgplot data=AccMgr_XT;
vline Month / response=Sales_EUR_Obs lineattrs=(thickness=3 ) transparency=0.2;
vline Month / response=Sales_Eur_Exp lineattrs=(thickness=3 pattern=2 color=grey ) transparency=0.5 ;
yaxis max=3000;
where AccountManager = 'Jeffrey';
run;
You see that account managers "Jeffrey" who has a very low Chi2-statistic also exhibits a monthly demand distribution which is very similar to the predefined pattern.
Whereas the demand curve for account manager "JOYCE" is quite different from the overall demand pattern. The chi2 statistic of ~2377 is almost three times as high as those of JEFFREY. This illustrates that using the chi2-statistic is a good indicator. However only reviewing the pattern shows you the individual difference. She has quite low values at the beginning of the year, but high values at the end of the year. It therefore makes sense to discuss with Joyce the rationale for her deviation. Maybe there is a good reason that she has a pattern like that.
proc sgplot data=AccMgr_XT;
vline Month / response=Sales_EUR_Obs lineattrs=(thickness=3 ) transparency=0.2;
vline Month / response=Sales_Eur_Exp lineattrs=(thickness=3 pattern=2 color=grey ) transparency=0.5 ;
yaxis max=3000;
where AccountManager = 'Joyce';
run;
You have seen that it is very easy in SAS to use the FREQ procedure to create a powerful feature that helps to describe the behavior of analysis subjects. This feature can also be used to rank the analysis subjects by the accordance with the predefined pattern and to prioritize analysis subjects, in our case account managers, for further investigation and discussion.
The article could also encourages you to to be creative, when it comes to using longitudinal data in your machine learning analyses. In this example the Chi2-statistic has been a good indicator whether an account manager is adhering to the predefined pattern.
Your data might require other forms of aggregations or calculations - Be creative! The analytic toolbox is very large and you can find the appropriate measure for your use case.
There is a webinar on the same topic in my Data Preparation for Data Science Playlist. It explains the case as a software demo.
There are more SAS Communities article which show examples for feature engineering:
This example has been taken from my SAS Press book Applying Data Science - Business Case Studies Using SAS see chapter 19 and chapter 20
Further books of the author in SAS Press:
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Data Literacy is for all, even absolute beginners. Jump on board with this free e-learning and boost your career prospects.