Feature Engineering #2 – Accordence to Predefined Pattern

(This article is part of a growing collection of articles on feature engineering for data science. Find more links at the end of this article.)

Feature Engineering is an important tool for (supervised) machine learning. Model accuracy and interpretability benefits from variables that precisely describe the behavior of the analysis subjects. In many cases these features are derived from transactional data which are recorded, e.g. over time. In some cases, simple descriptive measures like the mean or the sum provide a very good picture of the analysis subjects. Often it is important to dig deeper to adequately describe the behavior. Analytic methods help to calculate KPIs that measure the trend over time per customer, the accordance with predefined pattern or the correlation of individual customer with the average customer.

Transposing the data is not enough! - Describe the behavior

Transposing longitudinal data to "WIDE" format also results in a table which has one record per analysis subset. Of course, SAS offers a large and powerful toolset to change between different data structures, e.g. by using the TRANSPOSE procedure, using the DATASTEP. Many SAS users have also programmed and published different macros for that purpose.

For powerful analytical models it is however not enough to "replicate" the original data just in another format. You want to describe the behavior of your analysis subjects. You can to this by calculating different analytic measures.

This article focuses on the "accordance of an analysis subject with predefined pattern” and shows the rationale, the coding and the interpretation of the results. You will see that it is very easy in SAS to use the FREQ procedure to create a powerful feature that helps to describe the behavior of analysis subjects. This feature can also be used to rank the analysis subjects by the accordance with the predefined pattern and to prioritize analysis subjects, in our case account managers, for further investigation and discussion.

There is also a link to a Youtube Recording with a presentation on this topic, which you can find below.

Using the historic seasonal demand pattern as a target for sales planning

Assume a company which is active in catering business. The company runs some own restaurants in different (tourism) regions of Austria, and it also performs catering at large events. In order to fulfill the demand, the company also has sub-contrators which provide staff, logistics, services, food, and drinks. The sales force of this company should make pre-contracts with companies to secure the capacities already early, to be able to fulfill the demand throughout the year.

The historic monthly demand can be aggregated in different ways in SAS. Here an example using the FREQ procedure is shown.

proc freq data=sales_month;
 table month / nocum out=HistoricDemand(rename=(percent=HistoricPct));
 weight Sales_EUR;
run;

The WEIGHT statement is used to aggregate the sales amount in euros. The TABLE statement uses month as a category to have monthly aggregates. The output statistics PERCENT is used to describe the relative distribution per calendar month and is renamed to "HistoricPct".

proc sgplot data=HistoricDemand;
 vline Month / response=HistoricPct lineattrs=(thickness=3 pattern=1 color=grey ) ;
 yaxis min=0 max=20;
run;

The SGPLOT procedure and the VLINE statement is used to draw a line chart for the historic percentages.

You see a remarkable historic pattern with a first peak in February. This can be due to the business in some skiing regions where February is a very busy month due to the winter school holidays in these months. The summer months from May till August also have a very high value, which might be due to many outdoor events and open air concert caterings. Finally December shows a peak due to Christmas parties and also skiing events in the Christmas holidays and New Years Eve.

Which account managers comply with the pre-set target pattern?

If you just plot the individual values against the pre-defined pattern you can hardly see who has a similar pattern. Using analytics measures like the Chi2-statistic allows you to order the account managers by their similarity with the predefined pattern.

You use the FREQ procedure to calculate the Chi2 statistic for each AccountManager.

ods exclude all;

proc freq data=sales_month ;
 by AccountManager;
 table month / nocum out=Sales_AccMgr chisq(testp=HistoricDemand(rename=(HistoricPct=_testp_)));
 weight Sales_EUR;
 ods output OneWayChiSq=Chi2_AccMgr(drop=table label cvalue);
run;

ods exclude none;

Use a BY statement to run the analysis by AccountManager.
Again you use variable MONTH as a categorical variable and SALES_EUR as the weight variable. However here you also specify the CHISQ option and use the HistoricDemand which has been derived and stored in a SAS dataset above as the test proportion with the TESTP option.
The FREQ procedures now calculates a Chi2 statistic for the deviation of the monthly distribution for each account manager from the pre-defined target distribution.
The output object OneWayChiSq contains the chi-square values and the p-values of the respective test. The output object is output to a dataset using the ODS OUTPUT statement.
Note that using the NOPRINT option in the PROC FREQ statement would avoid the output window from filling up, but also suppresses the creation of the output object at all. For this reason the ODS EXCLUDE statement has been used to turn off the create of output before the procedures runs and and turn it on again afterwards.

The output dataset has the following content and structure. Multiple lines are shown per account manager. So you either need to pick one statistic or it needs to be transposed to a one-row-per-subject structure.

Account
Manager          Name1     Label1        cValue1           nValue1

Alfred          _PCHI_     Chi-Square    1720.9937     1720.993729
Alfred          DF_PCHI    DF            11              11.000000
Alfred          P_PCHI     Pr > ChiSq    <.0001                  0
Alice           _PCHI_     Chi-Square    1669.5182     1669.518205
Alice           DF_PCHI    DF            11              11.000000
Alice           P_PCHI     Pr > ChiSq    <.0001                  0
Barbara         _PCHI_     Chi-Square    2205.1988     2205.198761
Barbara         DF_PCHI    DF            11              11.000000
Barbara         P_PCHI     Pr > ChiSq    <.0001                  0

You use the TRANSPOSE procedure to transpose it to keep the Chi2 value and the p-value for each account manager.


proc transpose data=Chi2_AccMgr out=Chi2_Results (drop=_name_ df_pchi); 
 by AccountManager;
 var nValue1;
 id name1;
run;

In order to improve the display you can sort the table by descending p-value.


proc sort data=Chi2_Results;
 by descending _pchi_;
run;

And use a SAS datastep to format the output and the _N_ variable to create a rank variable.


data Chi2_AccMgr_Results;
 format Rank 3.;
 set Chi2_Results;
 Rank = _N_;
 rename _pchi_ = Chi2_Value
        p_pchi = P_Value;
 format p_pchi percent8.3
        _pchi_ 8.1;
run;

Interpreting the output table

You see the ranked output table below. As you ordered the table by descending Chi2 statistics (or p-value) you find those account managers at the top of the list which have the largest deviation from the predefined pattern. In order to illustrate this results for two account managers you find the individual line-chart compared with the overall line chart for Joyce (very high deviation) and for Jeffrey (very low deviation) below.

From a business point of view, this table allows you to see which account managers stick to the predefined targets and acquire resources according to the predefined pattern. And you also see which account managers are quite apart from the target pattern. You might want to investigate their individual sales pattern in more details.

Maybe they have specific circumstances why they close their contracts in a certain way.
Maybe importance of the predefined target pattern has not been made clear to them.
The feature that has been derived just by using the FREQ procedure and a pre-aggregated target pattern can be used directly to create a sorted list and to see where to start such conversations first.

In a machine learning context this derived feature can be used to describe the accordance of an analysis subject to a certain pattern. This feature can be used in unsupervised machine learning models to better segment your customer base or the members of your sales force. You can also use the feature in supervised machine learning models as input variable to potentially predict the outcome.

Reviewing individual line charts

Joining the overall pattern to the individual demand values

You can use the SQL procedure to join the HistoricDemand as created above to the monthly sales data.


proc sql;
 create table AccMgr_XT
 as select a.AccountManager,
           a.month,
		   a.Sales_EUR as Sales_EUR_Obs format = 8.,
		   b.Count/11 as Sales_Eur_Exp format = 8.
    from sales_month as a
    left join HistoricDemand as b
	on a.month = b.month
    order by a.AccountManager, a.Month;
quit;

Review the demand curve of an account managers with low deviation (Jeffrey)

You use the SGPLOT procedure with a line-chart for the (individually) observed amount for a certain account manager and a line chart for the overall demand distribution. Note that PATTERN, COLOR and TRANSPARENCY options are used to finetune the line chart.



proc sgplot data=AccMgr_XT;
 vline Month / response=Sales_EUR_Obs lineattrs=(thickness=3 ) transparency=0.2;
 vline Month / response=Sales_Eur_Exp lineattrs=(thickness=3 pattern=2 color=grey  ) transparency=0.5 ;
 yaxis max=3000;
 where AccountManager = 'Jeffrey';
run;

You see that account managers "Jeffrey" who has a very low Chi2-statistic also exhibits a monthly demand distribution which is very similar to the predefined pattern.

Review the demand curve of an account managers with low deviation (JOYCE)

Whereas the demand curve for account manager "JOYCE" is quite different from the overall demand pattern. The chi2 statistic of ~2377 is almost three times as high as those of JEFFREY. This illustrates that using the chi2-statistic is a good indicator. However only reviewing the pattern shows you the individual difference. She has quite low values at the beginning of the year, but high values at the end of the year. It therefore makes sense to discuss with Joyce the rationale for her deviation. Maybe there is a good reason that she has a pattern like that.


proc sgplot data=AccMgr_XT;
 vline Month / response=Sales_EUR_Obs lineattrs=(thickness=3 ) transparency=0.2;
 vline Month / response=Sales_Eur_Exp lineattrs=(thickness=3 pattern=2 color=grey  ) transparency=0.5 ;
 yaxis max=3000;
 where AccountManager = 'Joyce';
run;

Conclusion

You have seen that it is very easy in SAS to use the FREQ procedure to create a powerful feature that helps to describe the behavior of analysis subjects. This feature can also be used to rank the analysis subjects by the accordance with the predefined pattern and to prioritize analysis subjects, in our case account managers, for further investigation and discussion.

Be creative!

The article could also encourages you to to be creative, when it comes to using longitudinal data in your machine learning analyses. In this example the Chi2-statistic has been a good indicator whether an account manager is adhering to the predefined pattern.

Your data might require other forms of aggregations or calculations - Be creative! The analytic toolbox is very large and you can find the appropriate measure for your use case.

Youtube Webinar

There is a webinar on the same topic in my Data Preparation for Data Science Playlist. It explains the case as a software demo.

Links and References

There are more SAS Communities article which show examples for feature engineering:

This example has been taken from my SAS Press book Applying Data Science - Business Case Studies Using SAS see chapter 19 and chapter 20

Further books of the author in SAS Press:

"Ask-the-expert sessions" on Data Preparation for Data Science (in German):
- https://www.sas.com/de_at/webinars/ask-the-expert-serie-data-preparation-fuer-data-science-teil1.htm...
- https://www.sas.com/de_at/webinars/ask-the-expert-serie-data-preparation-fuer-data-science-teil2.htm...
Slide Collection: Slides Deck #121 and #122 at https://github.com/gerhard1050/DataScience-Presentations-By-Gerhard