The structure of MISSING VALUES in your data - get a clearer picture with the %MV_PROFILING macro

6 Likes

Missing values are often profiled in a univariate way. This however only tells you how many variables are infected by the "missing value" disease to what extent. It does not give you insight on how many full records you have or if there is a structure of missing data.

However you should be interested to profile the missing value structure of the variables in your analysis data. This can give you insight whether there are relationships in the data that you should be aware of, before you start with the analysis.

This article introduces the %MV_PROFILING macro which allows you to profile missing values in your data in a more comprehensive way. You receive insight into the missing value pattern and potential relationships of the data. The %MV_PROFILING macros has been published with my "Data Quality for Analytics Using SAS" book in 2012. It also has been featured in my SAS Global Forum presentation "Want an Early Picture of the Data Quality Status of Your Analysis Data? SAS® Visual Analytics Shows ..." in 2015.

The output of the %MV_PROFILING macro shows you for example how many FULL RECORDS (records with any missing value in one of the variables), how many records with a missing value in one variable, or with missing values in more than 4 variables exist.

In this article you find usage examples and the interpretation of the output of the %MV_PROFILING macro. It also shows explains the macro parameters.

The idea of the %MV_PROFILING macro

The missing value profiling (%MV_PROFILING) macro defines a segmentation of the observations based on its missing value structures. For each variable, a dummy variable is created that indicates whether a value is missing or not. This set of indicator variables is then concatenated into a string. The following illustrates the definition of the profile chain.

Table10.1.PNG

Note that after the profile chain itself, the total number of missing values per observation is concatenated separated by a ‘_’. This approach shows the number of missing values for each profile and is very useful when larger numbers of variables are analyzed.

Using the %MV_PROFILING macro has many advantages:

It provides insight into the structure of missing values across variables.
It identifies the frequency and accumulation of missing value patterns.
It shows information efficiently. Instead of looking at multiple graphs and reports, an analyst can get a quick overview of the missing value structure.

Data to illustrate the %MV_PROFILING macro

Data from the HMEQ data set, which is frequently used in SAS Enterprise Miner or SAS VDMML examples and demos, illustrate the missing value profile chain macros. The data set is available together with the programs and macro in the attachment to this article..

The HMEQ data set contains histories of credit customers with different types of home equity loans. The variables shown in the following table were used to analyze the missing values.

Variable Name	Description
BAD	Defaulting or repaying the loan
DELINQ	Number of delinquent trade lines
DEROG	Number of major derogatory reports
NINC	Number of recent credit inquires
CLAGE	Age (in months) of the oldest trade line
CLNO	Number of trade lines
JOB	Current job, six categories

Simple reports based on the missing value profile chain

The following table shows a lookup table that displays the MV_PROFILE assigned to each variable. This identifies which digit of the missing value profile chain is used for which variables and eases navigation through the %MV_PROFILING outputs.

MV_PROFILE pattern for each variable

The order of the 0s and 1s in the %MV_PROFILE_CHAIN macro depends on the order of the variables names in the &VARS macro variable. If the &VARS parameter is omitted in the macro, all variables of the data set are used in alphabetical order for the %MV_PROFILING macro by default. The macro variable ORDER=POS can be used to force the order to match the order in the data set.

The following output shows the frequencies of the different patterns in the %MV_PROFILING macro.

Output of the %MV_PROFILING macro for the HMEQ data set

First, the output shows that there are 82.6% complete records without missing values. The next line shows that there are 174 observations (2.9%) that have a missing value for the third variable (in this case, variable DEROG). The next entries in the table reveal that in 2.5% of the cases three variables are missing together (namely, variables DELINQ, DEROG, and NINQ) (the second, third, and fourth variables) and in 2.2% of the cases variables DELING and DEROG are missing. This provides first-hand insight not only into the frequency but also into the structure of the missing values.

Displaying the pattern distribution in a TILE-chart

Tile chart for the frequency of the %MV_PROFILING macro

This information is also displayed as a tile chart. A tile chart splits a rectangle that represents the population into smaller rectangles, where the area of each rectangle is proportional to the frequency of the respective category. In this case, this gives a graphical representation of the proportion of the complete records and the records with missing values. Additionally, it can easily be seen that the top four patterns with missing values make up more than 50% of all missing value patterns.

The next two output examples show the distribution of the number of missing values per record and a ranked list of missing values per variable.

Distribution of the number of missing values per record

This table shows that there are 82.6% complete records and that in 6.4% (2.7+1.5+2.2) of the cases three or more variables have a missing value.

Ranked list of missing values per variable

Looking at this table, you can see that target variable BAD does not include any missing values. Combining this information in the MV_Profile chain above lets us conclude that in 2.2% (pattern 0111111_6) of the cases all input variables are missing.

Advanced reports based on the missing value profile chain

In the next step, the structure and patterns of the missing values are analyzed in more detail. In order to do this, a variable clustering and a principal component analysis are applied to binary indicator variables that were created earlier. The binary indicator variables are assigned a value of 1 if the respective variable has a missing value and 0 otherwise. For this analysis, all observations without missing values are excluded from the analysis because they do not provide any additional insight.

Note that from a statistical point of view, variable clustering and principal components analysis are not intended to be used with binary variables. Here the methods are however used just in an explanatory way to provide insight into potential relationships of the missing value structure.

Creating a TREE plot

The tree plot in the next graph displays the result of the variable clustering. Two main clusters with three variables each are produced. The DELINQ, NINQ, and DEROG variables form one cluster, indicating that missing values for these variables occur frequently together.

A preliminary indication of this fact has already been seen from the distribution of the %MV_PROFILING macro. The other cluster comprises the variables CLAGE, CLNO, and JOB. From a business point of view, this analysis is helpful because it identifies which missing value patterns occur together. Reasons for this pattern may be found in the data collection process or in the data availability in general, as well as in the business background of the data.Tree plot for missing values based on variable clusters

The component pattern plot

The component pattern plot shows a similar picture, where the closeness of the variables is derived from the first and second principal component of the missing value variable indicators.

Principal component plot based on missing value indicators

Note that the proportion of the Eigenvalues for each component is shown in brackets on the respective axis.

Usage examples

Consider the following usage examples for the %MV_PROFILING macro:

Create the same output as shown in the example outputs above:

%MV_Profiling (data=hmeq, vars = bad delinq derog ninq clage clno job);

Profile all variables from data set HMEQ:

%MV_Profiling (data=hmeq, vars = _all_)

Profile the variables JOB, REASON, DEBTINC, and VALUE from data set HMEQ:

%MV_Profiling (data=hmeq, vars = job reason debtinc value);

Sample the data for better performance to 10%:

%MV_Profiling (data=hmeq, vars = _all_, sample=0.1);

Using only variables starting with “D,” turn off creating the tile chart, the variable clustering, and the principal component analysis:

%MV_Profiling (data=hmeq,vars = d:,ods = NO, princomp=NO);

Usage information for the %MV_PROFILING macro

In order to profile the structure of missing values and to create these results, the SAS macro %MV_PROFILING has been programmed. The macro requires Base SAS, SAS/STAT, and SAS/GRAPH software. The full code for this macro is documented in appendix A.

Note the following:

In order to produce the principal component chart, ODS HTML has to be activated for the macro. You can do this either by
- submitting the statement ODS HTML; before the macro and the statement ODS HTML CLOSE; after the macro
- running the macro within SAS Enterprise Guide or SAS Studio
In order to produce the tile chart, the graphical device has to be set to ACTIVEX or JAVA in addition to ODS HTML. This can be done either by submitting GOPTIONS DEVICE = JAVA; or GOPTIONS DEVICE = ACTIVEX; in the SAS code or by specifying ActiveX or JAVA as graphic format in the SAS Enterprise Guide options.
The list of variables can also be quoted by a character followed by a colon “:”. This allows you to specify all variables that start with the same character(s). If no variables match the quoted string, the macro stops executing.

The following parameters can be specified with the %MV_PROFILING macro:

DATA

Name of the input data set.

VARS

List of variables in the data set to be analyzed. Variable names are separated by a blank. If _ALL_ is specified, all variables in the data set are used. The default value is _ALL_. A wildcard notation with a colon “:” can also be used (for example, DEMO: causes all variables starting with “DEMO” to be used).

ORDER

Only relevant when VARS=_ALL_ is specified. Possible values are POS or ALPHA. The default value is ALPHA. ALPHA means that the variables of the data set are used in alphabetical order for the %MV_VALUE_CHAIN macro. POS means that the variables are used in the same order as in the data set for the %MV_VALUE_CHAIN macro.

ODS

Specifying ODS = YES creates the tile chart, the VARCLUS tree plot, and the principal component analysis. Default = YES.

VARCLUS

Specifying VARCLUS = YES creates the VARCLUS tree plot. Default = YES.

PRINCOMP

Specifying PRINCOMP=YES creates the principal component analysis. Default = YES.

NCOMP

Defines the number of principal components that are used for the plots of components. NCOMP = 2 creates one plot for the first and second components. NCOMP = 3 creates three plots: one for the first and second components, one for the first and third components, and one for the second and third components, and so forth.

SAMPLE

Specifies a sample proportion to allow you to run missing value profiling on a sample rather than on the entire data set. Default = 1 (100%), no sampling. Values from 0 to 1 are valid.

SEED

Seed value that is used for sampling. Default = 18419.