Missing values are often profiled in a univariate way. This however only tells you how many variables are infected by the "missing value" disease to what extent. It does not give you insight on how many full records you have or if there is a structure of missing data.
However you should be interested to profile the missing value structure of the variables in your analysis data. This can give you insight whether there are relationships in the data that you should be aware of, before you start with the analysis.
This article introduces the %MV_PROFILING macro which allows you to profile missing values in your data in a more comprehensive way. You receive insight into the missing value pattern and potential relationships of the data. The %MV_PROFILING macros has been published with my "Data Quality for Analytics Using SAS" book in 2012. It also has been featured in my SAS Global Forum presentation "Want an Early Picture of the Data Quality Status of Your Analysis Data? SAS® Visual Analytics Shows ..." in 2015.
The output of the %MV_PROFILING macro shows you for example how many FULL RECORDS (records with any missing value in one of the variables), how many records with a missing value in one variable, or with missing values in more than 4 variables exist.
In this article you find usage examples and the interpretation of the output of the %MV_PROFILING macro. It also shows explains the macro parameters.
The missing value profiling (%MV_PROFILING) macro defines a segmentation of the observations based on its missing value structures. For each variable, a dummy variable is created that indicates whether a value is missing or not. This set of indicator variables is then concatenated into a string. The following illustrates the definition of the profile chain.
Note that after the profile chain itself, the total number of missing values per observation is concatenated separated by a ‘_’. This approach shows the number of missing values for each profile and is very useful when larger numbers of variables are analyzed.
Using the %MV_PROFILING macro has many advantages:
Data from the HMEQ data set, which is frequently used in SAS Enterprise Miner or SAS VDMML examples and demos, illustrate the missing value profile chain macros. The data set is available together with the programs and macro in the attachment to this article..
The HMEQ data set contains histories of credit customers with different types of home equity loans. The variables shown in the following table were used to analyze the missing values.
Variable Name |
Description |
BAD |
Defaulting or repaying the loan |
DELINQ |
Number of delinquent trade lines |
DEROG |
Number of major derogatory reports |
NINC |
Number of recent credit inquires |
CLAGE |
Age (in months) of the oldest trade line |
CLNO |
Number of trade lines |
JOB |
Current job, six categories |
The following table shows a lookup table that displays the MV_PROFILE assigned to each variable. This identifies which digit of the missing value profile chain is used for which variables and eases navigation through the %MV_PROFILING outputs.
The order of the 0s and 1s in the %MV_PROFILE_CHAIN macro depends on the order of the variables names in the &VARS macro variable. If the &VARS parameter is omitted in the macro, all variables of the data set are used in alphabetical order for the %MV_PROFILING macro by default. The macro variable ORDER=POS can be used to force the order to match the order in the data set.
The following output shows the frequencies of the different patterns in the %MV_PROFILING macro.
First, the output shows that there are 82.6% complete records without missing values. The next line shows that there are 174 observations (2.9%) that have a missing value for the third variable (in this case, variable DEROG). The next entries in the table reveal that in 2.5% of the cases three variables are missing together (namely, variables DELINQ, DEROG, and NINQ) (the second, third, and fourth variables) and in 2.2% of the cases variables DELING and DEROG are missing. This provides first-hand insight not only into the frequency but also into the structure of the missing values.
This information is also displayed as a tile chart. A tile chart splits a rectangle that represents the population into smaller rectangles, where the area of each rectangle is proportional to the frequency of the respective category. In this case, this gives a graphical representation of the proportion of the complete records and the records with missing values. Additionally, it can easily be seen that the top four patterns with missing values make up more than 50% of all missing value patterns.
The next two output examples show the distribution of the number of missing values per record and a ranked list of missing values per variable.
This table shows that there are 82.6% complete records and that in 6.4% (2.7+1.5+2.2) of the cases three or more variables have a missing value.
Looking at this table, you can see that target variable BAD does not include any missing values. Combining this information in the MV_Profile chain above lets us conclude that in 2.2% (pattern 0111111_6) of the cases all input variables are missing.
In the next step, the structure and patterns of the missing values are analyzed in more detail. In order to do this, a variable clustering and a principal component analysis are applied to binary indicator variables that were created earlier. The binary indicator variables are assigned a value of 1 if the respective variable has a missing value and 0 otherwise. For this analysis, all observations without missing values are excluded from the analysis because they do not provide any additional insight.
Note that from a statistical point of view, variable clustering and principal components analysis are not intended to be used with binary variables. Here the methods are however used just in an explanatory way to provide insight into potential relationships of the missing value structure.
The tree plot in the next graph displays the result of the variable clustering. Two main clusters with three variables each are produced. The DELINQ, NINQ, and DEROG variables form one cluster, indicating that missing values for these variables occur frequently together.
A preliminary indication of this fact has already been seen from the distribution of the %MV_PROFILING macro. The other cluster comprises the variables CLAGE, CLNO, and JOB. From a business point of view, this analysis is helpful because it identifies which missing value patterns occur together. Reasons for this pattern may be found in the data collection process or in the data availability in general, as well as in the business background of the data.
The component pattern plot shows a similar picture, where the closeness of the variables is derived from the first and second principal component of the missing value variable indicators.
Note that the proportion of the Eigenvalues for each component is shown in brackets on the respective axis.
Consider the following usage examples for the %MV_PROFILING macro:
Create the same output as shown in the example outputs above:
%MV_Profiling (data=hmeq, vars = bad delinq derog ninq clage clno job);
Profile all variables from data set HMEQ:
%MV_Profiling (data=hmeq, vars = _all_)
Profile the variables JOB, REASON, DEBTINC, and VALUE from data set HMEQ:
%MV_Profiling (data=hmeq, vars = job reason debtinc value);
Sample the data for better performance to 10%:
%MV_Profiling (data=hmeq, vars = _all_, sample=0.1);
Using only variables starting with “D,” turn off creating the tile chart, the variable clustering, and the principal component analysis:
%MV_Profiling (data=hmeq,vars = d:,ods = NO, princomp=NO);
In order to profile the structure of missing values and to create these results, the SAS macro %MV_PROFILING has been programmed. The macro requires Base SAS, SAS/STAT, and SAS/GRAPH software. The full code for this macro is documented in appendix A.
Note the following:
The following parameters can be specified with the %MV_PROFILING macro:
DATA
Name of the input data set.
VARS
List of variables in the data set to be analyzed. Variable names are separated by a blank. If _ALL_ is specified, all variables in the data set are used. The default value is _ALL_. A wildcard notation with a colon “:” can also be used (for example, DEMO: causes all variables starting with “DEMO” to be used).
ORDER
Only relevant when VARS=_ALL_ is specified. Possible values are POS or ALPHA. The default value is ALPHA. ALPHA means that the variables of the data set are used in alphabetical order for the %MV_VALUE_CHAIN macro. POS means that the variables are used in the same order as in the data set for the %MV_VALUE_CHAIN macro.
ODS
Specifying ODS = YES creates the tile chart, the VARCLUS tree plot, and the principal component analysis. Default = YES.
VARCLUS
Specifying VARCLUS = YES creates the VARCLUS tree plot. Default = YES.
PRINCOMP
Specifying PRINCOMP=YES creates the principal component analysis. Default = YES.
NCOMP
Defines the number of principal components that are used for the plots of components. NCOMP = 2 creates one plot for the first and second components. NCOMP = 3 creates three plots: one for the first and second components, one for the first and third components, and one for the second and third components, and so forth.
SAMPLE
Specifies a sample proportion to allow you to run missing value profiling on a sample rather than on the entire data set. Default = 1 (100%), no sampling. Values from 0 to 1 are valid.
SEED
Seed value that is used for sampling. Default = 18419.
Related Articles
Feedback on the macro is welcome and can be sent to mailto:sastools.by.gerhard@gmx.net.
This macro has been taken from my SAS Press Book "Data Quality for Analytics Using SAS".
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Data Literacy is for all, even absolute beginners. Jump on board with this free e-learning and boost your career prospects.