Juletip #14 - PROC DATAMETRICS: Data profiling in SAS code

jarno · Posted 12-14-2020 01:53 AM

Data profiling is a great method for understanding new data. In both SAS 9 and Viya, there are GUI tools for evaluating data and creating basic data metrics such as missing values, uniqueness, minimum, maximum, average, frequencies and patterns.

When deploying things to production, running data profiling in code makes it simple to schedule a regularly running data profile. Before SAS Viya 3.5 there has not been an easy way to accomplish this. Enter PROC DATAMETRICS that makes it simple to run a profile on a full table or selected columns. See SAS documentation for PROC DATAMETRICS here:

PROC DATAMETRICS documentation

PROC DATAMETRICS produces an output table with RowID column that references to the chart in the above link, so you will not find descriptions in the output table. What I did was import the descriptions to a SAS table PROC_DATAMETRICS_DICTIONARY as seen in the picture below and then joined it with PROC SQL with the metrics output table using RowID.

To save you the trouble of importing the descriptions, you can download at the end of this post 😉

I made a very simple example on SAS Viya 3.5 using the simplest sample data that is provided with SAS, the famous CLASS-dataset. I have found out it’s best not to use too many columns with PROC DATAMETRICS to keep it understandable. In my example I’m using only three columns: Name, Sex and Age.

Also, to group metrics per column, I’m sorting the output first by ColID, RowId and Count. This results in a bit easier to read output table.

Here is the full SAS code that I used:

CAS;
CASLIB _ALL_ ASSIGN;

PROC DATAMETRICS 
	DATA=CASUSER.CLASS out=CASUSER.PROFILE_METRICS; 
	IDENTITIES qkb='/opt/sas/spre/home/share/refdata/qkb/ci/31' 
	locale='ENUSA' def='Field Content';
   	VARIABLES Name Sex Age;
RUN;

PROC SQL;
	CREATE TABLE WORK.PROFILE_METRICS_JOINED AS
		SELECT
			t1.COLID,
			t2.'RowID'n,
			t1.'COUNT'n,
			t2.Description,
			t1.CHARVALUE,
			t1.DOUBLEVALUE
		FROM
			CASUSER.PROFILE_METRICS t1
			INNER JOIN CASUSER.PROC_DATAMETRICS_DICTIONARY t2 ON (t1.'ROWID'n = t2.'RowID'n)
		ORDER BY
			t1.COLID,
			t2.'RowID'n,
			t1.'COUNT'n DESC;
QUIT;
RUN;

Here is an example of the output data:

Note that in the picture the COLID is 2, so this refers to the second column, Age, in my source data. Frequencies for Age column are depicted here, but the full data contains all the other metrics for selected columns as well.

Have fun with PROC DATAMETRICS and please share if you find it useful with your own data! 😊

As promised, here’s the PROC DATAMETRICS definitions in csv-format which you can easily join with your output data using RowID as join key:

RowID;Description
0;Number Of Columns
1;Number Of Rows
3;Columns In Data Source
4;Date Created
5;Date Modified
6;Encoding
7;Compressed
10;View
11;Lib
12;Source
999;Format Name
1000;Column Name
1001;Data Type
1002;Count Of Values
1003;Name Of The Column Containing The Profiled Column Values
1004;Frequency Distribution (High)
1005;Frequency Distribution (Low)
1006;Maximum Value
1007;Maximum Length
1008;Mean
1009;Median
1010;Minimum Value
1011;Minimum Length
1012;Mode
1013;Number Of Blank Values
1014;Null Count
1016;Outlier High Values
1017;Outlier Low Values
1018;Pattern Frequency Distribution (High)
1019;Pattern Frequency Distribution (Low)
1020;Unique Patterns
1021;Primary Key Candidate
1022;Standard Deviation
1023;Standard Error
1025;Unique Count
1028;Id Analysis
1029;Actual Bool
1030;Actual Date
1031;Actual Integer
1032;Actual Real
1033;Actual String
1034;Actual Designated
1035;Determined Type %
1037;Data Length
1038;Ordinal
1039;Number Of Decimal Places Specified In Format
1040;Null Percent
1041;Unique Percent
1042;Pattern Unique Percent
1043;Blank Percent
1100;Error Message
1101;Metadata Allows Nulls