Data profiling is a great method for understanding new data. In both SAS 9 and Viya, there are GUI tools for evaluating data and creating basic data metrics such as missing values, uniqueness, minimum, maximum, average, frequencies and patterns.
When deploying things to production, running data profiling in code makes it simple to schedule a regularly running data profile. Before SAS Viya 3.5 there has not been an easy way to accomplish this. Enter PROC DATAMETRICS that makes it simple to run a profile on a full table or selected columns. See SAS documentation for PROC DATAMETRICS here:
PROC DATAMETRICS produces an output table with RowID column that references to the chart in the above link, so you will not find descriptions in the output table. What I did was import the descriptions to a SAS table PROC_DATAMETRICS_DICTIONARY as seen in the picture below and then joined it with PROC SQL with the metrics output table using RowID.
To save you the trouble of importing the descriptions, you can download at the end of this post 😉
I made a very simple example on SAS Viya 3.5 using the simplest sample data that is provided with SAS, the famous CLASS-dataset. I have found out it’s best not to use too many columns with PROC DATAMETRICS to keep it understandable. In my example I’m using only three columns: Name, Sex and Age.
Also, to group metrics per column, I’m sorting the output first by ColID, RowId and Count. This results in a bit easier to read output table.
Here is the full SAS code that I used:
CASLIB _ALL_ ASSIGN;
locale='ENUSA' def='Field Content';
VARIABLES Name Sex Age;
CREATE TABLE WORK.PROFILE_METRICS_JOINED AS
INNER JOIN CASUSER.PROC_DATAMETRICS_DICTIONARY t2 ON (t1.'ROWID'n = t2.'RowID'n)
Here is an example of the output data:
Note that in the picture the COLID is 2, so this refers to the second column, Age, in my source data. Frequencies for Age column are depicted here, but the full data contains all the other metrics for selected columns as well.
Have fun with PROC DATAMETRICS and please share if you find it useful with your own data! 😊
As promised, here’s the PROC DATAMETRICS definitions in csv-format which you can easily join with your output data using RowID as join key:
0;Number Of Columns
1;Number Of Rows
3;Columns In Data Source
1002;Count Of Values
1003;Name Of The Column Containing The Profiled Column Values
1004;Frequency Distribution (High)
1005;Frequency Distribution (Low)
1013;Number Of Blank Values
1016;Outlier High Values
1017;Outlier Low Values
1018;Pattern Frequency Distribution (High)
1019;Pattern Frequency Distribution (Low)
1021;Primary Key Candidate
1035;Determined Type %
1039;Number Of Decimal Places Specified In Format
1042;Pattern Unique Percent
1101;Metadata Allows Nulls
Registration is open! SAS is returning to Vegas for an AI and analytics experience like no other! Whether you're an executive, manager, end user or SAS partner, SAS Innovate is designed for everyone on your team. Register for just $495 by 12/31/2023.
If you are interested in speaking, there is still time to submit a session idea. More details are posted on the website.