BookmarkSubscribeRSS Feed
Barite | Level 11

Hello everyone!


I have a large dataset where I would like to do a couple of different types of analysis on the variables in it.  I have researched proc compare and proc freq but I am not sure if it is the most efficient solution for my needs.  I would like to run this code each night after a dataset has been updated to make sure there are no data issues that occurred and fix any that did.


I would like to:


  1. Write code that summarizes distinct values of values in all variables.  This would be a mix of char, numeric, and date variables.  I know about proc freq, or simply using distinct queries for each value, but is there a function or script that could do it all at once and return a table that includes all of the variables?
  2. I would also like to compare values between variables in the same dataset.  Using proc compare, it gives me each observation where there isn't a match of some sort.  The problem with that is there could be thousands of records that don't match.  Proc compare is going to limit how much can be printed.  I don't need to review by each observation, I just need to know a summary of each value that doesn't match from one variable to the other.

I feel like I am close with proc compare and proc freq, but I am just not quite where I am wanting to be.  I need a summarized report that I can dig into and write further queries to investigate any observations that look out of place.  I am hoping you guys have some suggestions for me.


Thank you!

Diamond | Level 26

1. PROC FREQ with the NLEVELS option sounds like what you want here, although I am skeptical this would be a valuable thing to do with continuous variables. If you use 


tables _all_;




tables _character_;


you get the information you asked for. But once you have this information, then what? What action would you take if you found out that the data set had 23 distinct levels of variable ABCD?


2. PROC COMPARE has me confused, you say "compare values between variables in the same dataset" and I'm just not grasping what you are trying to do here.

Paige Miller
Barite | Level 11

In regards to you reply on #1.  I would just take that summary data and write a query that looked for the observations for say... 'ABC' or like '%BC%' or something like that.  I just needs enough information to drill down.  Does that sound possible with this solution?


On #2...  I combine 2 different datasets into one each night.  The values in some variables should match.  For example... the variable "name" should match "client name" in the same dataset.  I would just need to know the number of observations that don't and possible the distinct values of the mismatches.  Again, enough information when I could write a query to drill down to the observation level to figure out which of them have the mismatches.  Does that make sense?

Diamond | Level 26

PROC FREQ with the NLEVELS option will give you the information you need to create queries like the examples you provided.


In #2, as long as you know which variables should be compared (NAME should be compared to CLIENT_NAME; and other variable pairs as well) then you can write code to do these specific comparisons, but I am not sure PROC COMPARE is the tool to do that (actually, I have used PROC COMPARE only in its most generic form, and I don't really know if you can tell it to compare specific pairs of variables). I would think this is a good case for writing a macro to do the comparisons between specific pairs of variables.

Paige Miller

You haven't actually stated what your business needs are for this analysis, but it appears that one of your aims is to check if your data conforms to certain business rules or not. If this is the case then I have found it useful to implement a "Report by Exception" regime. That means you have logic, like PROC COMPARE, PROC SQL or DATA step statements that check if the rule is obeyed or not and only report where the rules are broken. It's a lot easier to track exceptions rather than pick them out from a lot of correct rules. 

Super User


For your first request please consider this example with a data set you should have available:

proc freq data=sashelp.class;
ods output onewayfreqs=myfreqtable;

data maybe;
   set myfreqtable;
   singlelevel = cats(of f_:);
   keep var singlelevel frequency;

ODS output of the Onewayfreqs will place all of the variables and all of the counts into a single table, Myfreqtable for example. However for many purposes it may be very cumbersome, especially with many variables. Proc freq without a tables statement will use all variables by default.

The second data set I make above uses the Formatted values, that is what the F_ named variables are, to create a single variable that has the only formatted value from each row, and reduces the "table" variable that will have "Table Name" "Table Sex" and such to pull just the name of the variable.

If your date variable(s) do not have a date format assigned I strongly suggest you do so in the Proc Freq code to make the output usable as all of the values in the output are text.


Note: if you have values like unique identification variables or continuous values that don't repeat this gets to be a very long table very quickly.


For number 2 consider this example:

/* make a data set with a 
   second variable that should match
   most of the time but doesn't always
data badclass;
   set sashelp.class;
   /* copy values into new variable*/
   NewSex = Sex;
   /* change some randomly*/
   If rand('integer',3)=2 then do;
      if NewSex='F' Then NewSex='M';
      else NewSex='F';

proc compare data=badclass compare=badclass briefsummary;
   var sex;
   with newsex;

Since there is a random element as to which ones get changed I can't be sure your results would match mine but this is the output I get:

The COMPARE Procedure                                                                             
Comparison of WORK.BADCLASS with WORK.BADCLASS                                                    
NOTE: Values of the following 1 variables compare unequal: Sex^=NewSex                            

Value Comparison Results for Variables                                                            
           ||  Base Value           Compare Value                                                 
       Obs ||  Sex                   NewSex                                                       
 ________  ||  _                     _                                                            
        2  ||  F                     M                                                            
        3  ||  F                     M                                                            
        6  ||  M                     F                                                            
       10  ||  M                     F                                                            
       12  ||  F                     M                                                            
       14  ||  F                     M                                                            
       17  ||  M                     F                                                            
       18  ||  M                     F                                                            

The Briefsummary option considerably reduces the results. The VAR and WITH statements set which variables to compare.

Default behavior to compare the first Var variable with the first With variable, then the seconds, then third. If you want to compare one variable to multiple others repeat it on the Var statement. This example would compare the variable Phone in the Data set with the variables Phone1 and Phone2 in the Compare data set.

var phone phone;
with phone1 phone2;
Super User
/*It is a SAS/IML thing.*/
data have;
 set sashelp.class;
 if _n_ in (2:8) then do;name=sex;age=weight;end;

proc iml;
use have;
read all var _num_ into num[c=n_name];
read all var _char_ into char[c=c_name];
/*For the first question*/
print n_level[c=n_name l='the number of levels for each numeric variable'];
print c_level[c=c_name l='the number of levels for each character variable'];

/*For the second question*/
do i=1 to ncol(num)-1;
 do j=i+1 to ncol(num);
do i=1 to ncol(char)-1;
 do j=i+1 to ncol(char);

print var1 var2 n_mismatch[l='the number of obs are mismatched'];




Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.


Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 6 replies
  • 5 in conversation