I have a table with 376k records. I am using the following PROC Tabulate to count the number of records per a particular variable. The following is the results of the log;
2682 PROC TABULATE DATA=PICreditHire missing;
2683 CLASS FIRST_PI_RESDATE;
2684 VAR CreditHireTotal;
2685 TABLE FIRST_PI_RESDATE,CreditHireTotal*(N='NbrOfClaims');
NOTE: There were 376842 observations read from the data set
However when I sum up the 'NbrOfClaims' field in the resulting output I only get 347k claims, despire the fact that the Tabulate has (correctly) read 376k claims from the source dataset. I've messed round with the MISSING options and eyeballed the source dataset and can't understand why Tabulate is missing some of the records. I've also tried using the PROC FREQ (below) and that has given me the correct results (376k records) as has taking the dataset into MS Access and running a count from there.
Any ideas why PROC TABULATE isn't picking up all the records in the count? I can't use PROC FREQ because I need to have some other analysis in the report (i.e. SUM) I've just reduced it to a Count on this post to make it easier to read.
The problem appears to be that the TABULATE is not including some of the values of 'CreditHireTotal' (Number, Length 8), which is why the PROC freq works as that is just a count against the FIRST_PI_RESDATE. From eyeballing I can't find any values that look strange and if I use CreditHireTotal as the Class variable and count against that I get a full count of the records (376k).
By default, when you cross an analysis variable with a statistic, missing values are ignored in the calculation of statistics. I'm guessing that you have several observations with missing values for the "CreditHireTotal" variable.
If you run the code below, you will see that a copy of SASHELP.CLASS has been altered so that the value the SEX variable for 2 people has been changed and 2 heights have been turned to missing. The MISSING option reveals the 2 missing obs for the SEX variable, but has no effect on the count when it's based on (crossed with) height. The TABULATE program shows the difference between asking for the simple N and asking for the N crossed with the analysis variable.
if name = 'Alfred' or name = 'Alice'
then sex = ' ';
else if _n_ gt 17 then height=.;
proc print data=class;
title 'note missing values';
proc freq data=class;
tables sex height / nocum nopercent;
PROC TABULATE DATA=class missing;
TABLE sex='Gender' all,
height*(N='NbrOfStudents') / box='Ex 1';
table sex='Gender' all,
n='NbrOfStudents' / box='Ex 2';