Why do descriptive statistics differ across SAS Viya applications?


The Backstory

This question came up in a recent conversation with academic colleagues — but the lesson applies far beyond a single dataset or classroom.

The original question was simple:

Why does the MEGACORP2020 dataset produce different descriptive statistics in SAS Information Catalog and SAS Visual Analytics?

More broadly: why can descriptive statistics differ across SAS Viya applications, even when the underlying data is the same?

Some More Pieces

We'll examine the profit variable. The mean is the same in both applications, but the minimum, maximum, and skewness are different. The proof:

[Screenshot: profit statistics in SAS Information Catalog]

[Screenshot: profit statistics in SAS Visual Analytics]

Unpacking the Differences

My response to the professor, written with the help of my good friend ChatGPT:

Even when two SAS Viya applications point to the same dataset, they may compute descriptive statistics differently because they are designed for different analytical purposes.

Same mean ≠ same computation

In our example, both SAS Information Catalog and SAS Visual Analytics report the same mean, but different values for minimum, maximum, and skewness. This does not indicate an error.

Instead, it reflects how the statistics are computed and which rows they are computed on.

Visual Analytics: full-data analytical computation

In SAS Visual Analytics, descriptive statistics are typically computed from the entire dataset (or from a clearly defined filtered query).

Conceptually, this is similar to running:

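/* Full-table summary of profit: count, mean, extremes, and skewness */
proc means data=MEGACORP2020 n mean min max skewness;
  var profit;
run;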

or

/* Full distribution detail for profit, including moments and extreme observations */
proc univariate data=MEGACORP2020;
  var profit;
run;

Key characteristics:

Statistics are calculated on all qualifying rows
Results reflect the true min, max, and distribution shape
Suitable for reporting, modeling, and decision-making

Information Catalog: fast profiling using sampling or approximation

SAS Information Catalog serves a different role: data discovery and metadata profiling. Its goal is to quickly help users understand large datasets across many columns.

To remain performant, especially on large tables, Information Catalog may:

Use sampling or approximate profiling for some column statistics
Apply different strategies depending on the statistic and column
Prioritize responsiveness over exact distributional precision

This is why:

The mean often matches (it is relatively stable under sampling)
Min, max, and skewness may differ (they are highly sensitive to outliers and tail values)
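
One way to see this effect for yourself is to draw a sample and compare it with the full table. The sketch below is only an illustration: the 10% simple random sample, the seed, and the work.profit_sample table name are assumptions made for this example, not a description of how SAS Information Catalog actually samples.

```sas
/* Sketch only: compare full-table statistics with a simple random sample.  */
/* Assumptions: MEGACORP2020 is readable from this session, the column is   */
/* named profit, and a 10% rate stands in for whatever a profiler may use.  */
proc surveyselect data=MEGACORP2020 out=work.profit_sample
                  method=srs samprate=0.10 seed=12345;
run;

title "Full table";
proc means data=MEGACORP2020 n mean min max skewness;
  var profit;
run;

title "10% simple random sample";
proc means data=work.profit_sample n mean min max skewness;
  var profit;
run;
title;
```

If the column has a few extreme values, the two means should usually land close together, while the minimum, maximum, and skewness move depending on which tail rows the sample happens to include. Changing the seed and rerunning shows how much they move from sample to sample.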

In other words, SAS Information Catalog answers:

“What does this column generally look like?”

Visual Analytics answers:

“What are the exact statistics for this analysis?”

Why skewness is especially affected

Skewness depends on:

Higher-order moments
Tail behavior
Extreme values

A sample that misses even a few rare extreme values can materially change skewness while leaving the mean largely unchanged. This makes a skewness mismatch a strong signal that different row sets were used in the computation.
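
For reference, this is the sample skewness that PROC MEANS and PROC UNIVARIATE report under their default VARDEF=DF setting. Whether each SAS Viya application uses exactly this estimator is not confirmed here, so treat it as an illustration of why the tails matter rather than as a statement about either tool:

```latex
$$
\text{skewness} \;=\; \frac{n}{(n-1)(n-2)} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s} \right)^{3}
$$
```

Here n is the number of nonmissing values, x̄ is the sample mean, and s is the sample standard deviation. Because each standardized deviation is cubed, an observation 10 standard deviations from the mean adds 10^3 = 1000 to the sum, while an observation one standard deviation away adds only 1. Dropping a single such row barely moves the mean but can change the skewness substantially.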

Key takeaway for students (and analysts)

Identical datasets do not guarantee identical statistics unless the computational context is the same.

Before trusting or comparing summary statistics, always ask:

Was the full dataset used?
Was sampling applied?
Is the tool optimized for exploration or for analysis?
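
A quick first check on the "full dataset" question is to count the rows yourself and compare that number with any row count or N the application reports. This is a minimal sketch that again assumes the column is named profit. Whether a given application displays the number of rows it profiled is something to verify in that application.

```sas
/* Sketch only: count all rows and the rows with a nonmissing profit value. */
proc sql;
  select count(*)      as total_rows,
         count(profit) as nonmissing_profit
  from MEGACORP2020;
quit;
```

If an application reports statistics based on fewer rows than total_rows, it is working from a sample or a filter, and exact agreement on the minimum, maximum, and skewness should not be expected.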

Understanding why numbers differ is often just as important as the numbers themselves.

LGroves · 02-12-2026

Assist from my colleague, Cristina Anton, @antonbcristina, is in! She shared that sampling is the DEFAULT in SAS Information Catalog, but that the default can be changed:

SAS Help Center: Analysis Options

So, if you want to run the full data set, you can update the setting here:

[Screenshot: SAS Information Catalog setting adjustment]

Thanks Cristina!