BookmarkSubscribeRSS Feed
shridharmanvi
Calcite | Level 5

To monitor the health of a datawarehouse we have to consider a few aspects such as:

1. Daily Data load activities

2. Timeliness of the Data load

3. Errored out/dropped records

4. Completeness etc.

I sure am missing out on other parameters.

I would be thankful if anyone can suggest a few other parameters. 

6 REPLIES 6
TomKari
Onyx | Level 15

I would want to look at it from different perspectives:

1. Meeting requirements

Are my users able to get the information they need out of the warehouse?
Does the data in the warehouse need to be augmented to allow them to do their jobs better?
Do they find any operations difficult, that could be remedied with documentation or utility development?
When they have problems, is the resolution mechanism (help desk?) working efficiently?
How effective is the warehouse for different classes of users: developers, power users, regular users, infrequent users?

2. Quality

Are my ETL processes meeting required error rate thresholds?
If the error profile is changing, is the underlying data changing in a way that required modifying the ETL?
I'm not sure what you mean by completeness, but it probably falls here.

3. Performance

What is the response time for a set of retrievals that exercise the different operations and data required?

4. Capacity

How many people are using the warehouse?
How many concurrent active users are using the warehouse?
Can the warehouse handle projected growth at the required performance?

5. Throughput

Are my daily load/update processes running in the required window?
At projected growth rates, how soon will I exceed my window? Do I have any workarounds (parallel processing, etc.)?

6. Size

How big is my warehouse?
Do I have enough disk and memory for the current size?
At projected growth rates, how soon will I need to upgrade disk and/or memory?
Am I approaching any limits in either the software or hardware, that can't be expanded?

Tom

shridharmanvi
Calcite | Level 5

Hi Tom,

Thanks for your prompt answer. This approach will be very useful to us.

By Completeness i mean :

  We have data load activity happening everyday. The data loaded movs through different layers/stages before being usable by the end user. Completeness means if there are the exact same number of records in all my layers/stages.

for Ex: If layer1 has 1000 records, do layers 2 and 3 also have 1000 records?

After going through the approach you have mentioned, I have a small clarification.

If I have to report the health of my Datawarehouse on this day, how will I prioritize these above mentioned parameters and report on the same? i.e. if I have to construct a report which gives me the status of my datawarehouse, in what priority should I consider the parameters. Will Data Usability be the most weighted parameter?

Awaiting your response.

Shridhar Manvi

LinusH
Tourmaline | Level 20

"If layer1 has 1000 records, do layers 2 and 3 also have 1000 records?"

No, it all depends on how the ETL is designed according to business and IT rules/requirements. You need to understand those in order to analyse this part of your warehouse process.

Weight: this should be in the requirements for the monitoring process. If not, you should be able to gather these from responsible persons. Reducing cost is a common driver (which could include missed opportunities).

Data never sleeps
shridharmanvi
Calcite | Level 5

Hi LinusH,

Yes we have our ETL designed in such a way that even if there are changes involved in layer 2 layer 3, the records go to the next layer with a message saying 'information modified' or something of that sort. But I am pretty sure that all the records move to the next layer unless there is some other unknown issue.

I agree with you on the weights part. But I am not sure if we already have that in place. What would be the possible criterias to consider while giving a weight? I can think of a few like

1. Data usage(Is this data used?if yes by how many people?)

2. Criticality of the Data

It would be great if you can add a few more aspects to be considered.

TomKari
Onyx | Level 15

Hi, Shridharmanvi

Yes, your example of completeness does fall into what I consider the "quality" perspective. Clearly, you need to ensure that (in your example) each process outputs 1000 records. But you might also consider analysing the quantity and type of errors being reported, especially if your ETL processes are transforming data in an automated manner.

To give an example: I live in Ontario (Province) Canada, and my Postal Code is K2P 1S3. If a U.S. based system was processing my address data, it might transform my postal code to all numeric (ZIP code), and my ON province to OH. (Believe me, it's happened). Although "on paper" the system is functioning correctly, a large count of these transformations would indicate a systemic error, and should be investigated (because you're irritating all of your non-U.S. customers).

In terms of the weights for different metrics, that's a business question for your organization (which must involve the participation of the warehouse users). However, I have seen cases where organizations make a priority of measuring the easy ones (technical "facts" like data volumes, memory, disk...), and ignore the important ones, like is the system meeting the needs of users (which can be VERY difficult to determine).

I'm suggesting that the most important questions for you to answer are the ones that will get you fired if the executives don't like the answers! If the Assistant VP for Cost Compliance emails the VP Finance that "he can't find any of the stuff he needs in the warehouse, and why the Hell are we spending so much money on it", I would be VERY worried!

Best,

  Tom

shridharmanvi
Calcite | Level 5

Hi Tom,

  Thanks for your answers. They have been very informative.

Yes I got what you are trying to explain Tom. We will keep that in mind. Also, we are working on assigning weights to each metric.

Thanks again for your answer!!

sas-innovate-2024.png

Don't miss out on SAS Innovate - Register now for the FREE Livestream!

Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 6 replies
  • 693 views
  • 1 like
  • 3 in conversation