01-22-2013 09:26 AM
I need your support and recommendations for the following case.
I'm currently involved in a project in which I must apply quality controls over the DWH models. I have the opportunity to choose between DataFlux and DI, but I'm not sure which tool will help me in the scenarios I want to apply.
Here are two of the scenarios I want to apply:
- Rule Validation
In this scenario we want to check some rules against the usage stream, draw a graph that describes the rule violation percentage over the whole stream, and send this graph by mail. For example, if the stream contains 100 records and 25 of them violate the rule, a graph should be built showing that there's 25% rule violation.
- Trend Analysis
In this scenario we want to check trend behavior over the last 10 days. If there's any spike in the trend, positive or negative, a graph should be built and sent by mail showing the last 10 days' data and the position of the spike.
Sorry for the long question but I really need your help.
01-22-2013 10:53 AM
Need some more info on your current situation.
Do you have a data model - and is it already being populated by ETL-jobs?
If yes, what tool is being used for that?
How do you plan to do the analysis - on the fly, or on already loaded data?
What kind of rules do you have?
DataFlux and DI are similar in some cases, but have their main focus on different things.
DI is used for ETL development, and can handle simple controls such as "present in lookup table", "value within a range", etc.
DataFlux has more sophisticated tools for matching strings that follow patterns, data quality management/workflow, etc., but may not be as strong as an ETL tool.
01-23-2013 05:18 AM
Yes, we have a data model and it's already being populated by IBM DataStage.
I plan to do the analysis on already loaded data, after the ETL process.
Regarding the rules, let us take an example. There's the usage stream, which contains the CDRs with their attributes, like charged amount, free amount, and local or roaming. One of the rules is to check whether any CDR has a free amount greater than the rated amount. If this rule is violated by 10% of records, a pie chart should be drawn showing that 10% of the total usage violates the free charge rule.
01-24-2013 06:23 AM
Seems odd to have one ETL tool for loading the data, and another for analyzing the already loaded data.
Nevertheless, this sounds like standard reporting, and I'm not sure that either DIS or DataFlux is the appropriate tool. Maybe Enterprise Guide could suit your needs?
01-24-2013 10:12 AM
I've only worked a little with DataFlux, but it focuses on finding and correcting data on the fly from sources, and that is at a record/column level.
It seems from your example that you are more interested in reporting at a summarized level. But please discuss this with a SAS sales representative in more detail; it's hard for me to give you a bulletproof recommendation from this sparse information.
As I said before, creating some reports in EG (which could be scheduled), or creating summary tables that you could do dynamic reporting on using standard BI tools (from the BI Server suite, for instance), might be sufficient for your needs.
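To illustrate the summary-table idea for the trend scenario: the sketch below, in Base SAS, assumes a hypothetical daily summary table daily_usage with columns usage_date and total_amount covering the last 10 days, and flags any day deviating more than two standard deviations from the mean as a spike. Table and column names are placeholders, not from your model.

```sas
/* Compute mean and standard deviation over the 10-day window */
proc means data=daily_usage noprint;
  var total_amount;
  output out=stats mean=avg std=sd;
run;

/* Flag days that deviate more than 2 standard deviations */
data spikes;
  if _n_ = 1 then set stats;
  set daily_usage;
  spike_flag = (abs(total_amount - avg) > 2*sd);
run;
```

A table like spikes could then feed a scheduled EG report or a graph that marks the spike position.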
01-24-2013 12:26 PM
I agree with Linus, it looks like reporting on the status of a DWH, and that would be done in a reporting tool. E.g. EG or Base SAS could work, but SQL could probably do most of that as well.
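For the free-charge rule from your example, a plain SQL check plus a standard SAS graph would be enough. This is only a sketch; the library, table and column names (dwh.usage_stream, free_amount, rated_amount) are hypothetical stand-ins for your CDR attributes.

```sas
/* Classify each CDR by whether it violates the rule */
proc sql;
  create table rule_check as
  select case when free_amount > rated_amount
              then 'Violation' else 'OK' end as rule_status
  from dwh.usage_stream;
quit;

/* Pie chart showing the violation percentage */
proc gchart data=rule_check;
  pie rule_status / percent=inside;
run;
quit;
```

The same PROC SQL step could run equally well inside a DIS job or an EG project, since both generate Base SAS code.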
01-24-2013 12:43 PM
As I see it, using DI is best for your case, as you can even create the reports after applying the rules. As Linus and Reeza recommended, Enterprise Guide is good, but your data is huge (more than 1,000,000,000 records), so with good techniques and DI features you can apply your rules.
01-25-2013 05:12 AM
If you mean the DI Studio client, then no.
But in DI Studio you can build routines/jobs that send e-mails, relying on the existing e-mail functionality within Base SAS.
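That Base SAS e-mail functionality boils down to the EMAIL access method. A minimal sketch, assuming your site has the EMAILSYS/EMAILHOST system options configured; the address and attachment path are placeholders:

```sas
/* Define an e-mail fileref with recipient, subject and attachment */
filename report email
  to='dwh-team@example.com'
  subject='Rule violation report'
  attach='/reports/violations.png';

/* Write the message body; sending happens when the file is closed */
data _null_;
  file report;
  put 'Please find the latest rule violation chart attached.';
run;
```

A step like this can be dropped into a DIS job (e.g. as user-written code) and scheduled with the rest of the flow.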
01-25-2013 05:15 AM
The number of records makes no difference. Both DI and EG use Base SAS programming. The difference is the functionality within the user interface, their intended use, and the type of integration with metadata.
01-29-2013 10:52 PM
To also add my 5 cents to your question:
DF is a great tool, but more in the area of data cleansing and standardisation (e.g. when dealing with addresses and names).
From what you describe, I would probably use EG for data exploration. I would use DIS when it comes to operationalising things - stuff you want to run on a regular basis in batch, with a scheduler, as part of a bigger workflow.
Both EG and DIS generate SAS code in the end. So you can easily start with EG for prototyping. Once you know exactly what you need, you rebuild it in DIS as a more stable, cleaner, performance-optimised version.