Re: Asking advice on typical solution (for a newbie)

Yu · Posted 03-18-2011 04:41 AM

I need to build a new repository to store data from oracle (our core system).

We target to do some analysis on the sales performance / segmentation study (regarding their sales habit on product, customer types, sales' demographic factors ...)

I assumed we will use the typical star schema which the dimension tables (products, customer, sales) point to the fact table we want to analysis (orders).

The total no of the orders are about 800000. And when it break down to details e.g. product level it includes about 4000000 records. In other words, each order contains about 5 products in average. And the order records will contains some detail info e.g. sales date, order date ...

Besides, each order details (in product level) will generate different sales amount per month
e.g order 1 has 5 products
in 2011 Jan
the record will be like
order_id product_id amount
1 1 10
1 2 5
1 3 8
1 4 10
1 5 5

in 2011 Feb
the record will be like
order_id product_id amount
1 1 8
1 2 5
1 3 0
1 4 0
1 5 12

and so on.

the first question is that it's reasonable to treat this sales table as the fact table? and whether it's correct to put it's details (not the meansures e.g. sale amount) to it (which most of these data are static and not change frequently) or treat it as dimension table?

the second question is how should I handle these dimension table (which not update frequency). should I add product month on each table and keep every set of data as if we need to analysis the historical data? Or we can keep the current copy and only store the data by production month in the OLAP cube levels?
More, for the fact table, should we keep it by production month too, the data volume seems not small to me i.e. 4000000 x 12 for one year.

More, I heard that SAS support the use of change date capture with oracle. Is it a way to handle the aforesaid questions?
Or, I should just use the materialized views for the dimension tables?

Since I am totally new to SAS and data warehouse, really hope anyone experienced can give advice to you.
Thx in advance.

TimStearn_SASProductManagement_ · Posted 03-18-2011 04:02 PM

Yu,

I can reply more completely later, but what you've described is something that fits a typical Star Schema structure very well. In addition to the dimensions you've listed, you'd also want a "month dimension", which would have one record for every month/year that will be covered by your fact data. That way, the grain of your fact table is order item X month/year. You'll most likely want to put the ORDER_ID from the source as a column in your fact table to allow items from the same order to be grouped. You generally do NOT have an "Order Dimension" though - just the key. This what Kimball refers to as a "degenerate dimension". Your users can easily summarized across months or subset by month/year, product, customer, etc with this kind of setup.

What is your target database (SAS datasets, SQL Server, Oracle, etc)? The total records per year is not too large by the standards of fact tables. That being said, if your database supports partitioning, it might be something to consider. If you give me more information on your target, I can give you more particular advice.

Thanks,

Tim Stearn

Yu · Posted 03-19-2011 03:58 AM

Hi Tim,

Thx a lot for your advice.

So, based on my understanding, I can build the fact table with sth like below?

prd_mth sale_date order_id item_id product_id client_id sales_id amount
201101 20101223 1 1 A 123 0000001 100
201101 20101223 1 2 A 123 0000001 50
201101 20101223 1 3 A 123 0000001 20
201102 20101223 1 1 A 123 0000001 80
201102 20101223 1 2 A 123 0000001 30

(sorry, I don't know how to put tab on the table)
and then I create a time-demnesion (hierarchy with YYYYQQMM) based on the sale_date? Should I still need the prd_mth? Or just keep the sale_date?

Since I have the order_id and item_id, I can create another dimension (degenerate dimension) which that is dimensional in nature but stored in a fact table?

And with this structure, I can build the olap cubes for further analysis as I
have
i) the product_id and other dimension table (product) or
ii) client_id with another dimension table (customer) or
iii) even sale_id with the dimension table (sales) ....

So, how should I store the dimension table? only keep the latest version?
or how can I build a olap cubes as if I need to analysis across two or more dimensions?
and if the users drill down the data, can they drill down to the lowest level?

Regarding the target database? Do u mean that where should I store the data (fact / dimension tables) in?
What's the difference between storing in Oracle / SAS datasets?
In my company, most of us are familiar to oracle. Is it a better choice? How about performance wise? or other factors to be consider? In our Oracle, it's a 10g r2 and it supports partitioning. So, we better use partitioning with year / month to keep the data? How about SAS datasets?

Sorry for so many questions. Thx in advance.

TimStearn_SASProductManagement_ · Posted 03-19-2011 09:25 AM

Yu,

No problem - there's a lot to consider. Have you read "The Data Warehouse Toolkit", by any chance? If you haven't, I'd highly recommend it since it addresses most of the type of questions you're asking as well as the implications of possible alternatives.

I misread your post - I shouldn't have recommended a "month dimension". You would of course use a Date Dimension that has one record for each day in your time period. This generally includes many descriptive columns like month, year, quarter, etc to support summarization. I've got several programs I've used to generate such tables in the past (you usually do this programmatically), so I'll dig one up and forward it along.

For your other questions:

I'm not sure what you mean about "prd_mth". In this situation, you generally only keep the sales date in the fact table. If there is a different date besides the sales date, please describe it and I can give further advice.
A "degenerate dimension" is not a separate table. This just means keep the actual ORDER_ID column as a column in the fact table. This because most other columns in the Order table are probably factored out into the fact or other dimension tables (order date is in the fact table, item price, cost in the fact table, item descriptive info in a "Product Dimension", etc). You still need the order ID to be able to group items of the same order, but since there's only one column, you don't create a separate table for it, but keep it as a column in the fact table.
How you store the dimension is determined on a dimension-by-dimension basis. While you could just store current value and overwrite, you often need to "version" your dimension records when specific columns change so that you have a history of these attribute values. This is referred to as "Slowly Changing Dimensions" (either Type 2 or Type 3) . This is an extensive topic and I'd refer you to the "Data Warehouse Toolkit" for specifics. DI Studio contains the Type 2 SCD Loader to assist you with maintaining this kind of model.
I was referring to where you store your dimension/fact tables, yes. Choice of database for a data mart is a HUGE decision and you should take the time to consider many factors. I can't do a complete discussion of this here, but there are many differences between SAS datasets and an RDBMS like Oracle, including security, partitioning, horizontal scalability, bulkloading, query optimization, indexing techniques, etc, etc. The list goes on and on. I'd recommend that you describe your application to a DBA and a SAS expert and get some of their opinions to assist you with your decision.
Cube design is another fairly extensive topic. I can tell you that, yes, you can use the Star Schema we've been discussing as the basis for one or more cubes. When designing a cube, however, you generally DO NOT want to put the lowest level of granualarity in the cube. Cubes are about summarizing data for quick (sub-second) analysis and drill down. If you do put low level granualarity in a cube (like dimension primary keys) you'll have bad performance during build and query. You want to understand the summarization levels your users require for analysis and they generally don't care for analysis purposes about individual records (1 among millions). When they do need to drill down to an individual record, there are other techniques to employ.

As you said, there's a lot to learn. I'd definitely recommend consulting some reference texts like "Data Warehouse Toolkit" and some experts in your organization to aid your decision.

Thanks,

Tim Stearn

Asking advice on typical solution (for a newbie)