Time dimension: Star Schema or Snowflake?

4 replies [Last post]
respinosamilla's picture
Offline
Joined: 17/01/2010
Puntos: 234

 

Hello everyone:

I am preparing the design of a DW for use with Pentaho, and to revise the definition of the time dimension I have some doubts. I explain:

DW is a sales analysis. In the model I have two fact tables, one for sales, whose granularity is at the level of day, customer, product, etc, and a fact table which records the information in the sales forecast (this table has a level of different granularity, where a month and level of client channel (which is one of the attributes of the customer dimension)). The keys to the fact table are not therefore the keys to the dimension table, but a component within a dimension.

 

 

 

Initial Logical Model

Given this, I rather doubt arise about the best design choice to build the database:

1) Having made tables on a different level of granularity, is it necessary or advisable to spend the dimensions involved in a star schema to work properly after the consultation?.

2) What may be the case that aspects of design such as this are conditioned by the tool we will use later?. For example, if I will use Pentaho, or Microstrategy, etc, can be the case that using a tool or another is determining how to model in certain cases?. I have read in a blog for example, Microstrategy recommends switching to a snowflake schema (including the examples provided with the platform, the Analytic Modules, are built that way).

3) Could we generalize that a simple scheme to use a scheme smoothly total star, but at the time that complicate the model with more fact tables and different granularity is better to spend a snowflake?. "To follow the Kimball methodology in terms of the scale formed is also necessary to do that, and so we can address the complex models with multiple fact tables?.

I hope I throw a cable, especially those who have been visiting the construction of DW. It is an issue for me is a bit confusing (besides the examples provided in star schemas are used to illustrate simple, then do not correspond with reality more complex).

 

 

 

 

 

 

respinosamilla's picture
Offline
Joined: 17/01/2010
Puntos: 234

Rolling even more history, look like Microstrategy implements the time dimension in Analytic Modules, as I said before:

Not exactly a snowflake and the date we reach the month, quarter or year (it is a cyclic graph).

In the Blog The Mine Digital is an interesting entry about the dilemma. The funny thing is that we clearly no, but gives us some premises may be useful for the decision.

If you like you can contact me at email and I can gladly help you with questions you have.

Then if you like you can comment.

Regards

 

respinosamilla's picture
Offline
Joined: 17/01/2010
Puntos: 234

After thinking a lot about, reread in reference books and find information, I think I have understood and a solution would be:

Time dimension would have the following structure, including all the usual attributes of a time dimension:

 

As we will have a fact table at the day (level of granularity) and two fact tables at the month (different level), need to have a dimension table for the month. To solve this problem, we have two options:

  • Construction of a scale derived or roll-up: create a subset of the original dimension, taking us all the attributes that affect the level of granularity or higher levels (for example, for the month, we get also the quarter and year). It is a strict subset of the time dimension. Would look like:

 

Dimension for the month resulting in the time dimension

  • Normalize the size and pass it to snowflake: this would be another option. In the picture below, you can see how the scheme would be the case we would have decided to use this solution. As you can see, we have created a dimension for the month (the hospital is needed for monthly fact tables) and moreover we have also brought out the quarters and years (for the case that in the future have other types of analysis that required).

Logic Model - Dimension Standard Time

For our case, we are left with option 1, which is recommended by Kimball. As we see, we've created a set or subset of the time dimension, and all the attributes with the same names in both the original scale as derived dimension. With this philosophy we can build conformed dimensions that help us to implement the concept of data warehouse Bus. The technique can be used for any case we have a different level of granularity. Could also be used in case we have a datamart where we get the dimension with the same structure of attributes but limit the number of values (imagine that we have several channels of sale and a specific analysis, only the scale we get values derived from a customer of these channels).

Dimension Concept Roll-up or derived

For the construction of the new dimension, we may use a standard view or a materialized view.

I think this answered the questions raised. It is not necessary to pass the standard model when the problem of boards made with different granularity, and also the described method we can use whenever a similar scenario arises.

 

Offline
Joined: 07/03/2011
Puntos: 1

Thanks for sharing this post. This is a very helpful and informative material. Good post and keep it up. That’s cool stuff, anyways, a good way to get started to renovate your dreams into the world of reality. Adidas Adizero Adios

Offline
Joined: 13/03/2011
Puntos: 1

It is actually really great to hear that you've actually started on the design of a DW for use with Pentaho. Which actually helps us to revise the definition of the time dimension. This could actually prove to be one of the greatest innovatie achivements in the coustumer care and marketing department. I hope you succeed on what you are up to. Great wishes, hope you bring in great positive changes.

 

Robert Goulet

Google
 
     

Similar entries

  • Areas de datos para la carga de un Data WarehouseWhen designing the architecture of a data warehouse system we have to raise different environments for which data must pass en route to its Data mart or target cube. Given the amount of changes that have to do, and normally the DWH, besides fulfilling its function of support analytical
    requirements, performs a data integration function that will form the Corporate store and they will have to be also consulted the traditional way by the operational systems, it is recommended to create different data areas in the path between source systems and OLAP tools.

    Each of these areas as characterized by the functions performed, how data is organized in the same, and what type of need can service. The area that is 'at the end of the road' is important, but it will not be the only store data that will exploit the reporting tools...

  • Slowly changing dimensions or SCD (Slowly Changing Dimensions) are dimensions in which their data tend to change over time, either occasional or constant, or involving a single record or the entire table. When these changes occur, you can opt to follow one of these two great options:

  •  

    In MySQL, when you delete records from a table, space is reallocated automatically. It is as empty space and forming new attachments will advantage.

    The problem is that if a table perform many DELETE operations, the physical space of the table will become increasingly fragmented and the performance is reduced.

    In the MyISAM and InnoDB, OPTIMIZE TABLE command available to perform an optimization on any table that, among other things, performs an automatic defragmentation of the table.

    It is highly recommended to use this command regularly especially on tables that are more statements of disposal of records.

    As a precaution, keep in mind that during implementation, of course, the table is blocked. You have to remember when you are going to use with large tables and busy.

    Supersimples The syntax is:

     

    OPTIMIZE [NO_WRITE_TO_BINLOG | LOCAL] TABLE mi_tabla1 [, mi_tabla2] ...;
    

    To make a fragmented table optimization can be selected to have free space, probably as a result of DELETE statements:

  • Cuadros de mando de Oracle AnswersOracle, in addition to their famous manager database has a large catalog of software products, many developed internally and acquired through the purchase of other companies.  Within the environment of business intelligence that fact is even more remarkable as just more of Oracle's recent acquisitions were purchases of BI software vendors.
     
    The result is that although each tool has some features and a more appropriate market, there are many overlaps, and costs a little lie when choosing what software or set of tools we could use for our business intelligence project...
  • Datawarehouse loadings are periodic, and aggregate tables are useful to improve efficiency and the response time of our reports, so a physical optimisation resource that can bring big improvements in our system is the utilization of materialized views.

    The materialized view is not more that a view, defined with one SQL sentence, of which in addition to storing his definition, store the data that returns, realizing an initial load and after a concrete time a refresh of the same.

    Like this, if we have a Datawarehouse that updates daily, could use materialized views to go updating intermediate tables that feed our DWH diagrams, or directly to implement aggregate tables that will refresh from our base tables. 

    The creation of this type of views aren't so complex as it can seem, the most important is to have clear the periodicity of the information update, and which method of refresh we will use.

    Also will have to ensure us that our licence of database allows us use them (has to be a version Enterprise).