Fri 28 Apr 2006
Sales Data Mart – Dimensional Model for Retail
Posted by Nishith under Data Warehousing , DecisionStudio-Professional , Open Source Analytics[31] Comments
If you have followed some of the earlier posts, you would remember that a data mart is created as a star schema through a process known as dimensional modeling. In this post we will create a dimensional model for Sales data mart at a hypothetical retailer.
Note: To go through the examples here, you would need MySQL database and DB Designer for database modeling. You can either install DecisionStudio Professional (download from sourceforge), which has both MySQL and DB Designer along with many other analytics goodies, or else you can install them individually from MySQL website and DB Designer website. You would also need to download the sample foodmart database available along with DecisionStudio Professional.
Now let’s assume you are an IT person at FoodMart (a hypothetical retailer) who has decided to build a sales data mart as the first step in rolling out comprehensive analytics. In discussions with the sales department you have figured that the no. of units sold, dollar amount of sales, and the number of unique customers in a segment are the main metrics they look at. Digging deeper you figure that the sales guys are likely to want analysis by product, product category/class, brand, store location (city, state, region, country, …), customer demographics, and also by individual promotions and promotion categories. It may not be explicitly mentioned, but the metrics would also be analyzed by time (day, week, month, quarter, …)
Now that you have figured out the business metrics to be measured, this gives you the facts you would need in the data mart ‘fact table’ for calculating them. Similarly you have figured out the potential segments for analysis, and that gives you the ‘dimensions’ for analysis. The ‘fact table’ linked to the ‘dimension tables’ makes up the ’star-schema’ (because of the star-like structure), also known as the data mart.
With this information in place, we have the high level Dimensional Model for Sales.
Sales_Fact_1998 is the main fact table that has sales information by store/location, product, time, customer, and promotion. Correspondingly there are 5 dimension tables joined to the fact table through foreign keys in the star-schema.
The dimension tables in turn have detailed data that can now be used for defining ad-hoc analysis segments. For example, we can put demographic filters on the customer dimension (say age<30, married, college-educated), choose specific product class(es) in the product table (say Dairy Products), specify a limited time period, and then get our metrics calculated for the ad-hoc segment.
The image below shows the detailed information available in the dimension tables for defining ad-hoc segments.
You can download the data model here, and then open the saved model using DB Designer in DecisionStudio Professional (Start -> Program Files -> DecisionStudio Professional -> Data Analyst -> DB Designer Workbench). You can see other tables in the FoodMart database by scrolling around on the canvas (scroller in top-right corner).
Do note that our dimensional model for sales covers only a small relevant set of tables from the entire FoodMart database. You can load the entire downloaded FoodMart data into MySQL as outlined here, and can query on the data using Query Browser (Start -> Program Files -> DecisionStudio Professional -> Data Analyst -> MySQL Query Browser).
June 2nd, 2006 at 12:53 pm
I chanced upon this link from the follwing link when I was looking for Star Schema Database modeling.
http://en.wikipedia.org/wiki/Star_schema
Yogesh
July 25th, 2006 at 10:10 am
Hi, I really like the post. It really inspires me. But still, I have a big question mark inside my head.
I have to make a reporting tool for automating the reporting process. The reports consist of summary tables and charts to visual the trends happening in the organization. Do I have to make star schema for each measure (or chart) that the user wants to see? Or I do not need any star schema at all, just manipulate it with SQL manipulate statements?
I am a newbie in dimensional modeling and wish to learn more because I am very interested. Thank you.
July 25th, 2006 at 5:46 pm
Well, that depends really on what data you have and what kind of reports/charts you want to prepare. If you have to make only 2-3 very specific reports as a one-time job, dimensional analysis and star-schema could be an overkill. However, if you have multiple reports/analyses on one subject area then it would be better to invest the time in dimensional modeling and creat the star schema. Yes, the same reports can be made by running SQLs, but the benfit of a star schema is as an abstraction layer between the raw data and the final reports. A properly designed star schema can then be used to generate a lot of new reports and charts with minimal effort.
Ideally you would not need more than one star schema (assuming that you are lookin at a single subject area such as sales). Multiple measures can be defined as run-time computations on a single star schema, and each report is a specific collection of measures and dimensions arranged on rows and columns.
So in the sales cube above, I can define multiple measures such as Unit Sales, Dollar Sales, Min/Max/Average Sale price, No of Customers, No of Stores, Total Shelf Area, Total Profit, Unit Sales per Customer, Revenue per Customer, etc. Now we already have various dimensions in the schema, such as product, store, time, etc. so making new reports is a breeze. For example, throw Product Dimension on Rows, Time Dimension on Columns, and choose Dollar Sales as the measure to get a product and time-wise Sales Report. Similarly you can create a lot of other reports as per your needs.
Hope I have been able to answer your question here…
July 26th, 2006 at 7:42 am
Thank you for the answer, Nishith. So what you mean that if I want to prepare more than 3 reports, the star schema approach is more preferable than SQL DML? Yesterday, I tried to design a star schema based on my business process case. Before I continue asking you, hope you don’t mind me asking you so many question, I want to try to explain the business process first.
The business process that I involved in is service/help desk, especially on incident management. There is already an enterprise system which handles the service management in the organization. But the reports generated by the system weren’t suitable for the management’s favour. The manual reporting process is as follows:
1. The system generate raw reports consist of rows and columns on a flat file (CSVs or Excel Spreadsheets).
2. The staffs process the flat files in Excel (usually using Pivot Table-Pivot Chart).
3. Reports are done and the staffs presentate it in front of the vice president every month.
The disadvantage of this process is the time to proccess the reports and the human error factor. So, I’ve been asked to develop somekind of reporting tool which can automate the process, improve the performance of reporting, and improve the quality of information of each reports generated (drill-down feature, or maybe create an accumulative report every year or something like that).
Okay, I hope I didn’t bore you with the story, so here are my four questions:
1.) The input of the reporting tool I’m developing are flat files like .csv and .xls. I’ve created an initial star shema design based on the column headers in those flat files. Is it a correct move? Should I design a star schema from that files, or should I just make one table exactly the same as the files?
2.) I still confused with facts. How can I make good facts? Is it based on the user requirements, or it’s just based on my predictions?
3.) In my initial star schema design, I seperated the time attributes (arrival time, resolved time, closed time, shift) into one TIME dimension table. What if I need the TIME dimension like the sales data mart example above? Should I rename the dimension table from TIME to PROCESSING_TIME (just an example) and make another time dimension table (just like the sales data mart example) called TIME dimension table, or I just append them in one TIME dimension table?
4.) I was wondering, since I first browsing the Internet searching for resources of information about data warehouses, star schemas, snowflake schemas, etc. all I found are always sales business process as their examples. Do you know where to find another business process examples beside sales? So I can compare the concepts between the business process and to understand more of the concepts.
Thank you for your time and willingness, Nishith.
August 15th, 2006 at 6:37 am
I would like to use your Dimensional Model for Sales diagram in an in house tutorial for a client. The source will be attributed to your website. May I have your permission?
September 11th, 2006 at 12:21 pm
it is very good send this type of answers to my id o.k thankyou sir.
November 8th, 2006 at 2:21 am
iam looking jobs as an etl developer .iam involved in designing the mappings.will they ask me about dimensions and facts?pls send me the dimensional model for auto insurance.
November 10th, 2006 at 6:32 pm
Hi Neena,
I guess if you are looking at only ETL, then Dimensional Modeling doesn’t get covered. However, it may be a good idea to know these things as it would give you a better understanding and broader picture of your project.
March 13th, 2007 at 2:33 am
thnks for u r information ….
June 8th, 2007 at 6:55 am
I am considering the following problem relative to databases and wondering if Start Schema is suited for this purpose:
Background: Suppose we want to store product information in a database table. We can create a table with attributes representing the product features (e.g, for a camera, number of pixels, zoom, weight, etc.). However, if you now want to save product information for a different product, say, a laptop, this table will not be right.
So, what is the optimal approach which will allow quick search, update, insert of new products into database?
In this context, is Start schema appropriate? What are the pros & cons?
Any comments that help give a good approach are appreciated.
Sudhakar
June 8th, 2007 at 4:59 pm
Hi Sudhakar,
Star Schema is a conceptual framework used for designing data marts that answer business analysis questions. From your question it appears that you need a transaction processing (OLTP) system and star schemas would be of no help.
You would need to use a normalized database design (as is usual for OLTP systems) to make sure that inserts, updates, etc. are fast. You can keep query speeds good if you are querying on a limited number of parameters (such as product type, name, item no, etc.) which are present in a main table, and then product details etc. appear in normalized detail tables.
Once you do that, your resultant set of tables would couold be more complex than the star schema above.
June 12th, 2007 at 11:11 am
I am set to explorer the BI components in sql 2005. I intially dont have concepts behind BI. After a long time i understood some basic defintion for BI system and its components like Integration service(ETL process), Reporting Service, Analyze Service(OLAP) and AMO.
I’d love to share my question with you!
1. Whats the differences/similarity betweens OLAP and ETL and Reporting process?
My understanding about the above one is as follows.
Using OLAP we create data source view from a typhical RDBMS system.
From source view we classify dimensions and facts table.
From those two table (Dimension&Fact) we create cube.
From Cube or one step ahead using star schema/snow flake we extract information and shown in the screen.(using ADOMD.NET)
Here in the above steps which step is identified as ETL and for reporting?
please guide me.
Thank you.
Gowrishankar
July 20th, 2007 at 1:57 am
Nishith, solid answers and a beautiful way to explain. I am looking for documents that will explain the way you have explained the details. Can you provide me information on such sites.
Thanks,
Sachin.
July 20th, 2007 at 11:44 am
Gowrishankar,
ETL is to take the data out of source (OLTP) systems and load it into a Data Mart/Warehouse for analytical and reporting purposes. In your example you are building your cubes directly on top of the source system DBMS, and hence there is no ETL step involved.
Hi Sachin,
Thanks for your kind comments. Unfortunately, good information is hard to come by. When I was scouring the net I couuld hardly find anything worth mentioning, and hence this blog.
I guess Google and loads of patience should do the trick. If there are some specific queries, we could try too help.
Nishith
August 4th, 2007 at 8:44 am
Hi Nishith,
I am doing an example project for retail business for my understanding of OLAP modeling.
I learnt a lot from your data model for retail sales. I have kept transaction dimension because I thought a sale has to be part of a transaction. So I have transaction_id, date_id, product_id and location_id as foreign keys in the fact table. Am I doing it right?
And also how can I get a report of top 5 selling products and botttom 5 selling products in a month for a location from your model?
And also I want to get a report of average sales per transaction.
Can I get both these reports from the same fact table or shoud I do 2 separate fact tables?
Thanks for very good information.
Sumukha
September 19th, 2007 at 2:22 pm
Hello Nitish,
Here I need to create a sales datamart for clicksteam, can u please suggest what are most important dimenions required.
January 26th, 2008 at 12:57 am
r u mean data mart is nohing but diemension table
January 29th, 2008 at 9:23 am
I need thesis papers for sale system with OLAP using data mart
February 4th, 2008 at 11:21 pm
Nishith,
Very nice discussion you have posted here – clear and good example of a star schema and how it is constructed.
In your example the fact table is organized by time as most are – what do you call this dimension? in other words is it the dimension of the fact table (time)?
Do all rows in the fact table require a value in the dimension table?
For example: Maybe I have one store that for whatever reason does not have value for promotion ID (and nulls are allowed in this field on my fact table), should I create a default value for that store in promotion ID and store the default in dimension table to keep the construct?
Or do I need to create a separate star schema for all stores with promotion code and then one for all stores which does not include promotion code dimension table?
Or do I have separate fact tables for stores with promotions and without and in that case what happens when I want to report on all the data (a union all being a performance concern)?
I hope this makes sense and thanks for reading – I’ve often times run across instances like this when dealing with transactional data and find opinions vary about using default values in dimension tables.
February 4th, 2008 at 11:38 pm
let me try that second question again… sorry for double post but the above wasn’t clear
Do all rows in the fact table require a value in the dimension_ID column corresponding to all dimension tables?
For example: Maybe I have one store that for whatever reason does not have value for promotion ID.
Should I;
create a default value for that store in promotion ID which is inserted in the ETL process and store a corrsponding set of values for the default in the dimension table?
Or
Do I need to create a separate star schema for all stores with promotion code and then one for all stores which does not include promotion code dimension table using the same fact table allowing nulls in promotion id?
Or
Do I have separate fact tables for stores with promotion codes and those without, and if so how do I view all stores at once?
or
Do I have separate fact tables for all stores and for all stores with promotion codes (seems redundent)?
I hope this makes sense and thanks for reading – I’ve often times run across instances like this when dealing with transactional data and find opinions vary about using default values in dimension tables.
February 13th, 2008 at 12:29 am
bobby,
depends on what you are ultimately trying to measure, but generally, no you do not need to do anything to Nishith’s model. IF a store is not running a Promotion on a given product during a given day/week, it won’t have a row (transaction) in the fact table. If on the other hand, you are trying to get a total product view across all stores in a given week with a drill down to various promotions, then you will want to set some default value in Promotion dim that means “not promoted”.
mike
February 17th, 2008 at 7:56 pm
Mike,
Thank you for the answer…
however I am not understanding why you say that “if a store is not running a Promotion on a given product during a given day/week you won’t have a row (transaction) in the fact table”
In my understanding if the store sold a product with no promotion then there would in fact be a row (transaction) in the fact table – it would have valid values for other measures including store_id, customer_id, product_id, etc.
Again my understanding is that one could measure sales across any of these dimensions. So I might have a report for sales by store_id or sales by product_id.
I would want to see my row from the store that does not have a promotion id in these reports.
However I am still left with two choices when I go to add the dimension of promotion id;
1. Left outer join between fact and promotion id table
2. not picking up the row (transaction) or rows that don’t have a promotion id
both seem bad options so my recommendation to clients is often to use a default value.
however many clients have refused this recommendation as bad practice (i.e. inserting a value when no value was present on the source data).
I am just wondering if there is one “best practice” way to do this when assembling star schemas.
Thanks again
bobby
February 20th, 2008 at 7:21 am
Hi Bobby,
You have correctly pointed out that when you make a report by a particular dimension (say promotion_id), then the report will not show data for records in the fact table having promotion_id column as blank.
This is a common enough situation, and the solution you have recommended to clients is almost always the right solution (use default values, say NULL, with a corresponding row in the Promotions dimension table).
Outer joins are expensive, and you definitely do not need multiple star-schemas.
I can understand some of the clients being apprehensive, and its upto you to educate them and keep reminding that the data warehouse is unlike any other database they have dealt with in small, yet very critical, ways.
The differences between OLTP and OLAP systems need to be understood by IT Management to roll out the best solutions instead of blindly following dogmas that are only for OLTP solutions. For example, a banking system database would need to be backed up at least daily, whereas a data warehouse with only a monthly batch refresh of data does not need a daily backup.
February 21st, 2008 at 4:56 am
Bobby,
Nisrith’s comment on setting a NULL value is also what I had described you would likely want to / need to do. To clarify my statement (and your question), since a row in the fact table has a multi-part key, you MUST have a legitimate value for the Promotion Id key, otherwise you have no choice but to NOT load the row in the fact table to maintain integrity.
Final thought on the “bad practice” of setting default values. Sounds like your clients are concerned about setting a default value when the quality of the source data is unknown. In other words, the product may or may not have been promoted – no one knows for sure. In this case, it is ok to set a value that means “unknown” vs. “not promoted”. In this way, you can always create a report listing of rows with “unknown” for follow-up resolution with the source.
Mike
February 21st, 2008 at 7:56 am
Yes Mike, you are right. Most likely the concerns about default value are related to the quality of the source data.
Bobby,
In some cases it helps to point out that the Data Mart contains processed and transformed data – and replacing NULLS with alternate value can be considered a transformation.
Also, if you are making a Mart/Warehouse, it is extremely important to keep a detailed and updated Data Dictionary. These default values, etc. can be noted down against the respective columns in the Data Dictionary. A common IT Management apprehensions is that the knowledge (of default values, say) would be lost once the concerned resources are no longer around. This apprehension can be addressed to some extent by keeping the Data Dictionary updated, relevant, available and utilized.
March 3rd, 2008 at 10:33 am
Hello
I am working as a database engineer in a reputed MNC.i m working over cognos tool.now the work assigned to me is to develop a sales data mart.please help me our howe to make it and from where to start.
thank u
May 20th, 2008 at 9:06 pm
WELL SAID…
May 26th, 2008 at 12:42 pm
hey, guys i have assignment about Dimentions maps and star scema.
i need a help. I have table from my professor and i am not getting what to do here. pls anyone can help and e-mail me for conatct at nir_prince@hotmail.com will be appriciated.
pls someone help me.
Nirav
May 26th, 2008 at 4:55 pm
Really good Nishith !
August 13th, 2008 at 1:08 am
Read this book:
http://www.amazon.com/Dimensional-Data-Warehousing-MySQL-Tutorial/dp/0975212826/ref=pd_bbs_sr_1?ie=UTF8&s=books&qid=1218569860&sr=8-1
September 3rd, 2008 at 12:25 pm
Hi,
I am contributing towards designing a retail datawarehouse. the functional areas include stores, CRM, merchandising, category management etc. I need help in the dimensional modelling phase. Can anyone help please??