Cloud Computing: Building a Data Powerhouse: The Four Layers That Drive Data Warehouse Success

Operational Data Store Layer

This is the first layer of the data warehouse, and it contains only raw data. At this level, the data is structured or semi-structured. Most importantly, no transformation is carried out, so the data stays in its original state. Data engineers focus on ensuring connectivity between this layer and the original data sources. They also verify that the correct data arrives, and the only filtering they apply is removing records with missing values.

Technically, engineers synchronize database tables into Hive tables, or use Sqoop or Flink to collect the data. Data engineers can also collect the data into HDFS with Flume or Kafka.
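
The rule described for this layer (land the data unchanged and filter out only records with missing values) can be sketched in a few lines of plain Python. This is a minimal illustration, not how Sqoop, Flink, Flume, or Kafka work internally; the field names in `REQUIRED_FIELDS` and the function `land_in_ods` are hypothetical and chosen only for the example.

```python
import json

# Hypothetical schema: the fields a record must carry to be accepted.
REQUIRED_FIELDS = ("user_id", "event", "ts")


def land_in_ods(raw_lines):
    """Split raw JSON lines into accepted and rejected lists.

    Accepted lines are stored exactly as received; nothing is rewritten.
    A line is rejected only if it cannot be parsed or a required field
    is missing, null, or empty.
    """
    accepted, rejected = [], []
    for line in raw_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            rejected.append(line)
            continue
        incomplete = not isinstance(record, dict) or any(
            record.get(field) in (None, "") for field in REQUIRED_FIELDS
        )
        if incomplete:
            rejected.append(line)
        else:
            accepted.append(line)  # keep the original text, not the parsed form
    return accepted, rejected
```

Keeping the rejected lines in their own list, rather than discarding them, makes it possible to check later whether the source system or the connection was at fault.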

Data Warehouse Detail Layer

At this stage, data engineers focus on reshaping the data to make it consistent, accurate, and clean by removing impurities. They extract the data from the first layer and perform dimensional modeling, standardizing the data for use in business intelligence systems. For example, they clean multiple tables to extract the behaviors of business entities and to detail the dimensions of business activities. Technically, data engineers parse the log data delivered through Kafka and load it into Hive tables that hold user information or any other information.
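
As a rough sketch of the cleaning step described above, the following Python function normalizes raw event records into a consistent shape and drops exact duplicates. The input fields (`user_id`, `event`, `ts` as a Unix timestamp) and the function name `clean_events` are assumptions made for illustration; a real detail layer would typically do this in Hive SQL or a similar engine.

```python
from datetime import datetime, timezone


def clean_events(raw_events):
    """Standardize raw event dicts and remove duplicates.

    Each output row has an integer user_id, a lower-cased event name
    without surrounding whitespace, and a UTC ISO 8601 timestamp.
    """
    seen = set()
    cleaned = []
    for raw in raw_events:
        row = {
            "user_id": int(raw["user_id"]),
            "event": str(raw["event"]).strip().lower(),
            "event_time": datetime.fromtimestamp(
                int(raw["ts"]), tz=timezone.utc
            ).isoformat(),
        }
        key = (row["user_id"], row["event"], row["event_time"])
        if key in seen:
            continue  # same user, event, and time already recorded
        seen.add(key)
        cleaned.append(row)
    return cleaned
```

Deduplicating after normalization, not before, catches records that differ only in formatting, such as `" Order "` versus `"order"`.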

Data Warehouse Aggregation Layer

This layer creates summaries of the data produced in the second layer. For example, data engineers build aggregations for a problem domain to improve query performance, such as summarizing the data by number of users, number of orders, and other factors. They also compute the statistics at different granularities, such as day, week, and month. There are many more methods, but engineers are mostly concerned with user or product behavior metrics, such as visitor count and the number of orders.
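
To make the idea of granularity concrete, here is a small Python sketch that counts distinct visitors and orders per day, week, or month. The event shape (`day`, `user_id`, `event`) and the names `bucket` and `summarize` are hypothetical; in practice this would usually be a `GROUP BY` over the detail-layer tables.

```python
from collections import defaultdict
from datetime import date


def bucket(day: date, grain: str) -> str:
    """Map a date to the label of its day, ISO week, or month."""
    if grain == "day":
        return day.isoformat()
    if grain == "week":
        iso_year, iso_week, _ = day.isocalendar()
        return f"{iso_year}-W{iso_week:02d}"
    if grain == "month":
        return f"{day.year}-{day.month:02d}"
    raise ValueError(f"unknown grain: {grain}")


def summarize(events, grain):
    """Count distinct visitors and order events per time bucket."""
    visitors = defaultdict(set)
    orders = defaultdict(int)
    for event in events:
        key = bucket(event["day"], grain)
        visitors[key].add(event["user_id"])
        if event["event"] == "order":
            orders[key] += 1
    return {
        key: {"visitors": len(users), "orders": orders[key]}
        for key, users in sorted(visitors.items())
    }
```

Calling `summarize(events, "month")` on the same input as `summarize(events, "day")` yields fewer, coarser rows, which is what lets downstream queries read a small summary table instead of scanning every event.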

Application Data Service Layer

Now you reach the layer where you connect your data to business needs. It’s time to design and reshape the data for application scenarios, such as reports and recommendations. MySQL and Redis are commonly used to serve queries and generate real-time user preference reports.
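
One way to picture this layer is as precomputed, application-ready results stored under lookup keys. The sketch below builds a per-user preference report and returns it as a plain dictionary of JSON strings, which stands in for a key-value store such as Redis. No Redis or MySQL client is used here; the key format `user:pref:<id>`, the `category` field, and the function name `build_preference_report` are all assumptions for illustration.

```python
import json
from collections import Counter


def build_preference_report(order_rows, top_n=2):
    """Return {key: JSON list of each user's most-ordered categories}.

    The returned dict is a stand-in for a key-value store: each value
    is already serialized, so an application could serve it directly.
    """
    per_user = {}
    for row in order_rows:
        counts = per_user.setdefault(row["user_id"], Counter())
        counts[row["category"]] += 1
    return {
        f"user:pref:{user_id}": json.dumps(
            [category for category, _ in counts.most_common(top_n)]
        )
        for user_id, counts in per_user.items()
    }
```

Because the ranking is computed ahead of time, the application only performs a key lookup at request time, which is what makes near-real-time reports practical.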