
Data Engineers typically build pipelines but they often do a lot more than this and so if you’re thinking of becoming a Data Engineer or are already one and just looking to broaden your knowledge then I prepared some high level topics here to help with that. In summary these are:
- Data Lake
- Data Warehouse
- Data Mart
- Star Schema
- Snowflake Schema
- Fact
- Dimension
- Slowly Changing Dimension (SCD)
- Data Pipeline
- ETL
So let’s now expand these a bit.
What is a Data Lake?
A Data Lake is a centralised place or “repository” where data is firstly located in its raw form before it is then picked up and transformed into something else. Not be be confused with the source databases, it will house copies of that data which will involve some kind of pipeline in order to put it there.
High level types of data that may exist in a Data Lake would be:
- Structured Data, for example relational database data
- Unstructured Data, for example image data, emails or PDF’s
- Semi-Structured Data, for example CSV, JSON, XML or logs
The data could be anything – customer data, transactional data, message data, image data, audio data or video data are all types that you may upload to a Data Lake.
An example of a Data Lake might be a bucket in Google Cloud Storage.
What is a Data Warehouse?
A Data Warehouse contains a processed version of the source data that helps to create a single version of the truth.
What do I mean by “single version of the truth“?
To better understand that, we have to know the purpose behind having a Data Warehouse? It is to bring all the data together from different sources or silos to help create one place where that data can be queried accurately having been cleaned and made available for further analysis and decision making.
So, a simple hypothetical example would be to:
- Extract the customer data from the CRM system
- Extract the sales order data from the SOP or POS system
- Combine these two data sets together into one schema
In that example, instead of two reports being created and then doing some bespoke funky thing in a spreadsheet to combine the two sets together, the data has already been cleaned and merged making reporting on it far easier. Just query the data from the warehouse and visualise it quickly.
The Data Warehouse will typically grow to be very large in size, allowing great scope for historical data analysis. The source systems on the other hand may be periodically purged of old data based on data retention policies.
Decision makers can obtain a view of how all the business units are performing via the “single version of the truth”
A Data Warehouse will typically be built based on one of two different database modelling techniques – Star or Snowflake comprising of Facts and Dimensions.
What is a Data Mart?
A Data Mart is a smaller, “focused” set of tables that once again contain data that has been cleaned and merged in order to provide easy access to data to allow for analysis and decision making but has been targeted towards a particular subset of the business – for example a business unit.
There are two types of Data Mart, one that is dependent on the Data Warehouse and one that isn’t.
A dependent Data Mart, will be populated by data from the Data Warehouse. There will be another pipeline that populated the data mart with fresh data from the data warehouse.
An independent Data Mart however will not depend on a Data Warehouse. It is likely that in this case that a Data Warehouse does not exist in this business, instead marts will have been created to focus on different business units with the decision made that a warehouse isn’t needed.
Each business is different and so requirements may vary but in some ways the independent data mart offers an advantage over the dependent data mart as it may be faster to build these instead of building a warehouse first.
A Data Mart will typically be built based on either the Star or Snowflake modelling approaches comprising of Facts and Dimensions.
What is a Fact?
A Fact within a data warehouse or data mart is reference to a table that holds predominantly measurable data. For example a table called “sales_order” would house records containing metric fields such as sales price and sales quantity.
The sales_order table will have references to the dimension tables to allow them to be filtered and aggregated.
What is a Dimension?
A Dimension within a data warehouse or data mart is reference to a table that holds data that is used for filtering facts. For example a table named “customer” would be populated with all the customer records within the business.
Other examples might be “date”, “region”, “sales_team”. All such tables can be used to provide a way to aggregate fact data. Reports could then be created such as “Sales By Region”, “Sales By Team”, “Top 10 Customers” etc.
What is a Star Schema?
A Star Schema is the simplest form of a data model that is often found in either a data mart or data warehouse.
As the name suggests, the model is likened to a star shape. If you were to put the objects on paper, then you would see a central fact table surrounding by dimension tables joining to it.
The approach does involve relational database design techniques but typically forms a more de-normalised approach whereby the preservation of minimal data redundancy is sacrificed for speed.
The dimension tables are likely to contain repeating values instead of relationships to other dimension tables. For instance you might want a “date” dimension. This would contain fields such as quarter description, month description which would be string values that repeat for each date. In a more normalised approach, there would be reference tables for quarter and month to perform those same description lookups.
An illustration can be found here on Wikipedia
What is a Snowflake Schema?
A Snowflake schema is a more normalised approach to data warehouse design.
There will again be a central fact table with dimension tables relating to it. There will be more dimensions than a star schema with dimensions relating to dimensions. Preservation of minimal data redundancy is a highly regarded consideration and speed is less important.
Imagine a snowflake and you can get a feel for how deep the relationships in the schema could go.
An illustration can be found here on Wikipedia
What is a Slowly Changing Dimension?
A Slowly Changing Dimension (SCD) is a description that relates to dimension values that may be affected by an update.
As we know, a dimension is used for filtering and grouping of fact data.
If for example there was a “sales_rep” dimension in your schema that related to a fact table to allow for reports such as “Top 5 Team Performance”, the sales_rep dimension would provide the names of the employees involved in the sales process. It may also contain their location and of course, the team that they belong to. Let’s illustrate this with some dummy data:
Dimension: sales_team
| id | team | 
|---|---|
| 1 | New Business | 
| 2 | Sales Retention | 
Dimension: sales_rep
| id | employee_name | team_id | 
|---|---|---|
| 1 | Sally | 1 | 
Fact: sales
| order_ref | sales_date | price | quantity | product_id | sales_rep_id | 
|---|---|---|---|---|---|
| SO12345 | 2022-09-05 | 66.23 | 5 | 1 | 1 | 
| SO12345 | 2022-09-05 | 153.78 | 10 | 2 | 1 | 
| SO12346 | 2022-10-21 | 45.09 | 1 | 3 | 1 | 
| SO12346 | 2022-10-21 | 34.78 | 3 | 4 | 1 | 
Let’s now imagine an example where Sally, who is in sales team “New Business” is re-assigned to sales team “Sales Retention”. What will happen to Sally’s sales? The manager of sales team “New Business” would want to keep Sally’s sales against their team because they go towards that teams target. But Sally is moving team and her dimension record team identifier, once changed would automatically move her sales to to “Sales Retention”.
An SCD helps to take care of this problem. There are six different types of SCD which warrants a post in itself but in this example, we could use a type 2 SCD to allow Sally’s historical sales to remain in New Business but her future sales would be allocated to Sales Retention.
Dimension: sales_team
| id | employee_name | team_id | start_date | end_date | current | 
|---|---|---|---|---|---|
| 1 | Sally | 1 | 2022-01-01 | 2022-10-01 | 0 | 
| 1 | Sally | 2 | 2022-10-02 | NULL | 1 | 
In the above illustration, the sales_rep dimension has been modified to include 3 additional fields to segment Sally’s data by date. Joins and filters can then be applied between sales and sales_team to correctly allocate the sales revenue to the correct team for the correct dates.
What is a Data Pipeline?
Data Pipelines are the systems created by Data Engineers that allow data to be moved from source to staging to destination. In this case, sources will vary; relational databases, NoSQL, file based, message streaming etc. The destinations could be the Data Lake, Data Warehouse or Data Mart.
What is ETL?
The acronym that stands for Extract, Transform and Load that applies to Data Engineering practices.
- Extract – extraction of source data
- Transform – cleaning and mapping of data
- Load – inserting cleaned/mapped data to destination
ETL can be a time consuming process dependent on how difficult it is to clean and map the data. The result of that effort allows the business to analyse their data as a whole and make decisions from that.
What is ELT?
A small change to the above but a significant one. The Data Engineer will provide a pipeline between source and destination but not apply cleaning and mapping. This process would then be left to an analytics engineer who will perform such operations as required.
- Extract – extraction of source data
- Load – inserting cleaned/mapped data to destination
- Transform – cleaning and mapping of data
The advantage with this one is that a Data Engineer does not have to spend a lot of their time cleaning the data and can then be re-assigned to other projects. In this case, it is not vital that data is made usable immediately for analysis. Analysis can happen on a more ad-hoc basis. This may mean that some data is never looked at. These are business decisions made to meet requirements.
Hope you found this useful!
Photo by Waldemar Brandt on Unsplash