2 Preprocess Layer
Cleaning up the source. Still no business logic.
- This step cleans up source data within each table.
- Strictly no joins at this point.
- The only operations that happen here are the following (a minimal sketch follows after this list):
- harmonizing values (e.g. converting upper case to lower case)
- casting (or safe-casting) data types (e.g. converting strings to dates)
- deduplicating values (e.g. making sure that @googlemail and @gmail are considered the same value, that Berlin and Berlinn are the same value, trimming leading and trailing spaces, etc.)
- formatting dates (creating ISO weeks, months, quarters, years)
- harmonizing timestamps (e.g. converting all timestamps to UTC)
- establishing consistent handling of unknown values (e.g. take NULL values from the source and do not convert them to “n/a” or similar; if a source delivers a value that means “unknown”, convert it to NULL here so that unknown values are NULL consistently and throughout)
- Ideally, this layer reads only those columns from the source that are needed later. However, on a greenfield setup this generates too much overhead; I would first select everything and only drop columns once issues come up.
- It is recommended not to rename columns at this step. This step is still a source-aligned representation of the data. The translation of this data to business-aligned / analytics-aligned objects happens in the next step.
- Strictly no SELECT *. Instead, every column is selected explicitly, so that we keep control over which data is propagated downstream when the sources change.
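Here is a minimal sketch of such a preprocessing model, assuming a hypothetical raw source raw_crm.customers and BigQuery SQL; all table and column names are illustrative, not part of the original setup:

```sql
-- models/preprocess/prep_customers.sql (illustrative path)
select
    -- strictly no select *: every column is listed explicitly
    id,
    -- harmonize casing, trim whitespace, normalize equivalent values
    replace(trim(lower(email)), '@googlemail.', '@gmail.') as email,
    -- safe-cast strings to proper types (returns NULL instead of failing)
    safe_cast(signup_date as date) as signup_date,
    -- derive calendar attributes such as ISO weeks
    date_trunc(safe_cast(signup_date as date), isoweek) as signup_iso_week,
    -- harmonize timestamps to UTC (assuming created_at is a local DATETIME)
    timestamp(created_at, 'Europe/Berlin') as created_at,
    -- a source value that means "unknown" becomes NULL, consistently
    nullif(trim(lower(city)), 'unknown') as city
from {{ source('raw_crm', 'customers') }}
```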
3 Object Layer
Translating data from operational systems into an analytical view (Components of Data Products)
- The structure of data in the source systems often differs significantly from the way business users look at data because source systems are optimized for throughput and reliability while a DWH needs to be optimized for analytical purposes.
- The purpose of this layer, therefore, is to “translate” the raw data into business objects (e.g. coaches) and activities (e.g. bookings) that are relevant to the business.
- This layer typically stores components of data products, but often not finished data products that are ready for end-user consumption.
- Three important but sometimes conflicting paradigms should be considered when building models in this layer:
- Don't Repeat Yourself (DRY). This paradigm prescribes that every piece of transformation logic should be built as early as possible in the data pipeline so that it can be consumed by downstream models and does not need to be repeated. This creates efficient code and minimizes the risk that important business logic is forked and duplicated inadvertently.
- Keep it Simple, Stupid (KISS). This paradigm prescribes that we want to keep the lineage of our pipelines as simple and clean as possible so that data flows are easy to understand.
- Prepare for Decentralized Ownership (PDO). This paradigm is based on the assumption that most companies should, over time, move from centralized ownership of all parts of the data pipeline to decentralizing parts of the pipeline, in order to act on stakeholder requests more efficiently and to improve model quality. Migrating from centralized ownership to decentralized or hub-and-spoke models later is prohibitively expensive and complex, so it makes sense to follow a few guidelines now, at little extra cost, to prevent these large costs from occurring in the future.
- Here are a few important applications of these principles:
- All KPIs that can be created without any joins should be created in this layer (e.g. net_revenue) so that they are available in downstream models
- All KPIs that require joins of different tables from the preprocessing layer within the same source should be created here
- All KPIs that require joins between tables from different sources owned by the same domain should be avoided, but can be acceptable if the domain ownership is obvious (e.g. a join between a source providing Facebook marketing costs and another providing Adwords marketing costs is acceptable, since both sources are quite obviously owned by the marketing domain; however, as the organization grows, Facebook marketing and Adwords marketing could end up owned by different domains/departments).
- All KPIs that require a join between tables from different sources AND different domains are not allowed (e.g. joining customers, bookings, and invoices to calculate customer lifetime value). The reason is that in a decentralized setup, all objects in this layer could be provided decentrally, such that the data provider (domain owner/source owner) does the preprocessing and KPI calculation, and a data provider can only do calculations that fall within the jurisdiction of their own domain.
- dbt models are organized in folders that should ideally represent business domains.
- Variables should be used to make sure that business logic (especially CASE WHEN statements) is centralized in one place, so that it does not diverge when the business logic changes. Variables are defined in dbt in the dbt_project.yml file (see the object-layer sketch after this list).
- To prevent accidentally filtering or duplicating data, referential integrity between fact tables and dimension tables needs to be established in this layer. This means that every dim_ table must include all keys that appear in a fact_ table referring to that dim_ table; e.g. there can’t be a customer_id in fact_bookings that does not exist in dim_customers.
- Primary keys also need to be established in this layer and must be tested for uniqueness and NOT NULL, since some databases (e.g. BigQuery, Snowflake) do not enforce primary key constraints (see the schema.yml sketch after this list).
- Naming conventions make it easy and intuitive to query tables:
- Fact tables have a prefix fact_
- Dimension tables have a prefix dim_
- Words within table and column names are separated by underscores (snake_case)
- All table names and KPI names are in the plural
- All keys are the name of the table in singular followed by an underscore followed by “id” (e.g. customer_id in dim_customers)
- Abbreviations in column names should be avoided (e.g. contribution_margin instead of cm)
- The column name should convey as much of the column’s meaning as possible (e.g. booking_date_utc or booking_date_local_time instead of booking_date, to make the timezone explicit without having to consult the documentation).
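As an illustration of the points above on no-join KPIs (e.g. net_revenue) and on centralizing CASE WHEN logic via variables, here is a sketch of an object-layer model. The variable name cancelled_statuses, the model names, and the columns are assumptions for illustration only:

```yaml
# dbt_project.yml (illustrative): the status list lives in exactly one place
vars:
  cancelled_statuses: "'cancelled', 'charged_back'"
```

```sql
-- models/objects/sales/fact_bookings.sql (folder = business domain)
select
    booking_id,
    customer_id,
    booking_date_utc,
    -- KPI that needs no join: built here once so downstream models reuse it (DRY)
    gross_revenue - discounts - refunds as net_revenue,
    -- centralized business logic: the status list comes from dbt_project.yml,
    -- so it cannot silently diverge across models
    status in ({{ var('cancelled_statuses') }}) as is_cancelled
from {{ ref('prep_bookings') }}
```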
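And a sketch of the corresponding tests for primary keys and referential integrity, using dbt’s built-in unique, not_null, and relationships tests (model and column names again illustrative):

```yaml
# models/objects/schema.yml (illustrative)
version: 2
models:
  - name: dim_customers
    columns:
      - name: customer_id
        tests:
          - unique      # BigQuery/Snowflake do not enforce primary keys,
          - not_null    # so uniqueness and NOT NULL are tested here
  - name: fact_bookings
    columns:
      - name: booking_id
        tests:
          - unique
          - not_null
      - name: customer_id
        tests:
          # referential integrity: every customer_id in the fact table
          # must exist in dim_customers
          - relationships:
              to: ref('dim_customers')
              field: customer_id
```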
Note that the DRY and PDO principles conflict: DRY would propose calculating all KPIs in this layer, while PDO prohibits some of the joins that are necessary to do so. If in doubt, I suggest prioritizing DRY over PDO.
- Objects should eventually be normalized in the 3rd normal form
- Violating best practices to gain speed (e.g. not normalizing) is fine as long as the violation/tech debt is documented in the code or the .yml files.
- One 3NF principle that I typically violate in early stages is the handling of outrigger dimensions (when one dimension depends on another dimension)
- Slowly changing dimensions (e.g. SCD2) can be used in this layer (as opposed to the Datamarts Layer), because the Object Layer should only be accessible to skilled analytics engineers who know how to query SCD2-historized dimensions (see the query sketch below).
- There is no differentiation between PII and non-PII, i.e. the layer contains both PII and non-PII data. Access should be handled via IAM (Identity and Access Management, e.g. in BigQuery).
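To illustrate why SCD2 dimensions require skill to query correctly, here is a sketch of a point-in-time join, assuming a hypothetical dim_customers_history with valid_from/valid_to columns (names are illustrative):

```sql
-- The booking must fall inside the validity window of the dimension row;
-- forgetting the date predicates silently duplicates fact rows.
select
    b.booking_id,
    b.net_revenue,
    c.customer_segment                 -- the segment as it was at booking time
from {{ ref('fact_bookings') }} b
join {{ ref('dim_customers_history') }} c
  on  b.customer_id = c.customer_id
  and b.booking_date_utc >= c.valid_from
  and b.booking_date_utc <  coalesce(c.valid_to, timestamp '9999-12-31')
```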
4 Datamarts Layer
Purpose: Finalizing the calculation of KPIs and Dimensions on unit grain and providing granular, actionable content to analysts and business stakeholders (Data Products)
- This layer is still at the granularity of objects (e.g. customers) and activities (e.g. bookings), but it creates KPIs that require combining multiple objects and activities (e.g. customer lifetime value for a coach or a P&L for a location should be calculated here).
- This is the layer that connects to stakeholder-facing systems (Google Sheets, reporting tools, campaign management systems, and interfaces to marketing tools such as Adwords). The layers upstream of the Datamarts Layer should not be visible to non-BI-team members (i.e. business analysts should only see the Datamarts and Reports Layers).
- Thus, I consider this layer to contain finished, end-user / business-user facing data products that feed data-driven services
- The Datamarts Layer is heavily de-normalized (flattened, “one big table” approach) for quick and simple querying. No complex joins, lead/lag lookups, or other window functions should be necessary on top of this layer.
- SCD2 dimensions are not OK here, as they are too complex for analysts to understand and the joins almost always go wrong (e.g. forgetting to filter on valid_from/valid_to). Instead, the current state of a dimension and the relevant historic states should be flattened into the fact object (see the sketch at the end of this section).
- There is no differentiation between PII and non-PII. This should be handled via IAM
- A potential issue with splitting the Object and Datamarts Layers is increased cloud cost. Incremental loads are important for large data volumes, but I would usually start with full loads to reduce complexity.
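A sketch of a flattened datamart model along these lines, reusing the illustrative objects from above and assuming non-overlapping SCD2 validity windows: current and at-booking dimension states plus a cross-object KPI are precomputed, so analysts need no joins or window functions on top of this table.

```sql
-- models/datamarts/mart_bookings.sql (illustrative): one wide table
select
    b.booking_id,
    b.booking_date_utc,
    b.net_revenue,
    -- cross-object KPI finalized in this layer (a simplistic lifetime-value proxy)
    sum(b.net_revenue) over (partition by b.customer_id) as customer_lifetime_revenue,
    -- current dimension state, flattened in
    cur.customer_segment  as customer_segment_current,
    -- historic state at booking time, flattened in (no SCD2 exposed to analysts)
    hist.customer_segment as customer_segment_at_booking
from {{ ref('fact_bookings') }} b
left join {{ ref('dim_customers') }} cur
  on b.customer_id = cur.customer_id
left join {{ ref('dim_customers_history') }} hist
  on  b.customer_id = hist.customer_id
  and b.booking_date_utc >= hist.valid_from
  and b.booking_date_utc <  coalesce(hist.valid_to, timestamp '9999-12-31')
```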