This document describes data modeling and data warehouse layering best practices in resource- and time-constrained Startups and Scale-ups 🚀
1 Source Layer
Purpose: 1:1 Representation of the Source Data. Comparisons to source systems are always easily possible. No black-box transformation logic.
- This is a layer in the DWH that stores all the data in the format the data source “knows it”, with as little transformation as possible, i.e. as “raw” as possible (ideally no transformations at all).
- Unlike the following layers, this layer is usually separated across several resources in a Cloud Data Warehouse. For example, in BigQuery every source loads its data into tables in a separate dataset.
- Ideally, at this step, “contracts” between the owner of the data source and the DWH owner are established and eventually automated. The goal of a data contract is to ensure data quality and to prevent downstream data pipelines from breaking by capturing expectations around schema, business logic, and SLAs between a data provider (e.g. the IT department) and a data consumer (e.g. BI).
- A data contract is an API-like agreement between data producers and consumers, with each constraint enforced programmatically in the developer workflow. It captures expectations around schema, business logic, SLAs, or other forms of data governance.
- The enforceability of the contract is its most essential component. It is not simply an agreement on paper - that’s documentation. It’s also not simply a monitor on downstream data sets - that’s a test. The programmatic enforcement is what makes expectations into a contract.
- In startups, some components of a Data Contract are optional: as long as there are not dozens of internal data providers, programmatic enforcement and an API-like agreement are not needed.
- What is needed, though, are the following:
- 1. Documentation of expectations around schema, business logic, SLAs
- 2. Automatic detection when those expectations are violated
- 1) can be achieved by defining .yml files in the Preprocess Layer (see the next chapter about the Preprocess Layer) in dbt. These .yml files define the expected schema of the source and also describe simple semantic rules (e.g. the definition of primary keys, expected data freshness rules, and expected values such as status names or payment method names that downstream jobs rely on); see the example below.
- For 2), automated assertions verify these rules, and the source owners have access to the docs that are auto-generated by dbt. In an ideal world, the assertions would already run on the source side at compile time, but this simple implementation is a good start and we don’t need anything more complex at this point.
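As a concrete sketch, such a “lightweight data contract” could look like the following dbt source definition. All source, table, and column names are hypothetical placeholders, and the freshness thresholds and accepted values are assumptions you would replace with your own:

```yaml
# models/preprocess/sources.yml -- minimal "lightweight data contract" sketch
version: 2

sources:
  - name: backend_db                  # hypothetical source, e.g. owned by the IT department
    schema: backend_db_raw
    loaded_at_field: _loaded_at       # column used to check data freshness
    freshness:                        # expected data freshness (SLA)
      warn_after: {count: 12, period: hour}
      error_after: {count: 24, period: hour}
    tables:
      - name: bookings
        description: "One row per booking, loaded 1:1 from the backend database."
        columns:
          - name: booking_id
            description: "Primary key of the source table."
            tests:
              - unique
              - not_null
          - name: payment_method
            description: "Payment method names that downstream jobs rely on."
            tests:
              - accepted_values:
                  values: ['credit_card', 'paypal', 'invoice']
```

`dbt test` and `dbt source freshness` then provide the automatic detection from 2), and `dbt docs generate` produces the documentation from 1) that the source owners can access.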
2 Preprocess Layer
Purpose: Clean-up of the source data. Still no business logic.
- This step cleans up source data within each table.
- Strictly no joins at this point.
- The only operations that happen here are:
- harmonizing data (e.g. converting upper case to lower case)
- casting (or safe-casting) data types (e.g. converting strings to dates)
- deduplication (e.g. making sure that @googlemail and @gmail are considered the same value, making sure that Berlin and Berlinn are treated as the same value, trimming leading and trailing spaces, etc.)
- formatting dates (creating ISO weeks, months, quarters, years)
- harmonizing timestamps (e.g. converting all timestamps to UTC)
- establishing consistent handling of unknown values (e.g. take NULL values from the source and do not convert them to “n/a” or similar; if a source delivers a value that means “unknown”, it should be converted to NULL here so that unknown values are consistently NULL throughout)
- Ideally, this layer reads only those columns from the source that are needed later. However, this generates too much overhead on a greenfield setup. I would first select everything and only drop columns once issues come up.
- It is recommended not to rename columns at this step. This step is still a source-aligned representation of the data. The translation of this data to business-aligned / analytics-aligned objects happens in the next step.
- Strictly no SELECT *. Instead, every column is selected explicitly, because we want control over which data is propagated downstream in case the sources change (see the sketch below).
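A sketch of such a preprocess model, using BigQuery-flavored SQL. The table and column names, the source timezone, and the 'n/a' marker are hypothetical assumptions used only to illustrate the operations above:

```sql
-- models/preprocess/preprocess_bookings.sql -- hypothetical sketch of a preprocess model
-- One source table in, one cleaned table out: no joins, no renaming, no business logic.
select
    -- explicit column list: strictly no SELECT *
    booking_id,

    -- harmonizing and deduplicating string values
    lower(trim(customer_email))                    as customer_email,

    -- safe-casting data types
    safe_cast(booking_amount as numeric)           as booking_amount,

    -- harmonizing timestamps: convert the source's local time to UTC
    timestamp(booking_created_at, 'Europe/Berlin') as booking_created_at,

    -- formatting dates: ISO week of the booking
    date_trunc(date(timestamp(booking_created_at, 'Europe/Berlin')), isoweek) as booking_created_isoweek,

    -- consistent handling of unknown values: map the source's "unknown" marker to NULL
    nullif(payment_method, 'n/a')                  as payment_method

from {{ source('backend_db', 'bookings') }}
```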
3 Object Layer
Purpose: Translating data from operational systems into an analytical view (Components of Data Products)
- The structure of data in the source systems often differs significantly from the way business users look at data because source systems are optimized for throughput and reliability while a DWH needs to be optimized for analytical purposes.
- The purpose of this layer, therefore, is to “translate” the raw data into business objects (e.g. coaches) and activities (e.g. bookings) that are relevant to the business.
- This layer typically stores components of data products, but often not finished data products that are ready for end-user consumption.
- Three important but sometimes conflicting paradigms should be considered when building models in this layer:
- Don't Repeat Yourself (DRY). This paradigm prescribes that every piece of transformation logic should be built as early as possible in the data pipeline so that it can be consumed by downstream models and does not need to be repeated. This creates efficient code and minimizes the risk that important business logic is forked and duplicated inadvertently.
- Keep it Simple, Stupid (KISS). This paradigm prescribes that we want to keep the lineage of our pipelines as simple and clean as possible so that data flows are easy to understand.
- Prepare for Decentralized Ownership (PDO). This paradigm is based on the assumption that most companies should, over time, move from centralized ownership of all parts of the data pipeline to decentralized ownership of parts of the pipeline, to increase both the efficiency of acting on stakeholder requests and the quality of the models. Migrating from centralized ownership to decentralized ownership or a hub-and-spoke model is prohibitively expensive and complex, so it makes sense to follow some guidelines now, at little extra cost, to prevent these large costs from occurring in the future.
- Here are a few important applications of these principles:
- All KPIs that can be created without any joins should be created in this layer (e.g. net_revenue) so that they are available in downstream models
- All KPIs that require joins of different tables from the preprocessing layer within the same source should be created here
- All KPIs that require joins between tables from different sources that are owned by the same domain should be avoided but can be acceptable if the domain ownership is obvious (e.g. a join between a source providing Facebook marketing costs and another source providing AdWords marketing costs is acceptable, since both sources are quite obviously owned by the marketing domain. However, as the organization becomes larger, Facebook marketing and AdWords marketing could be owned by different domains / departments).
- All KPIs that require a join between tables from different sources AND different domains are not allowed (e.g. joining customers, bookings, and invoices to calculate customer lifetime value). The reason is that in a decentralized setup, all objects in this layer could be provided decentrally, in such a way that the data provider (domain owner / source owner) does the preprocessing and KPI calculation. The data provider can only do those calculations that are within the jurisdiction of their domain.
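To make the first two rules concrete, here is a sketch of an Object Layer model. All model and column names besides net_revenue are hypothetical, and the commission logic is an invented example:

```sql
-- models/objects/bookings/fact_bookings.sql -- hypothetical sketch
-- Joins two preprocess models from the SAME source and creates KPIs for downstream use.
select
    b.booking_id,
    b.customer_id,
    b.booking_created_at,

    -- KPI that needs no join at all: calculated here so downstream models can reuse it (DRY)
    b.booking_amount - b.discount_amount - b.refund_amount     as net_revenue,

    -- KPI that needs a join within the same source (and domain): still allowed in this layer
    coalesce(c.commission_rate, 0) * b.booking_amount           as commission_amount

from {{ ref('preprocess_bookings') }}              as b
left join {{ ref('preprocess_commission_rates') }} as c
    on b.product_id = c.product_id
```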
- dbt models are organized in folders that should ideally represent business domains.
- Variables should be used to make sure that business logic (especially CASE WHEN statements) is centralized in one place so that it doesn’t diverge when the business logic changes. Variables are defined in dbt in the dbt_project.yml file.
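A sketch of this pattern; the variable name and the status values are hypothetical assumptions:

```yaml
# dbt_project.yml -- business logic defined exactly once as a variable
vars:
  paid_booking_statuses: ['paid', 'partially_refunded']
```

```sql
-- inside any model that needs the logic: the status list is never hard-coded twice
select
    booking_id,
    case
        when booking_status in (
            {% for status in var('paid_booking_statuses') %}
                '{{ status }}' {{ ',' if not loop.last }}
            {% endfor %}
        ) then 1
        else 0
    end as is_paid_booking
from {{ ref('preprocess_bookings') }}
```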
- To prevent accidentally filtering or duplicating data, referential integrity between fact tables and dimension tables needs to be established in this layer. This means that every dim_ table must include all keys that are available in a fact_ table that refers to the dim_table. E.g. there can’t be a customer_id in fact_bookings that does not exist in dim_customers.
- Primary keys also need to be established in this layer and need to be tested for uniqueness and NOT NULL (since databases such as BigQuery and Snowflake don’t enforce primary key constraints).
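Both the referential integrity and the primary key expectations can be covered with dbt’s built-in tests. A minimal sketch, reusing the hypothetical model and column names from above:

```yaml
# models/objects/bookings/objects.yml -- hypothetical sketch
version: 2

models:
  - name: dim_customers
    columns:
      - name: customer_id
        description: "Primary key."
        tests:
          - unique
          - not_null

  - name: fact_bookings
    columns:
      - name: booking_id
        description: "Primary key."
        tests:
          - unique
          - not_null
      - name: customer_id
        description: "Referential integrity: every booking must point to an existing customer."
        tests:
          - not_null
          - relationships:
              to: ref('dim_customers')
              field: customer_id
```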
- Naming conventions make it easy and intuitive to query tables:
- Fact tables have a prefix fact_
- Dimension tables have a prefix dim_
- All tables and column names are separated by underscores
- All table names and KPI names are in plural
- All keys are the name of the table in singular followed by an underscore followed by “id” (e.g. customer_id in dim_customers)
- Abbreviations for column names should be avoided (e.g. contribution_margin vs cm)
- The column name should convey as much of the column’s meaning as possible (e.g. booking_date_utc and booking_date_local_time instead of booking_date, to highlight the timezone without having to consult the documentation).
Note that the DRY and PDO principles conflict, because DRY would propose calculating all KPIs in this layer while PDO prohibits some of the joins that are necessary to do so. If in doubt, I suggest prioritizing DRY over PDO.
- Objects should eventually be normalized to third normal form (3NF)
- Violating best practices to gain speed (e.g. not normalizing) is fine as long as the violation/tech debt is documented in the code or the .yml files.
- One 3NF principle that I typically violate in early stages is outrigger dimensions (when one dimension depends on another dimension)
- Slowly changing dimensions (e.g. SCD2) can be used in this layer (as opposed to the Datamarts Layer). This is because the Object Layer should only be accessible to skilled analytics engineers who know how to query SCD2-historized dimensions (see the sketch at the end of this section).
- There is no differentiation between PII and non-PII, i.e. the layer contains both PII and non-PII data. This should be handled via IAM (Identity and Access Management in BigQuery).
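For illustration, this is the kind of join an analytics engineer has to get right when querying an SCD2-historized dimension in this layer (the valid_from/valid_to columns and the customer_segment attribute are hypothetical):

```sql
-- joining a fact to an SCD2 dimension: the validity window must be filtered,
-- otherwise every booking is duplicated once per historic version of the customer
select
    f.booking_id,
    f.net_revenue,
    c.customer_segment
from {{ ref('fact_bookings') }}      as f
left join {{ ref('dim_customers') }} as c
    on  f.customer_id         = c.customer_id
    and f.booking_created_at >= c.valid_from
    and f.booking_created_at <  coalesce(c.valid_to, timestamp('9999-12-31'))
```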
4 Datamarts Layer
Purpose: Finalizing the calculation of KPIs and Dimensions on unit grain and providing granular, actionable content to analysts and business stakeholders (Data Products)
- This layer is still on the granularity of objects (e.g. customers) and activities (e.g. bookings) but creates KPIs that require the combination of multiple objects and activities (e.g. customer lifetime value for a coach or the P&L for a location should be calculated here)
- This is the layer that would connect to stakeholder-facing systems (Google Sheets, reporting tools, campaign management systems, and interfaces to marketing tools such as AdWords). The layers upstream of the Datamarts Layer should not be visible to non-BI-team members (i.e. business analysts should only see the Datamarts and Reports Layers).
- Thus, I consider this layer to contain finished, end-user / business-user facing data products that feed data-driven services
- The Datamarts Layer is heavily de-normalized (flattened, one-big-table approach) for quick and simple querying. No complex joins, no lead/lag lookups or other window functions should be necessary on top of this layer.
- SCD2 dimensions are not ok here, as they are too complex for analysts to understand and they always get the joins wrong (e.g. forgetting to filter on valid_from/valid_to). Instead, the current state of a dimension and the historic states should be flattened into the fact object (see the sketch at the end of this section).
- There is no differentiation between PII and non-PII. This should be handled via IAM
- A potential issue with splitting objects and datamarts is increased cloud cost. Incremental loads are important for large data volumes but I would usually start with full loads to reduce complexity.
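A sketch of such a flattened datamart model. The dim_customers_current model, the marketing cost join, and all column names are hypothetical assumptions used only to illustrate the pattern:

```sql
-- models/datamarts/datamart_bookings.sql -- hypothetical sketch of a "one big table"
-- Dimensions are flattened into the fact so analysts never join or handle SCD2 themselves.
select
    f.booking_id,
    f.booking_created_at,
    f.net_revenue,

    -- current state of the dimension, flattened in
    c.customer_segment      as customer_segment_current,

    -- historic state at booking time, also flattened in (no SCD2 logic needed downstream)
    h.customer_segment      as customer_segment_at_booking,

    -- a KPI that combines objects from different sources and domains belongs here, not upstream
    f.net_revenue - coalesce(m.attributed_marketing_cost, 0) as contribution_margin

from {{ ref('fact_bookings') }}              as f
left join {{ ref('dim_customers_current') }} as c
    on f.customer_id = c.customer_id
left join {{ ref('dim_customers') }}         as h
    on  f.customer_id         = h.customer_id
    and f.booking_created_at >= h.valid_from
    and f.booking_created_at <  coalesce(h.valid_to, timestamp('9999-12-31'))
left join {{ ref('fact_marketing_costs') }}  as m
    on f.booking_id = m.booking_id
```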
5 Reports Layer
Purpose: Aggregating Datamart objects for specific use cases
- This layer can include aggregations of the Datamarts Layer for specific use cases (usually used for performance or compliance reasons or when KPIs need to be dynamically aggregated for different dimensions, e.g. unique customers)
- This layer strictly does not show any PII
- Generally, I would avoid building too much in this layer and instead use a semantic layer (such as the one built into dbt, Looker, cube.dev, or Lightdash) to define KPIs and dimensions as dynamic aggregation rules. I think this is really important to avoid an abundance of aggregate tables grouped by different dimensions. For example, counting weekly active users by week requires a different table than counting weekly active users by week and location type, since a user can be active in multiple location types per week and thus weekly active users can’t simply be summed up (see the sketch at the end of this section). A semantic layer can handle the aggregation rules dynamically so that data can be drilled down to any combination of dimensions while keeping the aggregation correct and without duplicating code (DRY principle!)
- There is strictly no complex business logic in this layer, only simple aggregations and filtering (COUNT, SUM, AVG, etc.)
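To make the weekly-active-users example concrete, these are the two aggregate tables you would need without a semantic layer. Table and column names are hypothetical; in dbt these would be two separate report models, shown together here only for comparison:

```sql
-- report_weekly_active_users: WAU per week
select
    activity_isoweek,
    count(distinct user_id) as weekly_active_users
from {{ ref('datamart_user_activities') }}
group by activity_isoweek

-- report_weekly_active_users_by_location_type: WAU per week and location type.
-- Summing this table over location types would overstate WAU, because a user who is
-- active in two location types in the same week is counted once per location type.
-- A semantic layer defines the metric once and re-aggregates it correctly instead.
select
    activity_isoweek,
    location_type,
    count(distinct user_id) as weekly_active_users
from {{ ref('datamart_user_activities') }}
group by activity_isoweek, location_type
```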
P.S.: Among others, we are discussing in detail how to set up these layers for your specific use case in our Masterclass - "Create Massive Business Impact with your Data Team".