How data teams turn data into dollars and make their CEO love them - A Checklist ✅

Very few data teams ever reach the “Promised Land” of data analytics.

Every data team that wants to make an impact on the business needs to achieve a balance of control and agility in decision-making support.

[Image: the control / agility matrix]

On the control side, data teams need to ensure that the information they provide is:

  • Trustworthy
  • Understandable

On the agility side, data teams need to ensure that the information they provide is:

  • Available
  • Accessible

Unfortunately, most data teams never achieve this balance.

The result:

❌ There is an overwhelming number of ad-hoc requests from the business

❌ The data team is often fire-fighting and has no bandwidth to deal with requests promptly

❌ Business teams are not empowered to self-serve even the most basic analytics

❌ The data team is seen as a dashboard factory

❌ The data team has no time at all to work on strategic projects 

❌ Business leaders feel that the impact of the data team on the business is low 

❌ Data Team members are burned out and quit silently

❌ Business leaders lose trust in the data team and the data team gets downsized

This situation is avoidable if data teams focus on the right initiatives at the right time!

From the data jungle to the promised land of data

In the early stages, before the first data team exists, every company starts in the data analytics jungle.

[Image: the data analytics jungle]

There is some agility because business users create tons of Excel sheets by exporting data directly from source systems.

But there is zero control.

The information generated in these Excel sheets soon can't be trusted anymore. On top of that, Excel reaches its scalability limits very quickly.

At this point, many data teams make a critical mistake: They want to go from the jungle straight to the promised land.

They pull data from all different sources into one database and build a neverending number of ad-hoc queries directly on this raw data to satisfy the information needs of business stakeholders.

This is a road to disaster. 

Instead of reaching the promised land, the data team becomes a bottleneck.

The company has lost control over its data and has now also lost the agility of the early days.

Once stuck in the bottleneck, it’s difficult to escape.

[Image: stuck in the bottleneck]

Over my 17 years building data teams in high-growth environments I have made a ton of mistakes and stepped into a few piles of 💩.

I either did things the wrong way or I did the right things in the wrong sequence.

By now, I have built data teams in more than 40 companies. 

I would say the first 30 of those came with varying levels of success and some pretty rough failures, until I finally nailed it in the last 10 or so companies and directed them towards the promised land.

Here is my step-by-step guide to how I build my data teams and data infrastructure today.

Note that this list is specifically designed for building the first data team and infrastructure in a scale-up on a greenfield, but the learnings can also be useful at other stages of company and data-team maturity.

I would argue that any company at any stage should strive to implement all points on the checklist so you can also use it to audit your progress.

The Control Side

When you are in the jungle, the correct approach is to laser-focus on the Control Side first, build a Fortress, and then focus on the Agility Side later.

[Image: the fortress on the control side]

The following points show you what you need to establish a fortress.

It is important to follow the checklist in this sequence and without skipping steps!

✅ Establish a clear and mutual understanding of KPIs and Dimensions for the first iteration of your data infrastructure

A great tool to do that is what I call the KPI / Dimension Map.

This is a cross-table in a Google Sheet that defines all metrics (e.g. Revenue after Returns and Cancellations) and all dimensions that each metric can be combined with (e.g. Country -> the resulting combination is Revenue after Returns and Cancellations by Country).

This map has hundreds of possible combinations and will also show which KPI / Dimension combinations don’t make sense.
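A tiny, purely illustrative slice of such a map (the KPI and dimension names are made up and will differ for your business):

  • Revenue after Returns and Cancellations × Country → needed
  • Revenue after Returns and Cancellations × Marketing Channel → needed
  • Number of Active Customers × Country → needed
  • Website Conversion Rate × Payment Method → doesn't make sense

Each cell in the cross-table is exactly one such combination.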

The map becomes the blueprint for your first data infrastructure data model.

A few things are very important to consider:

  • Don’t spend too much time deeply defining KPIs yet. They will change a million times anyways
  • Don’t spend too much time defining ownership of the KPIs yet. No one knows that yet. Let the data team own everything but make it clear that the business must take over ownership of KPIs eventually.
  • Focus only on use cases that are relevant over the next 6 months! This is very important! Make it clear what will NOT be part of the first iteration of the data infrastructure
  • Make sure to interview and involve ALL business stakeholders in this process

✅ Clearly understand which metrics and dimensions can be sourced without problems and from where

Ask yourself:

  • What data sources do I need for my first data infrastructure?
  • How do KPIs and dimensions map to the various data sources?
  • What are the leading sources in case there is a conflict (e.g. Orders from Shopify vs Google Analytics)?
  • Which problems does the data in the source have? 
  • Are there manual processes involved in generating my data?

✅ Decide on a cloud data warehouse solution

Don’t overthink this step. For 95% of startups, the technology you choose here simply won’t matter from a cost or performance perspective for the next 2-3 years. Choose what you’re comfortable with and get started.

✅ Define a simple process to get your data from your source into a cloud data warehouse

You need to make two decisions:

  • How often do I want to refresh data in my data warehouse?
  • What’s my update strategy?

Again, don’t overthink, and keep it simple.

I see so many teams already struggling at this point because they try to stream data with an incremental update strategy from day 1.

For 95% of startups, it is enough to do simple full loads once per day or every four hours.

The more often you load data and the more complex your update strategy, the more likely things will break. I have built data warehouses in the past that had multiple load failures EVERY day from the first day of operation. 

Today, my data projects typically don’t have a single failure in the first 6 months.
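For illustration, here is a minimal sketch of what such a daily full load can look like in BigQuery, assuming a Cloud SQL backend exposed through a federated connection (connection, dataset and table names are hypothetical):

```sql
-- Full load: rebuild the source-layer table from scratch on every run.
-- Scheduled once per day - no incremental logic, no merge statements.
CREATE OR REPLACE TABLE `my-project.source_layer.orders` AS
SELECT
  *,
  CURRENT_TIMESTAMP() AS loaded_at  -- records when this snapshot was taken
FROM EXTERNAL_QUERY(
  'my-project.eu.orders_backend_connection',  -- hypothetical Cloud SQL connection
  'SELECT * FROM orders;'
);
```

The charm of this pattern is that there is nothing clever to break: if a load fails, you simply rerun it and end up in exactly the same state.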

✅ Decide if you want to buy or build the extraction and load pipelines

Once again, in 95% of use cases, you are better off buying a SaaS application that manages connectors for you rather than building yourself.

A lot of data engineers hate this thought. They would rather roll up their sleeves and start coding than throw money over the fence and have a SaaS provider take care of this part of the data pipeline.

I understand their concerns. I usually spend more time on this question than on choosing the cloud data warehouse technology. 

Some “darlings of the modern data stack” like Fivetran are incredibly easy to get started with but become prohibitively expensive very quickly, so you need to be careful.

Fivetran is great for really low data volumes. I use Fivetran for Data Action Mentor because I know that my data volumes will probably forever stay within Fivetran’s 500,000 monthly active records (MAR) free tier limit, making it the only best-in-class tool with a free tier at the time of writing.

My only data source that is likely to exceed 500,000 MARs is Google Analytics which I stream to my BigQuery DWH via the native BigQuery connection in Google Analytics.

There are great niche tools out there, such as hibase.co, which has a workflow-based pricing model. Its cost doesn’t scale as steeply as pricing models based on processed data volume.

There are also great open-source contenders such as dlthub.com. 

✅ Only load from your sources what you need for downstream data products

This is not fully feasible from the very beginning because you often don’t yet know exactly what you need, even if you have defined the desired consumer-facing data products well enough.

Mapping the source data to your desired data products on a piece of paper is a waste of time at this point. It’s better to just get started.

I usually start loading everything until I’m fairly confident that I know what I don’t need and then I get rid of that asap. 

The more data points you load, the more susceptible you are to breakages if the source system changes.

✅ Use purpose-built transformation tooling from Day 1

Tools like dbt and Dataform are much loved for a reason. Don’t start with scheduled queries in BigQuery or stacking views on views. It pays off to invest some extra time to build your pipelines with dbt or Dataform. Projects built on scheduled queries or views get messy very quickly.
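To make this tangible, here is a minimal sketch of a dbt model (source, model and column names are hypothetical; the “preprocess” naming is explained in the next section):

```sql
-- models/preprocess/preprocess_orders.sql
-- dbt materializes this as a table and tracks its dependency on the declared
-- source, which gives you lineage, documentation and testability for free.
{{ config(materialized='table') }}

SELECT
  CAST(order_id AS STRING)             AS order_id,
  CAST(ordered_at AS TIMESTAMP)        AS ordered_at,
  LOWER(TRIM(country_code))            AS country_code,
  COALESCE(payment_method, 'unknown')  AS payment_method
FROM {{ source('shop_backend', 'orders') }}
```

Dataform works analogously; the point is that every transformation lives in version-controlled code instead of an untracked view or scheduled query.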

✅ Data Modeling best practices

Establish data modeling best practices from Day 1. Feel free to steal my approach (a short SQL sketch of the last two layers follows the list):

1. Source Layer: 1:1 Representation of the source data without any transformation logic between the source and this layer.

2. Preprocess Layer: Clean-up and harmonization of the source (deduplication, date formatting, declaration of data types, timezone harmonization, consistent handling of unknown values)

3. Objects Layer: Translate data from the way operational systems “know” the data to how the business “knows” the data and establish referential integrity between objects. This layer is dimensionally modeled but doesn’t take the Kimball approach too seriously. I normalize to third normal form (3NF) but not rigorously.

4. Datamarts Layer: Finalize the calculation of KPIs and Dimensions across business domains on unit grain and provide granular, actionable content to analysts and business stakeholders

5. Reports Layer: Includes aggregation rules for metrics such as Customer Counts, Average Order Value etc. Only simple aggregations such as COUNT, SUM, AVG are allowed in this layer.
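To make the split between the last two layers concrete, here is a hedged sketch with hypothetical model names: the datamart stays on unit grain, the report only aggregates it.

```sql
-- models/reports/report_revenue_by_country.sql
-- Reports layer: only simple aggregations (SUM, COUNT, AVG) on top of a
-- unit-grain datamart. All business logic already lives upstream.
{{ config(materialized='table') }}

SELECT
  order_date,
  country_code,
  SUM(revenue_after_returns_and_cancellations) AS revenue_after_returns_and_cancellations,
  COUNT(DISTINCT order_id)                     AS number_of_orders
FROM {{ ref('datamart_orders') }}
GROUP BY order_date, country_code
```

If you catch yourself writing CASE statements or joins in this layer, the logic probably belongs in the datamarts or objects layer instead.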

Don’t implement a complex semantic layer (i.e. KPI definitions as code) using tools like Looker, cube.dev or the dbt metrics layer yet. At this stage it is wasted effort, as metric definitions will still change too frequently and rapidly. Wait until definitions are established.

The timing is tricky: Establishing a semantic layer too early leads to constant changes to the semantic layer, establishing it too late will create a wild mess of use-case specific aggregation tables.

✅ Software engineering best practices

I summarize three things under this:

  1. Version Control
  2. Test-driven Development
  3. Environment Separation

You need 1) and 2) from Day 1.

3) can be implemented later. If you use dbt you will already have your own development environment. There is usually no need to establish an additional test environment in the early stages of your data infrastructure - the cost of reduced speed outweighs the benefit of enhanced stability - especially if you do 2) well. 

These are the first automated tests you should implement (in this sequence; a small dbt sketch of a few of them follows the list):

  • Uniqueness and NOT NULL of primary keys
  • Referential integrity for foreign keys
  • Sanity checks Datamarts vs Preprocess (e.g. Sum of orders, Count of users equal between datamarts and preprocess)
  • Stale sources: check for each table in preprocess if data has the expected recency (e.g. orders every day)
  • Consistency between datamarts (it is sometimes necessary to calculate the same KPI in multiple datamarts, e.g. orders in a customer profile table - check between datamarts that these KPIs are consistent)
  • Check for manual errors: check all data that is sourced manually (e.g. utm parameter mappings)
  • Check for unexpectedly changing data: the number of orders for past days should never change. Create a check for that
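For the first two points, dbt’s built-in generic tests (unique, not_null, relationships) are all you need, and dbt’s source freshness feature can cover much of the stale-source check. The remaining checks can be written as singular tests: plain SQL files that fail whenever they return rows. A sketch with hypothetical model names:

```sql
-- tests/assert_order_counts_match_preprocess.sql
-- Sanity check: the datamart must contain exactly as many orders as preprocess.
SELECT
  p.order_count AS preprocess_orders,
  d.order_count AS datamart_orders
FROM (SELECT COUNT(*) AS order_count FROM {{ ref('preprocess_orders') }}) AS p
CROSS JOIN (SELECT COUNT(*) AS order_count FROM {{ ref('datamart_orders') }}) AS d
WHERE p.order_count != d.order_count

-- tests/assert_orders_are_fresh.sql
-- Stale source check: fail if no order has arrived within the last day.
SELECT MAX(ordered_at) AS latest_order
FROM {{ ref('preprocess_orders') }}
HAVING MAX(ordered_at) < TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
```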

✅ Simple data contracts

“The data contract is an API-like agreement between data producers and consumers, with each constraint enforced programmatically in the developer workflow. It captures expectations around schema, business logic, SLAs, or other forms of data governance.

The enforceability of the contract is its most essential component. It is not simply an agreement on paper - that’s documentation. It’s also not simply a monitor on downstream data sets - that’s a test. The programmatic enforcement is what makes expectations into a contract.”

This is the definition that “Mr. Data Contract” Chad Sanderson uses.

Here’s my take on which components of data contracts you should have from Day 1:

  • API-like: not needed from Day 1, unless you’re already using tools like Snowplow, which have data contracts baked in
  • Agreement: needed from Day 1 and often missing. The biggest issue in scale-ups is that data producers and data consumers don’t talk enough. Often this can be solved sufficiently by talking to each other and documenting in a scalable way (for example with dbt’s auto-documentation)
  • Constraints enforced programmatically in the developer workflow: not needed from Day 1. Just talking to each other can prevent many issues and automated tests in combination with alerting and alert handling rules can cover the remaining issues.
  • Captures expectations around schema, business logic, SLAs, or other forms of data governance: needed from Day 1. Be explicit with your internal data providers about what data you need and what you use it for.

You should always know the development stage of your internal source systems, a.k.a. your backend database. 

Some backend databases are so stable and well-designed that you hardly need to do any transformations on them and pipelines never break.

Others are in a maturity stage that will completely mess up your data pipelines every other day. 

Know the maturity stage of your sources and plan accordingly! 

This point on the checklist was one of the most critical during my time as Head of Business Analytics at Rocket Internet, where the backend database developed so rapidly that it caused pipeline breakages every day.

✅ Don’t forget about data privacy

At a minimum, start by understanding what personal data you collect, why you need it, and how it flows through your systems. This clarity is crucial for determining lawful basis and compliance needs.

Understand the different drivers of “lawful basis” and assess the lawful basis for each data processing activity. Is it consent? Contract? Legitimate interest?

If unsure, try to keep your DWH clean of personal data that has no lawful basis and clarify the edge cases with your data protection officer.

This topic has a lot of grey areas, but it can be tackled pragmatically and doesn’t need to become a roadblock.
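One pragmatic pattern, sketched here for BigQuery with hypothetical column names, is to drop or pseudonymize personal data in the preprocess layer so that raw PII never reaches the layers your analysts and tools can access:

```sql
-- models/preprocess/preprocess_customers.sql
-- Raw PII stays in the source layer; downstream only sees a pseudonymized key.
{{ config(materialized='table') }}

SELECT
  * EXCEPT (email, phone_number, full_name),
  TO_HEX(SHA256(LOWER(email))) AS customer_email_hash  -- stable pseudonymous join key
FROM {{ source('shop_backend', 'customers') }}
```

Keep in mind that a hash is still pseudonymized personal data under GDPR, so this reduces exposure but does not replace a lawful basis.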

Recap Control Side

✅ Establish a clear and mutual understanding of KPIs and Dimensions for the first iteration of your data infrastructure

✅ Clearly understand which metrics and dimensions can be sourced without problems and from where

✅ Decide on a cloud data warehouse solution

✅ Define a simple process to get your data from your source into a cloud data warehouse

✅ Decide if you want to buy or build the extraction and load pipelines

✅ Only load from your sources what you need for downstream data products

✅ Use purpose-built transformation tooling from Day 1

✅ Data Modeling best practices

✅ Software engineering best practices

✅ Simple data contracts

✅ Don’t forget about data privacy

The Agility Side

The control side focused on setting up the foundation to provide decision makers with trustworthy and understandable information and to establish the data team as a central decision-support hub.

The agility side now focuses on increasing the availability and accessibility of information so that business teams can iteratively ask questions of the data and receive reliable answers instantly.

[Image: the agility side of the matrix]

This is the path to more innovation and higher precision decisions. 

✅ Establish KPI Ownership 

Remember how we said that you should not worry about KPI ownership yet when we discussed the control side?

This changes now.

A data team will never have the best business domain experts, and KPIs must be defined and owned by the business. The data team can support this process but should hand over the responsibility for KPI definitions to the right business teams.

✅ Create a mapping from the vision of the business to data products

The image below shows how this looked for the Data Action Mentor business in early 2024.

[Image: mapping from the business vision to data products for Data Action Mentor, early 2024]

The business defines the WHAT and the data team defines the HOW. 

For example, at the beginning of building Data Action Mentor, I had many manual processes that created a huge amount of manual work. Every time a Mentee requested a call with a Mentor, I manually coordinated the call. One of my Key Objectives was to reduce this manual burden, and I wanted to measure that objective by the number of weekly hours spent on boring repetitive tasks.

As a business owner for the KPI “weekly hours spent on boring repetitive tasks” I defined the services that I needed to achieve my goal and I decided that I needed a Call Management Automation Service (among others).

Since I am also the data team leader of my own company, I decided that I needed to build a Mentee 360 Profile Mart, a Mentor 360 Profile Mart and a Call Request Mart in the datamarts layer of my Data Warehouse to be able to provide the service with the necessary data.

I then broke down the marts into the tables that I need to build in the Objects layer and so on. 

These steps provide clarity on the business purpose of each object in the data warehouse. 

I hardly see any team doing this exercise and I think it’s one of the main reasons why so many data teams fail.

✅ Build a KPI Tree

While the mapping above gives you a high-level overview of how your data products contribute to business goals, it doesn’t give you a deeper understanding of which levers you can use to influence outcome metrics (such as revenue or customer retention).

This is critical.

I often see too much focus on outcome metrics and a lack of understanding of which factors influence these outcomes.

This is another reason data teams fail: they don’t provide actionable guidance.

The best tool to get this understanding is a KPI Tree (sometimes called KPI graph). The KPI Tree for Data Action Mentor looks like this.

[Image: the Data Action Mentor KPI Tree]
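Since the actual tree is specific to Data Action Mentor, here is a generic, illustrative decomposition of how such a tree works: Revenue after Returns ≈ Orders × Average Order Value × (1 − Return Rate), Orders = Sessions × Conversion Rate, and Sessions branches further into channels. Each leaf is a lever that a specific team can actually pull, which is exactly the actionability that pure outcome metrics lack.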

✅ Establish roles & responsibilities between the data team and business teams

Most data teams lack the correct roles & responsibilities to provide value to the business.

They are either too technical and lack business understanding or they are strong on the business side but lack technical skills to get things done and keep things running.

The goal of your organizational setup is to minimize communication gaps between each role.

The image below demonstrates that.

[Image: roles & responsibilities between the data team and business teams]

In an ideal world, you have dedicated colleagues for each of these roles, but in a new data team this is not realistic.

I usually start a data team with someone who is a mix of an analytics engineer, analyst and product owner and then add a data engineer next. 

It is absolutely critical that the first data team member has strong communication and interpersonal skills and an interest in the business model of the company.

I also like to establish a super user within most business teams early on. 

As a next step, I often decentralize teams even further and move analysts into the business teams, but for less mature organizations (i.e. startups before Series B/C) I have found the above setup most suitable.

✅ Avoid building software for specific use cases in the data team

Very often, data teams are burdened with Martech and Web Tracking jobs such as defining and implementing tracking events and tag management tools. 

They also often build tooling for financial reporting or even ERP systems.

I try to avoid this as building these systems is not typically the key strength of a data team and distracts from a focus on decision-making support. 

✅ Provide a user-facing layer that is accepted by Super Users

Once you have built your fortress and established data modeling best practices, you are already 80% there.

Now, don’t fall into the trap of putting Tableau, Power BI or Google Data Studio on top of your tables and only providing access to aggregate tables via these dashboarding tools.

Firstly, most business users absolutely hate Tableau, Power BI etc. They would rather decide with their gut than use Tableau.

Secondly, it is hard to establish user trust if they can’t access KPIs at a granular level (e.g. on user or order grain).

Thirdly, dashboarding tools’ purpose is not to answer questions - their job is to prompt questions. 

Focus first on tools that can answer questions! Those tools are typically the tools that business users love and data people hate - such as Excel / Google Sheets or Amplitude for product people.

In 90% of the cases when building a new data infrastructure, I start by giving Super Users access to the datamarts layer via Google Sheets. 

Since most KPIs are already created in the datamarts layer, you can somewhat control the mess that Excel and Google Sheets usually create.

On top of that, I use Google Groups to manage permissions on sheets and folders and establish a simple, yet effective governance layer.
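If your warehouse is BigQuery, this can be as simple as granting the Google Group read access to the datamarts dataset and letting Super Users query it from Sheets via Connected Sheets (dataset and group names are hypothetical):

```sql
-- Read-only access to the datamarts layer for the super-user group.
-- Who can query it from Sheets is then managed purely via group membership.
GRANT `roles/bigquery.dataViewer`
ON SCHEMA `my-project.datamarts`
TO "group:data-super-users@mycompany.com";
```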

✅ Run your data team like a startup

I don’t often see data teams who do this last point well.

To me, it is the most crucial point of all.

As a data leader, you are not only a data specialist, you also need to be a marketer and a product manager. Think of your data infrastructure, your dashboards and your analyses as products that need to be marketed. Your stakeholders are your customers.

As with every product you are facing market risk and product risk.

👉 Product Risk: Am I able to build what I am planning to build?

👉 Market Risk: Do people really want and need what I am planning to build?

It doesn’t matter if you are creating

🧘‍♀️ an online yoga course

✈️ an airplane

💉 a new vaccine

📊 or … dashboards and tables

You always have product risk and market risk. Some products have higher market risk, others have higher product risk.

  • A drug that cures cancer has almost zero market risk but a very high product risk.
  • Most data analytics products have almost zero product risk but very high market risk.

With today's modern data tools, there’s pretty much nothing you can’t build = Zero product risk.

Yet, we data people often tend to ignore market risk and start crafting and creating solutions waaaay too quickly.

Don’t fall into this trap. Spend a lot of time with your customers and maximize this time by using tools such as Data Team office hours, a Newsletter, Events etc.

You need to be visible and stay visible if you want to create impact.

Summary

Here comes the complete checklist, so you can hang it on the wall:

✅ Establish a clear and mutual understanding of KPIs and Dimensions for the first iteration of your data infrastructure

✅ Clearly understand which metrics and dimensions can be sourced without problems and from where

✅ Decide on a cloud data warehouse solution

✅ Define a simple process to get your data from your source into a cloud data warehouse

✅ Decide if you want to buy or build the extraction and load pipelines

✅ Only load from your sources what you need for downstream data products

✅ Use purpose-built transformation tooling from Day 1

✅ Data Modeling best practices

✅ Software engineering best practices

✅ Simple data contracts

✅ Don’t forget about data privacy

✅ Establish KPI Ownership

✅ Create a mapping from the vision of the business to data products

✅ Build a KPI Tree

✅ Establish roles & responsibilities between the data team and business teams

✅ Avoid building software for specific use cases in the data team

✅ Provide a user-facing layer that is accepted by Super Users

✅ Run your data team like a startup

We cover all points from this checklist and many more in our Masterclass "Create massive business impact with your data team".