Stop Your Data Tests From Becoming a Nightmare

“Tests were so complex that maintaining them was harder than maintaining the workflows they were supposed to test.”

This was our reality at Rocket Internet, where we went from one extreme to the other in our testing journey. As the Head of Business Analytics responsible for building data infrastructure across portfolio companies such as Zalando and Delivery Hero, I’ve seen firsthand how testing can make or break your data stack.

Why This Matters

Too many analytics and data engineers skip testing altogether, assuming it’s QA’s responsibility. 

Others, like my team at Rocket Internet, overcompensate with hundreds of failing tests that eventually get ignored. 

The sweet spot lies in between: a minimum viable test suite that catches the most critical issues without becoming a maintenance burden.

Testing Do’s and Don’ts

Before we jump into your Minimum Viable Test Suite, let’s look at a few general guidelines for setting up tests.

Before that, it makes sense to spend 5 minutes reading my data warehouse layering guide, so that you understand the terms I’m referencing in this episode. You can also keep reading and just look them up when you need them.

Ok, now let's get into this:

1 Tests should only use simple SQL statements. If a test requires many lines of code or complex nested queries, it will likely have high maintenance costs.

2 Use different levels of severity for tests and define what they mean (for example: Warnings should be reviewed but don’t block committing and merging code; Errors must not be committed to production).

3 Tests should only trigger an Error or Warning when not acting on the alert would cause a real problem. For example, if the source is missing one out of 10 million rows, that’s not a serious problem and it shouldn’t trigger a Warning, let alone an Error. Too many errors and warnings lead to notification fatigue, and no one will act on the notifications.

4 Always test only for the root cause of an issue and not for the symptom. For example, if you know that there is a manual process somewhere that increases COGS (cost of goods sold) because a user adds duplicated product IDs to a manual input sheet, then don’t write a test for unusual fluctuations in COGS; instead, write a test that checks for duplicated product_ids (see the sketch after this list).

5 Prevention > Detection. Before writing a test that checks for a known issue, always ask yourself if that issue can be prevented. The example above could be prevented by adding a key constraint on product IDs or - even better - by making it hard or impossible to add duplicate IDs in the manual input sheet (e.g. by adding a duplicate checker to the sheet).

6 Clearly separate tests of data sources from tests of processes in the data warehouse. Adding to points 4 and 5, make sure it’s clear whether you’re testing issues that are under your control or under the control of a data provider. Ideally, you can prevent source issues or outsource source tests to the data provider (Data Contracts).

7 Always run tests before committing. Establish rules about what happens when tests fail (e.g. no commit when there is an Error; commit is allowed when there are only Warnings).

8 Don’t design tests that are dependent on each other. For example, if you verify that the customer count is consistent between the datamarts layer and the preprocess layer, you don’t also need to verify that the customer count is consistent between the objects layer and the datamarts layer (if one fails, the other will automatically fail too).

9 Each test should be documented and contain information about what the test does, what the expected result is, what needs to be done if the result is not as expected, and - if known - what potential causes are (e.g. Revenue gets wrongly increased by X%). Ideally, this documentation happens right within the notification to the person who needs to act.

10 Avoid tests that are subject to frequent change. Checking that the VAT rate for German transactions is always exactly 19% could make sense if the payment provider has a history of not providing VAT amounts, which leads to wrong net revenues. Checking that the price for a certain item is always X, however, will lead to high maintenance costs if the price changes often.

11 Separate tests from business alerts. This one is similar to point 6: for business alerts, it’s typically business users who need to act, not the data team. Distinguish between alerts triggered by issues with the data pipelines and alerts triggered by issues with business performance.
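To make points 1 and 4 concrete, here is a minimal sketch of such a root-cause test in plain SQL. The table and column names (manual_input_products, product_id) are placeholders for illustration, not a prescribed schema:

    -- Sketch: root-cause test for duplicated product IDs in a manual input sheet.
    -- Table and column names are placeholders; adapt them to your warehouse.
    -- Suggested severity: Error, since duplicates directly inflate COGS downstream.
    SELECT
        product_id,
        COUNT(*) AS occurrences
    FROM manual_input_products
    GROUP BY product_id
    HAVING COUNT(*) > 1;
    -- Expected result: zero rows. Any returned row is a duplicate that must be
    -- fixed in the input sheet before the change reaches production.

Note how the query stays a handful of lines, tests the root cause (duplicate IDs) rather than the symptom (a jump in COGS), and has an unambiguous expected result.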

The Minimum Viable Test Suite: Layer by Layer

Let’s explore the essential tests for each layer of your data warehouse that will cover 80-90% of potential issues.

Source and Preprocess Layer

Key tests to implement:

→ Stale Data Detection

  • Monitor update frequency patterns for each source table
  • Alert when the last update exceeds normal patterns (see the sketch after this list)

→ Manual Entry Validation

  • Crucial for data sources requiring human input
  • Example: Check for duplicate entries in sourcing team’s materials cost sheet
  • Implement source-level validation when possible (like Google Sheets data validation)

→ Schema Change Monitoring

  • Watch for unexpected ‘deleted_at’ columns popping up
  • Critical for maintaining data integrity and filtering logic
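As an illustration of stale data detection, here is a minimal sketch in SQL. The table raw_orders, the loaded_at column, and the 24-hour threshold are assumptions; in practice the threshold should reflect each source’s normal update pattern, and the interval syntax depends on your warehouse:

    -- Sketch: stale data detection for one source table.
    -- raw_orders, loaded_at, and the 24-hour threshold are placeholders.
    -- Suggested severity: Warning, escalating to Error if the delay persists.
    SELECT
        MAX(loaded_at) AS last_load
    FROM raw_orders
    HAVING MAX(loaded_at) < CURRENT_TIMESTAMP - INTERVAL '24' HOUR;
    -- Expected result: zero rows. A returned row means the source has not been
    -- refreshed within its usual window and the load job should be investigated.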

Objects Layer

Essential validations:

→ Primary Key Integrity 

  • Verify uniqueness and non-null constraints
  • Foundation for reliable data relationships (see the sketch after this list)

→ Referential Integrity 

  • Ensure foreign keys in fact tables match dimension entries
  • Prevent accidental data loss during joins

→ Technical Debt Monitoring 

  • Measure impact when compromises are necessary
  • Keep track of potential future issues

→ Structured Data Validation 

  • For JSON extractions, verify critical fields
  • Monitor for unexpected null values in important variables
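Here is a minimal sketch of the first two checks, assuming an orders fact table with a customer_id foreign key and a customers dimension (both names are placeholders):

    -- Sketch: primary key integrity on the customers dimension.
    -- Suggested severity: Error, since a broken key breaks every downstream join.
    SELECT
        customer_id,
        COUNT(*) AS occurrences
    FROM customers
    GROUP BY customer_id
    HAVING COUNT(*) > 1 OR customer_id IS NULL;
    -- Expected result: zero rows (no duplicate and no NULL keys).

    -- Sketch: referential integrity between the orders fact table and the
    -- customers dimension. Suggested severity: Error.
    SELECT
        o.order_id,
        o.customer_id
    FROM orders AS o
    LEFT JOIN customers AS c
        ON o.customer_id = c.customer_id
    WHERE c.customer_id IS NULL;
    -- Expected result: zero rows. Any returned order references a customer
    -- that does not exist in the dimension and would be dropped in inner joins.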

Data Marts Layer

Critical checks:

→ Primary Key Verification

  • Double-check integrity wasn’t broken by upstream changes
  • Maintain data reliability at the presentation layer

→ Sanity Checks

  • Compare simple counts and sums between marts and pre-process layer
  • Keep validation code straightforward
  • Avoid complex logic replication

→ Historical Data Stability

  • Monitor unchangeable metrics (like historical revenue)
  • Use hardcoded dates for consistent verification (e.g. total revenue for last year should not change)

→ Cross-Reference Validations

  • Ensure KPI consistency across different mart tables
  • Example: Compare Total Revenue in the customer profile datamart vs the orders datamart
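To illustrate the cross-reference validation, here is a minimal sketch in SQL. The table names customer_profiles and orders_mart, the revenue columns, and the 0.1% tolerance are assumptions you would adapt to your own marts and rounding behaviour:

    -- Sketch: cross-reference check of total revenue across two data marts.
    -- Table names, column names, and the tolerance are placeholders.
    -- Suggested severity: Error, since inconsistent KPIs erode stakeholder trust.
    WITH profile_revenue AS (
        SELECT SUM(lifetime_revenue) AS total_revenue
        FROM customer_profiles
    ),
    order_revenue AS (
        SELECT SUM(revenue) AS total_revenue
        FROM orders_mart
    )
    SELECT
        p.total_revenue AS profile_total,
        o.total_revenue AS orders_total
    FROM profile_revenue AS p
    CROSS JOIN order_revenue AS o
    WHERE ABS(p.total_revenue - o.total_revenue) > 0.001 * o.total_revenue;
    -- Expected result: zero rows. A returned row means the two marts disagree
    -- on total revenue by more than 0.1% and the mart logic needs review.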

Bottom Line

Testing doesn’t have to be overwhelming or insufficient. Start with these essential tests that cover most critical scenarios. They form a foundation that’s maintainable and effective.

Remember: the goal isn’t to test everything but to test the right things. Your future self (and your stakeholders) will thank you for investing in this minimum viable test suite today.

Build it into your development workflow. Make it a habit. Because finding out about data issues from your stakeholders is much more expensive than catching them early through systematic testing.

Want to join us?

Join 500+ future data leaders for tips, strategies, and resources to build impactful data teams and live a better life 🏝

Whenever you need me, here's how I can help you:

Data Action Mentor Masterclass: Create massive business impact with your data team. 

This class is built for ambitious data professionals and data leaders. I spent 17 years building data teams for high-growth companies such as Rocket Internet, Zalando, Takeaway.com, Lazada (acquired by Alibaba), and many more. In this class, I am sharing all my lessons, failures, and successes so that you can make your stakeholders and CEO happy and accelerate your data career.  

Impactful Data Teams for Scale-ups

I build data infrastructure and data teams with immediate business impact for global B2C scale-ups and grown-ups in e-commerce, insurance, fintech, and consumer subscription. My proven approach has helped dozens of scale-ups. I build the infrastructure at a fixed price and then empower the business to move rapidly from data to action. If you know a consumer internet scale-up that needs an impactful data team, hit me up!

Data Audits

I run Data Audits to support you in transforming your data team into a strong business partner. You will get a tailor-made list with action items that will help you create massive business impact with your data team.

Knowledge Base

I am committed to adding actionable, free content to our Data Action Mentor knowledge base to help you on your journey to create massive business impact with your data team.