Overview
A test suite earns its cost in two places: catching the regression you would have shipped, and documenting what the code is supposed to do. Read general-principles first.
The pyramid is dead; ship a testing trophy
The old guidance was a pyramid: many unit tests, fewer integration, a handful of e2e. Modern services invert the middle. The trophy shape:
- Static checks at the base (types, lint, schema validation).
- A thick integration layer above (the function under test plus a real database, a real HTTP server).
- A few golden-path e2e tests at the top (one happy login, one happy checkout).
- Unit tests reserved for genuinely tricky pure logic (a parser, a pricing rule, a date calculation).
Most “unit” tests of glue code are testing the doubles, not the system. Integration tests against a Postgres instance in Docker catch the bugs that mocked-repository unit tests miss. See fastapi for the HTTP layer and observability for the telemetry layer.
Zero flake budget
A flaky test is worse than no test. It teaches the team to ignore failures, and the one real failure hides in the noise. A flake is a P1: quarantine on first observed failure, file a ticket with the seed and stack, then land a deterministic version within a sprint or delete the test.
Deterministic tests, no exceptions
A test that depends on wall-clock time, network availability, or unseeded randomness will eventually fail for a reason unrelated to your code.
- Freeze time with freezegun, vi.useFakeTimers, or a clock parameter.
- Stub external HTTP with respx, nock, or a recorded VCR cassette.
- Seed every random source; pass a Random instance into the function under test.
- Run tests in random order locally to catch hidden ordering coupling.
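The clock-parameter and seeded-Random items above can be sketched with the standard library alone. This is a minimal illustration, not a prescribed pattern: is_expired, pick_retry_delay, and their parameters are hypothetical names invented for the example.

```python
import random
from datetime import datetime, timedelta, timezone
from typing import Callable


def _utc_now() -> datetime:
    """Production clock; tests pass their own callable instead."""
    return datetime.now(timezone.utc)


def is_expired(issued_at: datetime, ttl: timedelta,
               now: Callable[[], datetime] = _utc_now) -> bool:
    """Return True once `ttl` has elapsed since `issued_at` (hypothetical rule)."""
    return now() >= issued_at + ttl


def pick_retry_delay(rng: random.Random, base_seconds: float = 1.0) -> float:
    """Return a jittered delay in [base_seconds, 2 * base_seconds)."""
    return base_seconds * (1.0 + rng.random())


def test_token_expires_once_ttl_has_elapsed():
    issued = datetime(2024, 1, 1, tzinfo=timezone.utc)
    # The clock is frozen 30 minutes after issue; no wall-clock dependency.
    frozen = lambda: issued + timedelta(minutes=30)
    assert is_expired(issued, timedelta(minutes=30), now=frozen)
    assert not is_expired(issued, timedelta(minutes=31), now=frozen)


def test_retry_delay_is_reproducible_for_a_given_seed():
    # Two generators with the same seed produce the same sequence.
    first = pick_retry_delay(random.Random(1234))
    second = pick_retry_delay(random.Random(1234))
    assert first == second
    assert 1.0 <= first < 2.0
```

Passing the clock and the random source as arguments keeps the production defaults intact while letting each test pin both inputs, so a failure always points at the code rather than at the time of day.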
Test names describe behavior, not implementation
A test name is read more often than the assertion. Write it as the rule the test enforces:
# Bad
def test_calculate_total_1():
    ...
# Better
def test_total_excludes_voided_line_items():
    ...
The diff that breaks the test should be obvious from the name alone. See python for pytest patterns and typescript for Vitest patterns.
Snapshot tests are usually a smell
Snapshot tests catch the diff but not the intent. A passing snapshot proves the output did not change, not that the output is correct. The first run freezes whatever bug the code had on day one. Use snapshots only for stable serialization formats (a JSON-LD blob, a compiler IR) or generated assets where a human inspects the diff. Avoid them for React component trees, formatted strings, or anything a developer will “update” without reading.
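As a rough sketch of the one case where a snapshot is defensible, the example below pins a stable serialization format to a golden string that a human has reviewed. The function to_canonical_json and the GOLDEN value are assumptions made up for illustration; only json.dumps and its sort_keys and separators arguments come from the standard library.

```python
import json


def to_canonical_json(record: dict) -> str:
    """Serialize with sorted keys and no whitespace so the output is stable."""
    return json.dumps(record, sort_keys=True, separators=(",", ":"))


# Golden value checked in alongside the test and reviewed by a person.
GOLDEN = '{"id":7,"tags":["a","b"],"type":"Product"}'


def test_canonical_json_matches_the_reviewed_golden_value():
    record = {"type": "Product", "id": 7, "tags": ["a", "b"]}
    assert to_canonical_json(record) == GOLDEN
```

The golden string is short enough to read in a diff, which is the property that makes the snapshot reviewable rather than something to regenerate blindly.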
Mocking is a smell signaling tight coupling
A test that mocks five collaborators is testing the mocks, not the code. Heavy mocking means the function knows too much about its environment. Prefer fakes (an in-memory FakeUserRepo) over Mock objects asserting call counts. Prefer a real dependency in a container (real Postgres in Docker, real Redis) over any double. If a function genuinely needs five collaborators, it does too much; split it.
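One way to picture the fake-over-mock preference is the sketch below. FakeUserRepo is named in the text, but its methods, the User type, the UserRepo protocol, and change_email are all hypothetical and chosen only to keep the example self-contained.

```python
from dataclasses import dataclass
from typing import Optional, Protocol


@dataclass(frozen=True)
class User:
    id: int
    email: str


class UserRepo(Protocol):
    """The interface both the real repository and the fake satisfy."""
    def get(self, user_id: int) -> Optional[User]: ...
    def save(self, user: User) -> None: ...


class FakeUserRepo:
    """In-memory fake: it stores and returns users, with no database."""

    def __init__(self) -> None:
        self._users: dict[int, User] = {}

    def get(self, user_id: int) -> Optional[User]:
        return self._users.get(user_id)

    def save(self, user: User) -> None:
        self._users[user.id] = user


def change_email(repo: UserRepo, user_id: int, new_email: str) -> User:
    """Hypothetical function under test."""
    user = repo.get(user_id)
    if user is None:
        raise LookupError(f"no user {user_id}")
    updated = User(id=user.id, email=new_email)
    repo.save(updated)
    return updated


def test_change_email_persists_the_new_address():
    repo = FakeUserRepo()
    repo.save(User(id=1, email="old@example.com"))
    change_email(repo, 1, "new@example.com")
    # Assert on the resulting state, not on which methods were called.
    assert repo.get(1) == User(id=1, email="new@example.com")
```

The test checks what ended up stored, so it keeps passing if change_email is refactored to call the repository differently, which a call-count assertion on a Mock would not.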
Coverage is a floor, not a target
A 100% coverage suite that exercises only the happy path is a 0% useful suite. Coverage is a sanity check that critical paths run at all, not a goal to optimize against. Set a floor (60 to 80 percent) and fail CI below it. Never gate a PR on coverage going up; gate it on the four cases from general-principles.
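To make the gap between coverage and usefulness concrete, here is a small sketch with a hypothetical pricing function. A single happy-path test executes every line of it, so a line-coverage tool would report 100 percent, yet an obvious edge case is never exercised.

```python
def apply_discount(total_cents: int, percent: int) -> int:
    """Return the total after a percentage discount (hypothetical rule).

    Bug left in on purpose: `percent` is never validated.
    """
    return total_cents - total_cents * percent // 100


def test_ten_percent_off_reduces_the_total():
    # This one test runs every line of apply_discount.
    assert apply_discount(1000, 10) == 900


# Not covered by any test above:
# apply_discount(1000, 150) returns -500, a negative total.
```

A coverage floor would pass this suite without complaint; only a test written for the out-of-range case would expose the negative total.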
Test what users would notice
The bar is whether a user, an integrator, or an upstream caller would notice the difference. Tests for private helpers churn every refactor. Tests for the public contract survive the refactor. For LLM-facing systems, write evals at the contract layer (prompt input, JSON output, citation), not at every internal call. See evaluation for the eval pattern and rag for the retrieval contract.
One assertion idea per test
A test asserts one rule. Multiple assert lines are fine when they all express the same idea (status is 200 and the body matches the schema). A test that asserts five unrelated rules fails opaquely; the next reader cannot tell which rule the diff broke. Split it.
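The distinction between several assert lines and several ideas might look like the sketch below. The create_order handler, its payload shape, and its status codes are invented for illustration and do not come from any real framework.

```python
def create_order(payload: dict) -> tuple[int, dict]:
    """Hypothetical handler returning (status, body)."""
    if not payload.get("items"):
        return 422, {"error": "items must not be empty"}
    total = sum(i["price_cents"] * i["qty"] for i in payload["items"])
    return 201, {"id": 1, "total_cents": total}


def test_valid_order_is_created_with_its_total():
    # Two asserts, one idea: a valid order succeeds and reports its total.
    status, body = create_order({"items": [{"price_cents": 250, "qty": 2}]})
    assert status == 201
    assert body["total_cents"] == 500


def test_empty_order_is_rejected():
    # A separate rule gets a separate test, so a failure names the broken rule.
    status, body = create_order({"items": []})
    assert status == 422
    assert "items" in body["error"]
```

If the validation rule regresses, only test_empty_order_is_rejected fails, and its name already says which behavior changed.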