Testing your Python Code with Hypothesis

I can think of several Python packages that have greatly improved the quality of the software I write. Two of them are pytest and hypothesis. The former adds an ergonomic framework for writing tests and fixtures and a feature-rich test runner. The latter adds property-based testing that can ferret out all but the most stubborn bugs using clever algorithms, and that’s the package we’ll explore in this course.

In an ordinary test you interface with the code you want to test by generating one or more inputs to test against, and then you validate that it returns the right answer. But that, then, raises a tantalizing question: what about all the inputs you didn’t test? Your code coverage tool may well report 100% test coverage, but that does not, ipso facto, mean the code is bug-free.

One of the defining features of Hypothesis is its ability to generate test cases automatically in a manner that is repeatable, methodical, and configurable:


Repeated invocations of your tests result in reproducible outcomes, even though Hypothesis does use randomness to generate the data.


You are given a detailed answer that explains how your test failed and why it failed. Hypothesis makes it clear how you, the human, can reproduce the failure.


You can refine its strategies and tell it where or what it should or should not search for. At no point are you compelled to modify your code to suit the whims of Hypothesis if it generates nonsensical data.

So let’s look at how Hypothesis can help you discover errors in your code.

Installing & Using Hypothesis

You can install hypothesis by typing pip install hypothesis. It has few dependencies of its own, and should install and run everywhere.

Hypothesis plugs into pytest and unittest by default, so you don’t have to do anything to make it work with them. In addition, Hypothesis comes with a CLI tool you can invoke with hypothesis. But more on that in a bit.

I will use pytest throughout to demonstrate Hypothesis, but it works equally well with the builtin unittest module.

A Quick Example

Before I delve into the details of Hypothesis, let’s start with a simple example: a naive CSV writer and reader. A topic that seems simple enough: how hard is it to separate fields of data with a comma and then read it back in later?

But of course CSV is frighteningly hard to get right. The US and UK use '.' as a decimal separator, but in large parts of the world they use ',' which of course results in immediate failure. So then you start quoting things, and now you need a state machine that can distinguish quoted from unquoted; and what about nested quotes, etc.

The naive CSV reader and writer is an excellent stand-in for any number of complex projects where the requirements outwardly seem simple but there lurks a large number of edge cases that you must take into account.

def naive_write_csv_row(fields):
    return ",".join(f'"{field}"' for field in fields)

def naive_read_csv_row(row):
    return [field[1:-1] for field in row.split(",")]

Here the writer simply wraps each field in double quotes before joining them together with ','. The reader does the opposite: it splits the row on the comma and assumes each resulting field is quoted.

A naive roundtrip pytest proves the code “works”:

def test_write_read_csv():
    fields = ["Hello", "World"]
    formatted_row = naive_write_csv_row(fields)
    parsed_row = naive_read_csv_row(formatted_row)
    assert fields == parsed_row

And evidently so:

$ pytest test.py::test_write_read_csv
test.py::test_write_read_csv PASSED   [100%]

And for a lot of code that’s where the testing would begin and end. A couple of lines of code to test a couple of functions that outwardly behave in a manner that anybody can read and understand. Now let’s look at what a Hypothesis test would look like, and what happens when we run it:

import hypothesis.strategies as st
from hypothesis import given

@given(fields=st.lists(st.text(), min_size=1, max_size=10))
def test_read_write_csv_hypothesis(fields):
    formatted_row = naive_write_csv_row(fields)
    parsed_row = naive_read_csv_row(formatted_row)
    assert fields == parsed_row

At first blush there’s nothing here that you couldn’t divine the intent of, even if you don’t know Hypothesis. I’m asking for the argument fields to be a list of between one and ten generated text strings. Aside from that, the test operates in exactly the same manner as before.

Now watch what happens when I run the test:

$ pytest test.py::test_read_write_csv_hypothesis
E       AssertionError: assert [','] == ['', '']
test.py:44: AssertionError
----- Hypothesis ----
Falsifying example: test_read_write_csv_hypothesis(
    fields=[','],
)
FAILED test.py::test_read_write_csv_hypothesis - AssertionError: assert [','] == ['', '']

Hypothesis quickly found an example that broke our code. As it turns out, a list of [','] breaks our code. We get two fields back after round-tripping the code through our CSV writer and reader — uncovering our first bug.
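The failure is easy to trace by hand. The snippet below repeats the two naive functions from earlier and feeds them the falsifying example directly, without Hypothesis; it is only meant to show where the extra field comes from.

```python
def naive_write_csv_row(fields):
    return ",".join(f'"{field}"' for field in fields)


def naive_read_csv_row(row):
    return [field[1:-1] for field in row.split(",")]


# The single field "," is wrapped in quotes, giving the 3-character row: ","
row = naive_write_csv_row([","])
print(row)

# The reader splits on the comma *inside* the quoted field, so it sees two
# one-character pieces, each a lone quote, and strips both down to "".
parsed = naive_read_csv_row(row)
print(parsed)
```

The reader has no notion of being "inside" a quoted field, which is exactly the state machine problem mentioned earlier.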

In a nutshell, this is what Hypothesis does. But let’s look at it in detail.

Understanding Hypothesis

Using Hypothesis Strategies

Simply put, Hypothesis generates data using a number of configurable strategies. Strategies range from simple to complex. A simple strategy may generate bools; another integers. You can combine strategies to make larger ones, such as lists or dicts that match certain patterns or structures you want to test. You can clamp their outputs based on certain constraints, like only positive integers or strings of a certain length. You can also write your own strategies if you have particularly complex requirements.

Strategies are the gateway to property-based testing and are a fundamental part of how Hypothesis works. You can find a detailed list of all the strategies in the Strategies reference of their documentation or in the hypothesis.strategies module.

The best way to get a feel for what each strategy does in practice is to import them from the hypothesis.strategies module and call the example() method on an instance:

>>> import hypothesis.strategies as st
>>> st.integers().example()
>>> st.lists(st.floats(), min_size=5).example()

Draw from the floats strategy a few times and you will likely see inf or nan in the list. By default, all strategies will – where feasible – attempt to test all legal (but possibly obscure) forms of values you can generate of that type. That is particularly important as corner cases like inf or NaN are legal floating-point values but, I imagine, not something you’d ordinarily test against yourself.

And that’s one pillar of how Hypothesis tries to find bugs in your code: by testing edge cases that you would likely miss yourself. If you ask it for a text() strategy you’re as likely to be given Western characters as you are a mishmash of Unicode and escape-encoded garbage. Understanding why Hypothesis generates the examples it does is a useful way to think about how your code may interact with data it has no control over.
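When such values genuinely fall outside what a function is meant to accept, the strategy itself can exclude them. The sketch below uses the allow_nan and allow_infinity arguments of st.floats(); the test name and the property it checks are illustrative only.

```python
import math

import hypothesis.strategies as st
from hypothesis import given

# Exclude NaN and both infinities at the strategy level rather than
# discarding them inside the test body.
finite_floats = st.floats(allow_nan=False, allow_infinity=False)


@given(values=st.lists(finite_floats, min_size=1))
def test_only_finite_floats(values):
    # Every drawn value should be an ordinary, finite float.
    assert all(math.isfinite(value) for value in values)
```

Whether to exclude them is a design decision: if the code under test can receive NaN or inf in production, it is usually better to leave them in and let the test fail.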

Now if it were simply generating text or numbers from an inexhaustible source of numbers or strings, it wouldn’t catch as many errors as it actually does. The reason for that is that each test you write is subjected to a battery of examples drawn from the strategies you’ve designed. If a test case fails, it’s put aside and tested again but with a reduced subset of inputs, if possible. In Hypothesis it’s known as shrinking the search space to try and find the smallest possible result that will cause your code to fail. So instead of a 10,000-length string, if it can find one that’s only 3 or 4, it will try to show that to you instead.
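To make the idea of shrinking concrete, here is a toy sketch applied to the naive CSV functions. It is not Hypothesis’s actual shrinker, which is far more sophisticated; the helper names (roundtrip_fails, smaller_candidates, toy_shrink) are made up for this illustration. It greedily tries smaller inputs and keeps any that still fail.

```python
def naive_write_csv_row(fields):
    return ",".join(f'"{field}"' for field in fields)


def naive_read_csv_row(row):
    return [field[1:-1] for field in row.split(",")]


def roundtrip_fails(fields):
    return naive_read_csv_row(naive_write_csv_row(fields)) != fields


def smaller_candidates(fields):
    # Try dropping a whole field first (but keep at least one field) ...
    if len(fields) > 1:
        for i in range(len(fields)):
            yield fields[:i] + fields[i + 1:]
    # ... then try dropping a single character from a single field.
    for i, field in enumerate(fields):
        for j in range(len(field)):
            yield fields[:i] + [field[:j] + field[j + 1:]] + fields[i + 1:]


def toy_shrink(fields, fails):
    current = fields
    improved = True
    while improved:
        improved = False
        for candidate in smaller_candidates(current):
            if fails(candidate):
                # Keep the smaller failing input and start over from it.
                current = candidate
                improved = True
                break
    return current


shrunk = toy_shrink(["Hello, World", "foo", "a,b"], roundtrip_fails)
print(shrunk)
```

Starting from a three-field failing input, the loop ends on a one-field list holding a single comma, the same minimal example Hypothesis reported.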

Filtering and Mapping Strategies

You can tell Hypothesis to filter or map the examples it draws to further reduce them if the strategy does not meet your requirements:

>>> st.integers().filter(lambda num: num > 0 and num % 8 == 0).example()

Here I ask for integers where the number is greater than 0 and is evenly divisible by 8. Hypothesis will then attempt to generate examples that meet the constraints you have imposed on it.

You can also map, which works in much the same way as filter. Here I’m asking for lowercase ASCII and then uppercasing them:

>>> import string
>>> st.text(alphabet=string.ascii_lowercase, min_size=5).map(lambda x: x.upper()).example()

Having said that, using either when you don’t have to (I could have asked for uppercase ASCII characters to begin with) is likely to result in slower strategies.
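As one possible alternative to the divisible-by-8 filter, the same values can be built constructively: ask for positive integers with min_value and multiply each by 8. This is a sketch; the names positive_multiples_of_8 and test_positive_multiple_of_8 are illustrative.

```python
import hypothesis.strategies as st
from hypothesis import given

# Every drawn integer is used: nothing is generated only to be rejected,
# which is what happens to seven out of eight positive integers under
# the filter-based version.
positive_multiples_of_8 = st.integers(min_value=1).map(lambda n: n * 8)


@given(number=positive_multiples_of_8)
def test_positive_multiple_of_8(number):
    assert number > 0 and number % 8 == 0
```

The map call here is cheap; it is the rejected draws of a restrictive filter that tend to slow a strategy down.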

A third option, flatmap, lets you build strategies from strategies; but that deserves closer scrutiny, so I’ll talk about it later.

Composing Strategies

You can tell Hypothesis to pick one of a number of strategies by composing strategies with | or st.one_of():

>>> st.lists(st.none() | st.floats(), min_size=3).example()
[2.00001, None, 1.1754943508222875e-38]

An essential feature when you have to draw from multiple sources of examples for a single data point.

Constraints & Satisfiability

When you ask Hypothesis to draw an example it takes into account the constraints you may have imposed on it: only positive integers; only lists of numbers that add up to exactly 100; any filter() calls you may have applied; and so on. Those are constraints. You’re taking something that was once unbounded (with respect to the strategy you’re drawing an example from, that is) and introducing additional limitations that constrain the possible range of values it can give you.

But consider what happens if I pass filters that will yield nothing at all:

>>> st.integers().filter(lambda num: num > 0).filter(lambda num: num < 0).example()
Unsatisfiable: Unable to satisfy assumptions of example_generating_inner_function

At some point Hypothesis will give up and declare it cannot find anything that satisfies that strategy and its constraints.

Make sure your strategies are satisfiable

Hypothesis gives up after a while if it’s not able to draw an example. Usually that indicates a contradiction in the constraints you’ve placed that makes it hard or impossible to draw examples. In the example above, I asked for numbers that are simultaneously below zero and greater than zero, which is an impossible request.

Writing Reusable Strategies with Functions

As you can see, the strategies are simple functions, and they behave as such. You can therefore refactor each strategy into reusable patterns:

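import string

import hypothesis.strategies as st
from hypothesis import given

def generate_westernized_name(min_size=2):
    # ASCII letters only, capitalized like a name
    return (st.text(alphabet=string.ascii_letters, min_size=min_size)
            .map(lambda name: name.capitalize()))

@given(first_name=generate_westernized_name())
def test_create_customer(first_name):
    # ... etc ...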

The benefit of this approach is that if you discover edge cases that Hypothesis does not account for, you can update the pattern in one place and observe its effects on your code. It’s functional and composable.

One caveat of this approach is that you cannot draw examples yourself, by calling example(), and expect Hypothesis to behave correctly. So I don’t recommend you call example() on a strategy only to pass the result into another strategy.

For that, you want the @composite decorator.

@composite: Imperative Strategies

If the previous approach is unabashedly functional in nature, this approach is imperative.

The @composite decorator lets you write imperative Python code instead. If you cannot easily structure your strategy with the built-in ones, or if you require more granular control over the values it emits, you should consider the @composite strategy.

Instead of returning a compound strategy object like you would above, you instead draw examples using a special function you’re given access to in the decorated function.

from hypothesis.strategies import composite

@composite
def generate_full_name(draw):
    # draw() is supplied by @composite; use it instead of example()
    first_name = draw(generate_westernized_name())
    last_name = draw(generate_westernized_name())
    return (last_name, first_name)

This example draws two randomized names and returns them as a tuple:

>>> generate_full_name().example()
('Mbvn', 'Wfyybmlc')

Note that the @composite decorator passes in a special draw callable that you must use to draw samples from. You cannot – well, you can, but you shouldn’t – use the example() method on the strategy object you get back. Doing so will break Hypothesis’s ability to synthesize test cases properly.

Because the code is imperative you’re free to modify the drawn examples to your liking. But what if you’re given an example you don’t like or one that breaks a known invariant you don’t wish to test for? For that you can use the assume() function to state the assumptions that Hypothesis must meet if you try to draw an example from generate_full_name.

Let’s say that first_name and last_name must not be equal:

from hypothesis import assume

@composite
def generate_full_name(draw):
    first_name = draw(generate_westernized_name())
    last_name = draw(generate_westernized_name())
    # Discard any draw where both names are identical
    assume(first_name != last_name)
    return (last_name, first_name)

Like the assert statement in Python, the assume() function teaches Hypothesis what is, and is not, a valid example. You can use this to great effect to generate complex compound strategies.

I recommend you observe the following rules of thumb if you write imperative strategies with @composite:

Avoid filtering drawn examples yourself

If you want to draw a succession of examples to initialize, say, a list or a custom object with values that meet certain criteria, use filter where possible, and use assume to teach Hypothesis why the values you drew and subsequently discarded weren’t any good.

The example above uses assume() to teach Hypothesis that first_name and last_name must not be equal.

Separate functional and non-functional strategies

If you can put your functional strategies in separate functions, you should. It encourages code re-use and if your strategies are failing (or not generating the sort of examples you’d expect) you can inspect each strategy in turn. Large nested strategies are harder to untangle and harder still to reason about.

Only write @composite strategies if you must

If you can express your requirements with filter and map or the builtin constraints (like min_size or max_size), you should. Imperative strategies that use assume may take more time to converge on a valid example.
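As an illustration of that last rule, the "names must differ" requirement from earlier can plausibly be expressed without @composite or assume at all, using the unique argument of st.lists(). This is a sketch, not the only way to do it; generate_distinct_full_name and test_names_differ are hypothetical names.

```python
import string

import hypothesis.strategies as st
from hypothesis import given


def generate_westernized_name(min_size=2):
    return st.text(alphabet=string.ascii_letters, min_size=min_size).map(
        lambda name: name.capitalize()
    )


def generate_distinct_full_name():
    # A list of exactly two names; unique=True guarantees they differ.
    # The final map turns the two-element list into a tuple.
    return st.lists(
        generate_westernized_name(), min_size=2, max_size=2, unique=True
    ).map(tuple)


@given(full_name=generate_distinct_full_name())
def test_names_differ(full_name):
    last_name, first_name = full_name
    assert last_name != first_name
```

Here the uniqueness constraint is part of the strategy itself, so Hypothesis does not have to discover it through rejected examples.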

@example: Explicitly Testing Certain Values

Occasionally you’ll come across a handful of cases that either fail or used to fail, and you want to ensure that Hypothesis does not forget to test them, or to indicate to yourself or your fellow developers that certain values are known to cause issues and should be tested explicitly.

The @example decorator does just that:

from hypothesis import example

@given(fields=st.lists(st.text(), min_size=1, max_size=10))
@example(fields=[","])
def test_read_write_csv_hypothesis(fields):
    # ... etc ...

You can add as many as you like.

Hypothesis Example: Roman Numeral Converter

Let’s say I wanted to write a simple converter to and from Roman numerals.

import hypothesis.strategies as st
from hypothesis import given

SYMBOLS = {
    "I": 1,
    "V": 5,
    "X": 10,
    "L": 50,
    "C": 100,
    "D": 500,
    "M": 1000,
}

def to_roman(number: int):
    numerals = []
    while number >= 1:
        for symbol, value in SYMBOLS.items():
            if value <= number:
                numerals.append(symbol)
                number -= value
                break
    return "".join(numerals)

@given(number=st.integers())
def test_to_roman_numeral_simple(number):
    numeral = to_roman(number)
    assert set(numeral) and set(numeral) <= set(SYMBOLS.keys())

Here I’m collecting Roman numerals into numerals, one at a time, by looping over SYMBOLS of valid numerals, subtracting the value of the symbol from number, until the while loop’s condition (number >= 1) is False.

The test is also simple and serves as a smoke test. I generate a random integer and convert it to Roman numerals with to_roman. When it’s all said and done I turn the string of numerals into a set and check that all members of the set are legal Roman numerals.

Now if I run pytest on it, it seems to hang. But thanks to Hypothesis’s debug mode I can inspect why:

$ pytest -s --hypothesis-verbosity=debug test_roman.py::test_to_roman_numeral_simple
Trying example: test_to_roman_numeral_simple(

Ah. Instead of testing with tiny numbers like a human would ordinarily do, it used a fantastically large one… which is altogether slow.

OK, so there’s at least one gotcha; it’s not really a bug, but it’s something to think about: limiting the maximum value. I’m only going to limit the test, but it would be reasonable to limit it in the code also.

After I change the max_value to something sensible, like st.integers(max_value=5000), the test fails with another error:

$ pytest test_roman.py::test_to_roman_numeral_simple
Falsifying example: test_to_roman_numeral_simple(
    number=0,
)

It seems our code’s not able to handle the number 0! Which… is correct. The Romans didn’t really use the number zero as we would today; that invention came later, so they had a bunch of workarounds to deal with the absence of something. But that’s neither here nor there in our example. Let’s instead set min_value=1 also, as there is no support for negative numbers either:

$ pytest test_roman.py::test_to_roman_numeral_simple
1 passed in 0.09s

OK… not bad. We’ve proven that given a random assortment of numbers between our defined range of values that, indeed, we get something resembling Roman numerals.

One of the hardest things about Hypothesis is framing questions to your testable code in a way that tests its properties but without you, the developer, knowing the answer (necessarily) beforehand. So one simple way to test that there’s at least something semi-coherent coming out of our to_roman function is to check that it can generate the very numerals we defined in SYMBOLS from before:

@given(numeral_value=st.sampled_from(tuple(SYMBOLS.items())))
def test_to_roman_numeral_sampled(numeral_value):
    numeral, value = numeral_value
    assert to_roman(value) == numeral

Here I’m sampling from a tuple of the SYMBOLS from earlier. The sampling algorithm will decide what values it wants to give us; all we care about is that we are given examples like ("I", 1) or ("V", 5) to compare against.

So let’s run pytest again:

$ pytest test_roman.py
Falsifying example: test_to_roman_numeral_sampled(
    numeral_value=('V', 5),
FAILED test.py::test_to_roman_numeral_sampled -
  AssertionError: assert 'IIIII' == 'V'

Oops. The Roman numeral V is equal to 5 and yet we get IIIII? A closer examination reveals that, indeed, the code only yields sequences of I equal to the number we pass it. There’s a logic error in our code.

In the example above I loop over the elements in the SYMBOLS dictionary but as it’s ordered the first element is always I. And as the smallest representable value is 1, we end up with just that answer. It’s technically correct as you can count with just I but it’s not very useful.
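The effect of iteration order can be seen in isolation. The helper first_fit below is a hypothetical stand-in for the inner loop of to_roman: it returns the first symbol whose value fits into the number, given the symbols in a particular order.

```python
SYMBOLS = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}


def first_fit(number, pairs):
    """Return the first symbol in pairs whose value fits into number."""
    return next(symbol for symbol, value in pairs if value <= number)


# Dictionaries preserve insertion order, so "I" always comes first.
insertion_order = list(SYMBOLS.items())

# Sorting by value, largest first, lets the biggest fitting symbol win.
largest_first = sorted(SYMBOLS.items(), key=lambda pair: pair[1], reverse=True)

print(first_fit(5, insertion_order))
print(first_fit(5, largest_first))
```

In insertion order every positive number is matched by I first; with the largest value first, 5 is matched by V.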

Fixing it is easy though:

import operator

def to_roman(number: int):
    numerals = []
    g = operator.itemgetter(1)
    # Try the largest symbols first
    ordered_numerals = sorted(SYMBOLS.items(), key=g, reverse=True)
    while number >= 1:
        for symbol, value in ordered_numerals:
            if value <= number:
                numerals.append(symbol)
                number -= value
                break
    return "".join(numerals)

Rerunning the test yields a pass. Now we know that, at the very least, our to_roman function is capable of mapping numbers that are equal to any symbol in SYMBOLS.

Now the litmus test is taking the numeral we’re given and making sense of it. So let’s write a function that converts a Roman numeral back into decimal:

def from_roman(numeral: str):
    carry = 0
    numerals = list(numeral)
    while numerals:
        symbol = numerals.pop(0)
        value = SYMBOLS[symbol]
        carry += value
    return carry

@given(number=st.integers(min_value=1, max_value=5000))
def test_roman_numeral(number):
    numeral = to_roman(number)
    value = from_roman(numeral)
    assert number == value

Like to_roman we walk through each character, get the numeral’s numeric value, and add it to the running total. The test is a simple roundtrip test as to_roman has an inverse function from_roman (and vice versa) such that:

assert to_roman(from_roman('V')) == 'V'
assert from_roman(to_roman(5)) == 5

By the way …

Invertible functions are easier to test because you can compare the output of one against the input of another and check if it yields the original value. Not every function has an inverse, though.

Running the test yields a pass:

$ pytest test_roman.py::test_roman_numeral
1 passed in 0.09s

So now we’re in a pretty good place. There’s a slight oversight in our Roman numeral converters, though: they don’t respect the subtraction rule for some of the numerals. For instance VI is 6; but IV is 4. The value XI is 11; and IX is 9. Only some (sigh) numerals exhibit this property.

So let’s write another test. This time it’ll fail as we’ve yet to write the modified code. Luckily we know the subtractive numerals we must accommodate:

SUBTRACTIVE_SYMBOLS = {
    "IV": 4,
    "IX": 9,
    "XL": 40,
    "XC": 90,
    "CD": 400,
    "CM": 900,
}

@given(numeral_value=st.sampled_from(tuple(SUBTRACTIVE_SYMBOLS.items())))
def test_roman_subtractive_rule(numeral_value):
    numeral, value = numeral_value
    assert from_roman(numeral) == value
    assert to_roman(value) == numeral

Pretty simple test. Check that certain numerals yield the value, and that the values yield the right numeral.

With an extensive test suite we should feel fairly confident making changes to the code. If we break something, one of our preexisting tests will fail.

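def from_roman(numeral: str):
    carry = 0
    numerals = list(numeral)
    while numerals:
        symbol = numerals.pop(0)
        value = SYMBOLS[symbol]
        try:
            # Read ahead: does this symbol plus the next form a subtractive pair?
            value = SUBTRACTIVE_SYMBOLS[symbol + numerals[0]]
            # It does, so consume the second symbol of the pair as well
            numerals.pop(0)
        except (IndexError, KeyError):
            pass
        carry += value
    return carry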

The rules around which numerals are subtractive are rather subjective. The SUBTRACTIVE_SYMBOLS dictionary holds the most common ones. So all we need to do is read ahead in the numerals list to see if there exists a two-character numeral that has a prescribed value, and if so we use that instead of the usual value.

def to_roman(number: int):
    numerals = []
    g = operator.itemgetter(1)
    ordered_numerals = sorted(
        (SYMBOLS | SUBTRACTIVE_SYMBOLS).items(),
        key=g,
        reverse=True,
    )
    while number >= 1:
        for symbol, value in ordered_numerals:
            if value <= number:
                numerals.append(symbol)
                number -= value
                break
    return "".join(numerals)

The to_roman change is simple. A union of the two numeral symbol dictionaries is all it takes. The code already understands how to turn numbers into numerals; we just added a few more.

By the way …

This method requires Python 3.9 or later, which added the | operator for merging dictionaries.

If done right, running the tests should yield a pass:

$ pytest test_roman.py
5 passed in 0.15s
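Because the tested range is small, the round trip can also be checked exhaustively as a sanity check alongside the sampled tests. The sketch below assumes the finished converters as described above; ALL_SYMBOLS and first_roundtrip_failure are names introduced here for illustration, and the dictionaries are merged with {**a, **b} so the snippet also runs on Python versions older than 3.9.

```python
SYMBOLS = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}
SUBTRACTIVE_SYMBOLS = {"IV": 4, "IX": 9, "XL": 40, "XC": 90, "CD": 400, "CM": 900}

# Equivalent to SYMBOLS | SUBTRACTIVE_SYMBOLS on Python 3.9 and later.
ALL_SYMBOLS = {**SYMBOLS, **SUBTRACTIVE_SYMBOLS}


def to_roman(number):
    numerals = []
    ordered = sorted(ALL_SYMBOLS.items(), key=lambda pair: pair[1], reverse=True)
    while number >= 1:
        for symbol, value in ordered:
            if value <= number:
                numerals.append(symbol)
                number -= value
                break
    return "".join(numerals)


def from_roman(numeral):
    carry = 0
    numerals = list(numeral)
    while numerals:
        symbol = numerals.pop(0)
        value = SYMBOLS[symbol]
        try:
            # Prefer a two-character subtractive numeral when one matches.
            value = SUBTRACTIVE_SYMBOLS[symbol + numerals[0]]
            numerals.pop(0)
        except (IndexError, KeyError):
            pass
        carry += value
    return carry


def first_roundtrip_failure(limit=5000):
    """Return the first number in 1..limit that does not survive, or None."""
    for number in range(1, limit + 1):
        if from_roman(to_roman(number)) != number:
            return number
    return None


print(to_roman(1994))
print(first_roundtrip_failure())
```

An exhaustive loop like this only works because the input space is tiny; for larger or unbounded inputs, sampling with Hypothesis remains the practical option.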

And that’s it. We now have useful tests and a functional Roman numeral converter that converts to and from with ease. But one thing we didn’t do is create a strategy that generates Roman numerals using st.text(). A custom composite strategy to generate both valid and invalid Roman numerals to test the ruggedness of our converter is left as an exercise to you.

In the next part of this course we’ll look at more advanced testing strategies.


Hypothesis is a capable test generator

Unlike a tool like faker that generates realistic-looking test data for fixtures or demos, Hypothesis is a property-based tester. It uses heuristics and clever algorithms to find inputs that break your code.

Hypothesis assumes you understand the problem domain you want to model

When you test a function that has no inverse to compare the result against (unlike our Roman numeral converter, which works both ways), you often have to approach your code as though it were a black box where you relinquish control of the inputs and outputs. That is harder, but makes for less brittle code.

Hypothesis augments your existing test suite

It’s perfectly fine to mix and match tests. Hypothesis is useful for flushing out edge cases you would never think of. Combine it with known inputs and outputs to jump start your testing for the first 80%, and augment it with Hypothesis to catch the remaining 20%.
