Python Pattern Matching Examples: ETL and Dataclasses

Author Mickey Petersen

In Mastering Structural Pattern Matching I walked you through the theory of Structural Pattern Matching, so now it’s time to apply that knowledge and build something practical.

Let’s say you need to process data from one system (a JSON-based REST API) into another (a CSV file for use in Excel). A common task. Extracting, Transforming, and Loading (ETL) data is one of the things Python does especially well, and with pattern matching you can simplify and organize your business logic in such a way that it remains maintainable and understandable.

Let’s get some test data. For this you’ll need the requests library.

>>> import requests
>>> resp = requests.get('https://demo.inspiredpython.com/invoices/')
>>> assert resp.ok
>>> data = resp.json()
>>> data[0]
{'recipient': {'company': 'Trommler',
               'address': 'Annette-Döring-Allee 5\n01231 Grafenau',
               'country_code': 'DE'},
 'invoice_id': 15134,
 'currency': 'JPY',
 'amount': 945.57,
 'sku': 'PROPANE-ACCESSORIES'}

Objectives

The data – feel free to use the demo URL provided in the example above – is a list of invoices for our fictional company that sells propane (and propane accessories.)

As part of any serious ETL process, you must consider the quality of the data. For this, I want to flag entries that may require human intervention:

Find mismatched payment currencies and country codes. For instance, the example above lists the payment currency as JPY but the country code’s German.
Ensure the invoice IDs are unique and that they are all integers less than 50000.
Map each invoice to a dedicated Invoice dataclass, and each invoice recipient to a Company dataclass.

Then,

Write the quality-assured invoices to a CSV file.
Flag everything that fails those checks and put it in a separate CSV file for manual review.

An important note though.

In a real application there would be a validation layer that checks the input data for obvious data errors, like integers in a string field, or missing fields. For brevity I won’t include that part, but you should use a package like marshmallow or pydantic to formalize the contract you (the consumer) have with the data producer(s) you interface with to catch (and act on) these mistakes.

For the sake of argument, let’s assume the input data meets these basic standards. It is not, however, the job of a library like marshmallow to validate that, say, the country code and currency make sense together.

Getting the API Data

Let’s start by formalizing the extraction of the data I did earlier:

import requests


def get_invoices(url):
    response = requests.get(url)
    # Raise if the request fails for any reason.
    response.raise_for_status()
    return response.json()

Here I let requests raise an exception if the server responds with a 4xx or 5xx error status. I also naively assume the response body is JSON, as it’s just a demonstration.

Defining the dataclasses

Now let’s define the dataclasses. Two will suffice: a Company dataclass to hold details about the invoice recipient; and an Invoice dataclass that’ll reference the recipient company and the invoice details themselves:

from dataclasses import dataclass
from typing import Optional


@dataclass
class Company:

    company: str
    address: str
    country_code: str


@dataclass
class Invoice:

    invoice_id: int
    currency: str
    amount: float
    sku: str
    recipient: Optional[Company]

Each dataclass is the canonical representation of either a company or an invoice once the data is transformed from its source format.
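As a quick illustration of what these two classes give you, here is a minimal sketch that builds an invoice by hand from the sample record shown earlier. The class definitions are repeated only so the snippet runs on its own.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class Company:
    company: str
    address: str
    country_code: str


@dataclass
class Invoice:
    invoice_id: int
    currency: str
    amount: float
    sku: str
    recipient: Optional[Company]


recipient = Company(
    company="Trommler",
    address="Annette-Döring-Allee 5\n01231 Grafenau",
    country_code="DE",
)
invoice = Invoice(
    invoice_id=15134,
    currency="JPY",
    amount=945.57,
    sku="PROPANE-ACCESSORIES",
    recipient=recipient,
)

# Attribute access replaces the nested dictionary lookups.
print(invoice.recipient.country_code)

# asdict() turns nested dataclasses back into nested dictionaries.
invoice_dict = asdict(invoice)
print(invoice_dict["recipient"]["company"])
```

Note that dataclasses do not enforce the type annotations at runtime, which is one more reason to put a validation layer in front of them.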

Separating your concerns

One thing I want to do – to aid with testing – is separate the processing of the company from that of the invoice:

def process_raw_records(records):
    invoices = []
    for record in records:
        match record:
            case {"recipient": raw_recipient, **raw_invoice}:
                recipient = process_raw_recipient(raw_recipient)
                invoice = process_raw_invoice(raw_invoice)
                invoice.recipient = recipient
                invoices.append(invoice)
            case _:
                raise ValueError(f"Cannot parse structure {record}")
    return invoices

This function loops over every raw record in records. For each record it’ll attempt to match the structure of record against the declared pattern you see in the first case statement. The pattern I wrote is a bit diffuse, so let me explain why it looks the way it does.

I want to split the processing of the invoice and the recipient. To do this I declare a pattern that requires at least the key "recipient" and collects everything else, if there is anything else, into **raw_invoice. If the pattern does not match record, that case is skipped and the wildcard pattern _ matches instead, which raises a ValueError.

Recall that **something is the keyword notation in Python that usually expands a dictionary into key=value pairs for use in function calls or inside a dictionary. Here it means the literal opposite: collect key-value pairs and store them in the dictionary something.

The pattern matching engine is clever enough to understand that notation, and it neatly separates the logic that figures out what goes where to each respective function. That has a couple of benefits:

Separation of Concerns and Ease of Testability: I can test process_raw_recipient, process_raw_invoice and process_raw_records as a whole, or separately, to induce various test scenarios without having to awkwardly try and come up with a list of records that matches the set of behaviors I expect in my tests.
Each function is standalone and can be used for other things: You can invoke – and parse – both invoices and recipients separately. Imagine you had another API endpoint called /companies/ that you wanted to correlate the invoice recipients against. Now you can separately pull that data and seamlessly reuse the process_raw_recipient function.

Now let’s take a look at each processor.

def process_raw_recipient(raw_recipient):
    match raw_recipient:
        case {"company": company, "address": address, "country_code": country_code}:
            return Company(company=company, address=address, country_code=country_code)
        case _:
            raise ValueError(f"Cannot parse invoice recipient {raw_recipient}")


def process_raw_invoice(raw_invoice):
    match raw_invoice:
        case {
            "invoice_id": invoice_id,
            "currency": currency,
            "amount": amount,
            "sku": sku,
        }:
            return Invoice(
                invoice_id=invoice_id,
                currency=currency,
                amount=amount,
                sku=sku,
                recipient=None,
            )
        case _:
            raise ValueError(f"Cannot parse invoice {raw_invoice}")

These two functions each take raw dictionaries containing either an invoice recipient or the invoice itself.

Each respective case statement represents the declarative form of the dictionary I want to match. process_raw_recipient expects three keys: "company", "address" and "country_code".

In process_raw_invoice it’s the same situation but with different keys, of course, though I do specifically set recipient=None when I create the Invoice object. Why? Well, I don’t want this function to worry about the recipient or how it’s created:

The process_raw_invoice function should only process invoices

As far as that function’s concerned, it’s none of its business if there is a recipient or not.

I could make it call process_raw_recipient and assign the Company instance I get back, but then I’d tightly couple the parsing of an invoice record to that of a company.

The process_raw_records function is the controller

Meaning, it is responsible for looping over each raw record; determining what it is; and correctly assembling the final form that we want. It’s very likely that function would grow over time to handle more things: remittance advice, purchase orders, etc.

With that out of the way, the basic extraction and most of the transformation is complete. Running the code works fine, too:

>>> for result in process_raw_records(get_invoices("https://demo.inspiredpython.com/invoices/")):
...     print(result)
Invoice(invoice_id=19757, currency='USD', amount=692.3, sku='PROPANE-ACCESSORIES',
        recipient=Company(company='Rosemann Freudenberger GmbH & Co. KGaA',
                          address='Eberthweg 56\n30431 Artern',
                          country_code='DE'))
 # ... etc ...

Implementing the Quality Assurance Rules

Now that leaves the final parts of the transformation and loading. Earlier I described a few business rules I want to implement to quality-assure the data. I could do it with just the dictionaries and that would be fine in this example, but if you’re building something like this yourself, you are probably dealing with data that’s far more complex. Having a few simple, structured objects that you can stick properties and other helper methods on makes it a lot easier.

Luckily, using dataclasses does not impair our ability to use pattern matching. So let’s implement the first business rule:

Finding mismatched currencies and country codes

So let’s say I want to flag certain country code and currency combinations for human review in case someone in the accounting department messed up and picked the wrong currency field by mistake. That happens more often than you think.

def validate_currency(invoice: Invoice):
    match invoice:
        case Invoice(currency=currency, recipient=Company(country_code=country_code)):
            match (currency, country_code):
                case ("USD" | "GBP" | "EUR", _):
                    return True
                case ("JPY", "JP"):
                    return True
                case ("JPY", _):
                    return False
                case _:
                    raise ValueError(
                        f"No validation rule matches {(currency, country_code)}"
                    )
        case _:
            raise ValueError(f"Cannot parse structure {invoice}")

The validate_currency function takes a single invoice and returns True or False when it can determine whether the currency is valid; it raises a ValueError if no rule applies or if the invoice cannot be parsed.

By the way …

Remember that you declare a pattern in a case statement. Python works out the nitty-gritty of how to match the subject against the pattern for you. In this case Python does not create instances of Invoice or Company; it checks that the subject is an instance of the class and then reads the named attributes from it to decide whether the pattern matches.

The really neat thing about pattern matching in Python is the ability to pick out attributes from object structures like the code above does. I only specify the things I want to pattern match, and because you can nest structures you are free to specify the full “contract” that your code must have with the data it requires.

Right, so if there’s a match – i.e., we pass an Invoice object with a Company in the recipient attribute – then we can proceed to the actual validation routine.

With the two bound names currency and country_code I fashion them into a tuple for no other reason than to make it easier for us, the humans, to read the intent of the code. I could just as easily turn it into a dictionary or some other structure — but a tuple is nice and easy to read.

The case statements capture the actual business rules and, I must say, in a very clean and readable manner. Let’s look at them piecemeal.

case ("USD" | "GBP" | "EUR", _):
    return True

This rule matches any tuple where the currency part of the tuple is one of "USD", "GBP", or "EUR". The second part of the tuple, the country_code, is _ indicating a wildcard pattern — meaning, it does not matter what its value is. It could be anything.

From our fictional business’s perspective the rule means that if you denominated your invoice in any of those three currencies then it does not matter what the recipient’s country is: a lot of multinationals denominate their invoices in one of those three, so the code returns True indicating it’s valid.

The next two rules relate to the Japanese Yen specifically:

case ("JPY", "JP"):
    return True
case ("JPY", _):
    return False

The first declares that if you’re using Japanese Yen but paying a Japanese company then that’s sensible as Japanese companies would probably prefer to be paid in their own currency. However, if that is not the case, the first case statement fails to match and the second one matches anything with the wildcard _, which then returns False, indicating the validation check fails.

case statements are tested in the order you wrote them in. Check for the most explicit and specific patterns first, and put the more generic “fallback” cases at the end. Ask yourself what happens if you invert the order of the two case statements above?

Catching duplicate Invoice IDs

The second and final business rule is checking for duplicate invoice IDs. Another pernicious issue that can cause total mayhem if you’re not careful.

MAX_INVOICE_ID = 50000


def validate_invoice_id(invoice: Invoice, known_invoice_ids):
    match invoice:
        case Invoice(
            invoice_id=int() as invoice_id
        ) if invoice_id < MAX_INVOICE_ID and invoice_id not in known_invoice_ids:
            known_invoice_ids.add(invoice_id)
            return True
        case Invoice(invoice_id=_):
            return False
        case _:
            raise ValueError(f"Cannot parse structure {invoice}")

Like the previous business rule, I match just the attributes I care about. Here it’s invoice_id. But I also assert that the named binding must be an integer by writing int() as invoice_id. Python will do some basic type checking to ensure that, indeed, it’s an integer, as our business rule prescribes. Additionally, I added a guard to check that the invoice ID is less than the maximum we can support, and that we haven’t seen it before.

I have opted to make it possible to supply an existing set of known invoice IDs. That is particularly useful, say, if you have a live system full of invoice IDs you want to check against also.

If that case statement matches, we make a note of the invoice ID by adding it to the set of known IDs and return True.

If the rule fails but there’s still an attribute called invoice_id, we simply return False to flag it for review by a human later.

Putting it all together

import csv
from dataclasses import asdict


def retrieve_invoices(url, known_ids=None):
    if known_ids is None:
        known_ids = set()
    validated_invoices = []
    flagged_invoices = []
    for invoice in process_raw_records(get_invoices(url)):
        if not all(
            [validate_currency(invoice), validate_invoice_id(invoice, known_ids)]
        ):
            flagged_invoices.append(invoice)
        else:
            validated_invoices.append(invoice)
    return validated_invoices, flagged_invoices


def store_invoices(invoices, csv_file):
    fieldnames = [
        # Recipient Company
        "company",
        "address",
        "country_code",
        # Invoice
        "invoice_id",
        "currency",
        "amount",
        "sku",
    ]
    w = csv.DictWriter(csv_file, fieldnames=fieldnames, extrasaction="ignore")
    w.writeheader()
    w.writerows(
        [{**asdict(invoice), **asdict(invoice.recipient)} for invoice in invoices]
    )


def main():
    validated, flagged = retrieve_invoices("https://demo.inspiredpython.com/invoices/")
    # newline="" lets the csv module control the line endings itself.
    with open("validated.csv", "w", newline="") as f:
        store_invoices(validated, f)
    with open("flagged.csv", "w", newline="") as f:
        store_invoices(flagged, f)

All that’s left to do is to tie it all together. The retrieve_invoices function fetches the raw invoices and calls out to the processor code I wrote earlier. It also applies the business rules and, based on the outcome of those checks, separates the invoices into flagged_invoices or validated_invoices.

Finally, main stores the invoices in two distinct CSV files. Python’s dataclasses module comes with a handy asdict helper function that pulls the typed attributes out of the object into a dictionary again so the CSV writer module knows how to store the data. And that’s it.
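If you want to check the flattening step without touching the network or the file system, a sketch like the one below writes a single hand-made invoice to an in-memory buffer. The dataclasses are repeated so the snippet runs on its own, and io.StringIO stands in for a real file.

```python
import csv
import io
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class Company:
    company: str
    address: str
    country_code: str


@dataclass
class Invoice:
    invoice_id: int
    currency: str
    amount: float
    sku: str
    recipient: Optional[Company]


invoice = Invoice(
    invoice_id=15134,
    currency="JPY",
    amount=945.57,
    sku="PROPANE-ACCESSORIES",
    recipient=Company("Trommler", "Annette-Döring-Allee 5\n01231 Grafenau", "DE"),
)

# Merge the invoice fields with the recipient's fields. The merged row
# still contains a nested "recipient" dict from asdict(invoice).
row = {**asdict(invoice), **asdict(invoice.recipient)}

fieldnames = ["company", "address", "country_code",
              "invoice_id", "currency", "amount", "sku"]
buffer = io.StringIO()
# extrasaction="ignore" drops keys not in fieldnames, i.e. "recipient".
writer = csv.DictWriter(buffer, fieldnames=fieldnames, extrasaction="ignore")
writer.writeheader()
writer.writerow(row)
print(buffer.getvalue())
```

The address contains a line break, so the csv module wraps that field in quotes. Without extrasaction="ignore", DictWriter would raise a ValueError because of the leftover "recipient" key.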

Summary

Pattern Matching is a natural way of expressing the structure of data and extracting the information you want: As this demo project showed you, it’s easy to capture business rules that pertain to the structure of your data and extract the information you need from it at the same time. And it’s easy to add or amend rules.
Patterns are declarative: Like I mentioned in Mastering Structural Pattern Matching, it’s the most important concept to take away from all of this. Writing Python is imperative. You tell Python what to do and when. But with a pattern you declare the result you want and leave the thinking to Python. For instance, I did not write any existence checks in validate_currency to check if an invoice has a recipient at all! I leave that to Python so I can focus on writing the actual business logic.
