Python Pattern Matching Examples: ETL and Dataclasses
In Mastering Structural Pattern Matching I walked you through the theory of Structural Pattern Matching, so now it’s time to apply that knowledge and build something practical.
Let’s say you need to process data from one system (a JSON-based REST API) into another (a CSV file for use in Excel). A common task. Extracting, Transforming, and Loading (ETL) data is one of the things Python does especially well, and with pattern matching you can simplify and organize your business logic in such a way that it remains maintainable and understandable.
Let’s get some test data. For this you’ll need the
The data – feel free to use the demo URL provided in the example above – is a list of invoices for our fictional company that sells propane (and propane accessories.)
As part of any serious ETL process, you must consider the quality of the data. For this, I want to flag entries that may require human intervention:
Find mismatched payment currencies and country codes. For instance, the example above lists the payment currency as JPY but the country code is German.
Ensure the invoice IDs are unique and that they are all integers less than
Map each invoice to a dedicated Invoice dataclass, and each invoice recipient to a Company dataclass.
Write the quality-assured invoices to a CSV file.
Everything that fails those checks is flagged and put in a different CSV file for manual review.
An important note though.
In a real application there would be a validation layer that checks the input data for obvious data errors, like integers in a string field, or missing fields. For brevity I won’t include that part, but you should use a package like pydantic to formalize the contract you (the consumer) have with the data producer(s) you interface with, so you can catch (and act on) these mistakes.
For the sake of argument, let’s assume the input data meets these basic standards. It is not the job of a library like marshmallow, though, to validate that, say, the country code and currency are correct.
Getting the API Data
Let’s start by formalizing the extraction of the data I did earlier:
Here I let requests raise an exception if the response is anything except a
200 OK from the server. I also naively assume the response body is JSON, as it’s just a demonstration.
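Here is a minimal sketch of that extraction step, assuming the `requests` package is installed. The function name `get_invoices`, the optional `session` parameter and the timeout are my own illustrative choices. Note that `raise_for_status()` raises for 4xx and 5xx responses, which is a close stand-in for "anything except a 200 OK" rather than an exact equivalent.

```python
def get_invoices(url, session=None):
    """Fetch the raw invoice records from `url` and return the decoded JSON."""
    if session is None:
        # Imported lazily so a stand-in session can be supplied in tests.
        import requests

        session = requests.Session()
    response = session.get(url, timeout=10)
    # Raises requests.HTTPError for 4xx and 5xx responses.
    response.raise_for_status()
    # Naively assume the body is JSON, as it's just a demonstration.
    return response.json()
```

Passing the session in makes the function easy to exercise without touching the network: any object with a compatible `get` method will do.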
Defining the dataclasses
Now let’s define the dataclasses. Two will suffice: a
Company dataclass to hold details about the invoice recipient; and an
Invoice dataclass that’ll reference the recipient company and the invoice details themselves:
Each dataclass is the canonical representation of either a company or an invoice once the data is transformed from its source format.
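As a rough sketch, the two dataclasses could look like this. Only `country_code`, `invoice_id`, `currency` and `recipient` are named in the text; the `name`, `address` and `amount` fields are assumptions I added to make the example feel complete.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Optional


@dataclass
class Company:
    name: str          # assumed field
    address: str       # assumed field
    country_code: str  # e.g. "DE"


@dataclass
class Invoice:
    invoice_id: int
    currency: str
    amount: Decimal    # assumed field
    # The recipient is optional so an invoice can be built before
    # its company has been processed.
    recipient: Optional[Company] = None


acme = Company(name="Acme GmbH", address="1 Example Str.", country_code="DE")
invoice = Invoice(invoice_id=1, currency="EUR", amount=Decimal("99.50"), recipient=acme)
```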
Separating your concerns
One thing I want to do – to aid with testing – is separate the processing of the company from that of the invoice:
This function loops over every raw record in
records. For each
record it’ll attempt to match the structure of
record against the declared pattern you see in the first
case statement. The pattern I wrote is a bit diffuse, so let me explain why it looks the way it does.
I want to split the processing of the invoice and the recipient. To do this I declare a pattern that must have at least the key "recipient", and that collects everything else – if there is anything else – into **raw_invoice. If the pattern does not match record, that case is skipped and the default pattern _ is triggered instead, which raises an exception.
**something is the keyword notation in Python that usually expands a dictionary into
key=value pairs for use in function calls or inside a dictionary. Here it means the literal opposite: collect key-value pairs and store them in the dictionary
The pattern matching engine is clever enough to understand that notation, and it neatly separates the logic that figures out what goes where to each respective function. That has a couple of benefits:
- Separation of Concerns and Ease of Testability
I can test process_raw_records as a whole, or each processor separately, to induce various test scenarios without having to awkwardly come up with a list of records that matches the set of behaviors I expect in my tests.
- Each function is standalone and can be used for other things
You can invoke – and parse – both invoices and recipients separately. Imagine you had another API endpoint called /companies/ that you wanted to correlate the invoice recipients against. Now you can separately pull that data and seamlessly reuse the
Now let’s take a look at each processor.
These two functions each take raw dictionaries containing either an invoice recipient or the invoice itself.
case statement represents the declarative form of the dictionary I want to match.
process_raw_recipient expects three keys:
In process_raw_invoice it’s the same situation but with different keys, of course, though I do specifically set recipient=None when I create the Invoice object. Why? Well, I don’t want this function to worry about the recipient or how it’s created:
- The process_raw_invoice function should only process invoices
As far as that function’s concerned, it’s none of its business if there is a recipient or not.
I could make it call process_raw_recipient and assign the Company instance I get back, but then I’d tightly couple the parsing of an invoice record to that of a company.
- The process_raw_records function is the controller
Meaning, it is responsible for looping over each raw record; determining what it is; and correctly assembling the final form that we want. It’s very likely that function would grow over time to handle more things: remittance advice, purchase orders, etc.
With that out of the way, the basic extraction and most of the transformation is complete. Running the code works fine, too:
Implementing the Quality Assurance Rules
Now that leaves the final parts of the transformation and loading. Earlier I described a few business rules I want to implement to quality-assure the data. I could do it with just the dictionaries and that would be fine in this example, but if you’re building something like this yourself, you are probably dealing with data that’s far more complex. Having a few simple, structured objects that you can stick properties and other helper methods on makes it a lot easier.
Luckily, using dataclasses does not impair our ability to use pattern matching. So let’s implement the first business rule:
Finding mismatched currencies and country codes
So let’s say I want to flag certain country code and currency combinations for human review in case someone in the accounting department messed up and picked the wrong currency field by mistake. That happens more often than you think.
The validate_currency function takes a single invoice and returns either True or False, if it is able to infer whether the currency is valid; or it raises ValueError if there was a general error.
Remember that you declare a pattern in a
case statement. Python works out the nitty-gritty of how to match the subject against the pattern for you. Python, in this case, does not create instances of
Company but instead interrogates their internal structure to determine how to match them against the subject.
The really neat thing about pattern matching in Python is the ability to pick out attributes from object structures like the code above does. I only specify the things I want to pattern match, and because you can nest structures you are free to specify the full “contract” that your code must have with the data it requires.
Right, so if there’s a match – i.e., we pass an
Invoice object with a
Company in the
recipient attribute – then we can proceed to the actual validation routine.
I fashion the two bound names, currency and country_code, into a tuple for no other reason than to make it easier for us, the humans, to read the intent of the code. I could just as easily turn them into a dictionary or some other structure – but a tuple is nice and easy to read.
case statements capture the actual business rules and, I must say, in a very clean and readable manner. Let’s look at them piecemeal.
This rule matches any tuple where the
currency part of the tuple is one of "EUR". The second part of the tuple is _, the wildcard pattern, meaning it does not matter what its value is. It could be anything.
From our fictional business’s perspective the rule means that if you denominated your invoice in any of those three currencies then it does not matter what the recipient’s country is: a lot of multinationals denominate their invoices in one of those three, so the code returns True, indicating it’s valid.
The next two rules relate to the Japanese Yen specifically:
The first declares that if you’re using Japanese Yen but paying a Japanese company then that’s sensible as Japanese companies would probably prefer to be paid in their own currency. However, if that is not the case, the first case statement fails to match and the second one matches anything with the wildcard
_, which then returns
False, indicating the validation check fails.
case statements are tested in the order you wrote them in. Check for the most explicit and specific patterns first, and put the more generic “fallback” cases at the end. Ask yourself what happens if you invert the order of the two
case statements above?
Catching duplicate Invoice IDs
The second and final business rule is checking for duplicate invoice IDs. Another pernicious issue that can cause total mayhem if you’re not careful.
Like the previous business rule, I match just the attributes I care about. Here it’s
invoice_id. But I also assert that the named binding must be an integer by writing
int() as invoice_id. Python will do some basic type checking to ensure that, indeed, it’s an integer, as our business rule prescribes. Additionally, I added a guard to check that the invoice ID is less than the maximum we can support, and that we haven’t seen it before.
I have opted to make it possible to supply an existing set of known invoice IDs. That is particularly useful, say, if you have a live system full of invoice IDs you want to check against also.
If the case statement matches, we make a note of the invoice ID by adding it to the set of known IDs and return True.
If the rule fails but there’s still an attribute called
invoice_id, we simply return
False to flag it for review by a human later.
Putting it all together
All that’s left to do is to tie it all together. The
retrieve_invoices function fetches the raw invoices and calls out to the processor code I wrote earlier. It also applies the business rules and based on the outcome of those checks, it separates them into
Finally it stores the invoices into two distinct CSV files. Python’s
dataclasses module comes with a handy
asdict helper function that pulls the typed attributes out of the object into a dictionary again so the CSV writer module knows how to store the data. And that’s it.
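One wrinkle worth sketching: `asdict` recurses into nested dataclasses, so the recipient comes back as a nested dictionary, which a CSV row cannot hold directly. The sketch below flattens it first. The helper names, the column names and the trimmed-down dataclass fields are all assumptions, and it writes to an in-memory buffer instead of a real file.

```python
import csv
import io
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class Company:
    name: str
    country_code: str


@dataclass
class Invoice:
    invoice_id: int
    currency: str
    recipient: Optional[Company] = None


FIELDNAMES = ["invoice_id", "currency", "recipient_name", "recipient_country_code"]


def invoice_to_row(invoice):
    """Flatten an Invoice into a single-level dict for csv.DictWriter."""
    row = asdict(invoice)  # the nested Company becomes a nested dict
    recipient = row.pop("recipient") or {}
    for key, value in recipient.items():
        row[f"recipient_{key}"] = value
    return row


def write_invoices(invoices, fileobj):
    writer = csv.DictWriter(fileobj, fieldnames=FIELDNAMES)
    writer.writeheader()
    for invoice in invoices:
        # Columns missing from a row are written as empty strings.
        writer.writerow(invoice_to_row(invoice))


buffer = io.StringIO()
write_invoices([Invoice(1, "EUR", Company("Acme GmbH", "DE"))], buffer)
print(buffer.getvalue())
```

With real files you would call `write_invoices` twice, once for the valid invoices and once for the flagged ones, opening each file with `newline=""` as the csv module documentation recommends.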
- Pattern Matching is a natural way of expressing the structure of data and extracting the information you want
As this demo project showed you, it’s easy to capture business rules that pertain to the structure of your data and extract the information you need from it at the same time. And it’s easy to add or amend rules.
- Patterns are declarative
Like I mentioned in Mastering Structural Pattern Matching, it’s the most important concept to take away from all of this. Writing Python is imperative. You tell Python what to do and when. But with a pattern you declare the result you want and leave the thinking to Python. For instance, I did not write any existence checks in
validate_currency to check if an invoice has a recipient at all! I leave that to Python so I can focus on writing the actual business logic.