Python Pattern Matching Examples: Working with Paths and Files

Manipulating file and path strings is dreary work. It is a common activity, particularly in data science where the file structure may contain important semantic clues like a date or the source of the data. Contextualizing that information is usually done with a mixture of if statements and liberal use of pathlib’s Path or os.path, but the structural pattern matching feature in Python 3.10 is there to cut down on the tedium.

Consider a directory structure that looks a bit like this:

cpi/<country>/by-month/<yyyy-mm-dd>/<filename>.<ext>
cpi/<country>/by-quarter/<yyyy-qq>/<filename>.<ext>

Where cpi means Consumer Price Index; country is the ISO-3166 code for a country; yyyy-mm-dd is the ISO date for the particular month; yyyy-qq is the year and quarter; and filename is an arbitrary filename and ext is an extension.

Ordinarily, you’d just split the path and write some quick logic that picks out what you need, and that’ll work fine for simple things, but if you have to deal with dozens of variadic fields in the file path, that approach will not scale. So let’s look at a way that will scale using the match and case keywords.

Dispatching to the correct reader by country

The first consideration – as this is just an example – is separating the logic that parses the file paths from the logic that processes the files. The vast majority of “structured” data, like CPI indices, vary greatly by the body responsible for generating them — and there may well be more than one source of truth. So in the example above, the country field is something we cannot wish away or pretend will work everywhere.

Let’s flesh out a few skeleton functions that do the latter. I won’t cover the hypothetical parsing itself, but Python Pattern Matching Examples: ETL and Dataclasses lays out an example that shows you how you can.

from pathlib import Path
import datetime

def read_cpi_series_by_month(country_code: str, filepath: Path, observation_date: datetime.date):
    match country_code:
        case "GB":
            if observation_date < datetime.date(year=2000, month=1, day=1):
                return read_legacy_uk_cpi_series(filepath, observation_date)
            return read_uk_cpi_series(filepath, observation_date)
        case "NO" | "SE" | "DK" | "FI":
            return read_nordic_cpi_series(filepath, observation_date)
        # ... etc ...
        case _:
            raise ValueError(f'There is no valid CPI Series read for {country_code}')

This controller function takes as input a country_code; a filepath to the underlying data; and an observation_date. I’ve added a couple of examples to demonstrate what such a controller could look like. At this point I’m not interested in the file logic. It pays to think about the core of the application before I worry about that. Here there are a couple of key points:

Reading a time series file is a product of the country and the observation date

It’s possible (well, an ironclad certainty in real life!) the data format will change over time. Other complications could include determining the correct reader based on parts of the filename or extension – but more on that later – so there’s room for that, too.

Combining rules makes it easier to understand what is going on

Some countries may share the same data format, so I may as well combine them into one case statement to save on “cognitive load” for any future developer who may come across it. Adding or removing countries is thus also very easy.

I can still use if statements when it makes sense to do so

I could have made the if statement a guard by putting it in the case statement itself. I opted not to, but for complex rules you may want to do that, particularly if you have many rules that are similar but differ only slightly.

Fail immediately if there is no valid reader

I raise ValueError for brevity, but a custom exception would be better in a real application.

So that takes care of the controller that’ll read the contents of the file. Now let’s move up a layer and think about how to get the information out of our hypothetical directory structure.

Matching directory and file paths

Now, unfortunately, the pattern matching engine does not support complex in-string pattern matching like, say, regular expressions, so we’ll have to come up with another way of giving structured data to the pattern matching engine.

The two most obvious methods is os.path.split() and pathlib.Path. I prefer the latter (see Common Path Patterns for more information) as it’s much easier to reason about.

The Path class can split a filepath into the constituent parts that make up the full file path:

>>> Path('cpi/DK/by-month/2007-08-01/ts_cpi_by_month.xlsx').parts
('cpi', 'DK', 'by-month', '2007-08-01', 'ts_cpi_by_month.xlsx')

Which, to my eyes, looks like a very useful structure to pattern match against.

import re
def parse_ts_structure(filepath: str | Path):
    structure = Path(filepath).parts
    match structure:
        case ("cpi", country_code, "by-month", date, filename) if (
                len(country_code) == 2 and re.match(r'^\d{4}-\d{2}-\d{2}$', date)
        ):
            observation_date = parse_date(date)
            read_cpi_series_by_month(country_code, filepath, observation_date)
        case ("cpi", country_code, "by-quarter", date, filename) if (
                len(country_code) == 2 and re.match(r'^\d{4}-Q\d$', date)
        ):
            observation_date = parse_quarter_date(date)
            read_cpi_series_by_quarter( ... )
        case _:
            raise ValueError(f'Cannot match {structure}')

The function takes either a string or Path and turns it into a tuple of parts that should look a bit like this:

(<data source>, <iso country>, <frequency>, <observation date>, <filename>)

In each case statement I make a literal match against "cpi" because that is the only data source we (currently) support, but it’s easy to imagine that list growing very long indeed in a real application.

Unlike the previous example I added guards instead of regular if statements, and there is a good reason for that:

I am guarding the pattern I want to match against to ensure it has the basic structure I expect

Each of the two checks only validate that the structure is what I superficially want it to be:

  1. a country_code must be a two-digit ISO code for a country, but I do not care at that point in time whether it’s a legitimate country;

  2. and, I use a quick’n’dirty regular expression to ensure the date structure looks like an ISO date. Note, again, that I am not checking if the date is valid — only that it meets the prescribed YYYY-MM-DD (or YYYY-QN) format.

So, I could make them if statements inside each case block, but then I would have to raise exceptions if the either of the two checks fail. I can now – though I haven’t for brevity’s sake – check if the country_code that did pass the guard is, in actual fact, a real country or not. The same goes for the date: 9999-99-99 would pass the guard but not the parse_date function.

Summary

Pattern Matching is useful even for mundane activities

Dealing with files and paths is all too common, and pattern matching can cut down on the never-ending warren of if statements that inevitably follows

A lot of problems are simpler if you find a commonality or shared structure to them

Here the problem is a directory structure with a lot of context trapped in the directory names, but it could be anything. Recall that it is Path(...).parts that turned a generic string into a structure that a computer (and human!) can easily reason about.

Picture of a coiled snake and the main logo of Inspired Python

Liked the Article?

Why not follow us …

Be Inspired Get Python tips sent to your inbox

We'll tell you about the latest courses and articles.

Absolutely no spam. We promise!