Python Pattern Matching Examples: Working with Paths and Files
Manipulating file and path strings is dreary work. It is a common activity, particularly in data science, where the file structure may contain important semantic clues like a date or the source of the data. Extracting that information is usually done with a mixture of if statements and liberal use of pathlib or os.path, but the structural pattern matching feature introduced in Python 3.10 can cut down on the tedium.
Consider a directory structure whose paths are made up of the following fields:
- cpi means Consumer Price Index;
- country is the ISO-3166 code for a country;
- yyyy-mm-dd is the ISO date for the particular month;
- yyyy-qq is the year and quarter;
- filename is an arbitrary filename; and
- ext is an extension.
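To make the layout concrete, here is a small sketch that builds two hypothetical paths from those fields. The directory names that separate monthly from quarterly data (by-month, by-quarter), the quarter notation, the country codes, and the file names are all assumptions made for illustration.

```python
from pathlib import PurePosixPath

# Hypothetical examples: only the kinds of fields come from the list above;
# every concrete value here is made up.
monthly = PurePosixPath("cpi", "gb", "by-month", "2021-07-01", "series.csv")
quarterly = PurePosixPath("cpi", "us", "by-quarter", "2021-Q3", "series.xlsx")

print(monthly)    # cpi/gb/by-month/2021-07-01/series.csv
print(quarterly)  # cpi/us/by-quarter/2021-Q3/series.xlsx
```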
Ordinarily, you’d just split the path and write some quick logic that picks out what you need. That works fine for simple things, but if you have to deal with dozens of variable fields in the file path, the approach will not scale. So let’s look at a way that will scale, using the match statement.
Dispatching to the correct reader by country
The first consideration – as this is just an example – is separating the logic that parses the file paths from the logic that processes the files. The vast majority of “structured” data, like CPI indices, varies greatly by the body responsible for generating it, and there may well be more than one source of truth. So in the example above, the country field is something we cannot wish away or pretend will work everywhere.
Let’s flesh out a few skeleton functions that do the latter. I won’t cover the hypothetical parsing itself, but Python Pattern Matching Examples: ETL and Dataclasses lays out an example that shows you how you can.
This controller function takes as input a filepath to the underlying data, an observation_date and, since the choice of reader depends on it, the country. I’ve added a couple of examples to demonstrate what such a controller could look like. At this point I’m not interested in the file logic; it pays to think about the core of the application before I worry about that. Here there are a few key points:
- Reading a time series file is a product of the country and the observation date
It’s possible (well, an ironclad certainty in real life!) that the data format will change over time. Other complications could include determining the correct reader based on parts of the filename or extension – but more on that later – so there’s room for that, too.
- Combining rules makes it easier to understand what is going on
Some countries may share the same data format, so I may as well combine them into one case statement to save on “cognitive load” for any future developer who may come across it. Adding or removing countries is thus also very easy.
- I can still use if statements when it makes sense to do so
I could have made the if statement a guard by putting it in the case statement itself. I opted not to, but for complex rules you may want to do that, particularly if you have many rules that are similar but differ only slightly.
- Fail immediately if there is no valid reader
I raise a ValueError for brevity, but a custom exception would be better in a real application.
So that takes care of the controller that’ll read the contents of the file. Now let’s move up a layer and think about how to get the information out of our hypothetical directory structure.
Matching directory and file paths
Now, unfortunately, the pattern matching engine does not support complex in-string pattern matching like, say, regular expressions, so we’ll have to come up with another way of giving structured data to the pattern matching engine.
Of the obvious ways to do that, I prefer pathlib.Path (see Common Path Patterns for more information) as it’s much easier to reason about.
The Path class can split a filepath into the constituent parts that make up the full file path, exposed as its parts attribute, which, to my eyes, looks like a very useful structure to pattern match against.
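As a quick illustration, here is what parts returns for a hypothetical relative path (the directory and file names are made up):

```python
from pathlib import Path

# A made-up relative path; parts splits it on the path separator.
parts = Path("cpi/gb/by-month/2021-07-01/series.csv").parts
print(parts)
# ('cpi', 'gb', 'by-month', '2021-07-01', 'series.csv')
```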
The function takes either a string or a Path and turns it into a tuple of parts that should look a bit like this:
(<data source>, <iso country>, <frequency>, <observation date>, <filename>)
In each case statement I make a literal match against "cpi" because that is the only data source we (currently) support, but it’s easy to imagine that list growing very long indeed in a real application.
Unlike the previous example I added guards instead of regular if statements, and there is a good reason for that:
- I am guarding the pattern I want to match against to ensure it has the basic structure I expect
Each of the two checks only validates that the structure is what I superficially want it to be: country_code must be a two-letter ISO code for a country, but I do not care at that point whether it’s a legitimate country; and I use a quick’n’dirty regular expression to ensure the date structure looks like an ISO date. Note, again, that I am not checking if the date is valid, only that it meets the prescribed format.
So, I could make them if statements inside each case block, but then I would have to raise exceptions if either of the two checks failed. With guards, a path that fails a check simply falls through to the next case. I can now – though I haven’t for brevity’s sake – check if the country_code that did pass the guard is, in actual fact, a real country or not. The same goes for the date: 9999-99-99 would pass the guard but would not survive proper date validation.
- Pattern Matching is useful even for mundane activities
Dealing with files and paths is all too common, and pattern matching can cut down on the never-ending warren of if statements that inevitably follows.
- A lot of problems are simpler if you find a commonality or shared structure to them
Here the problem is a directory structure with a lot of context trapped in the directory names, but it could be anything. Recall that it is Path(...).parts that turned a generic string into a structure that a computer (and human!) can easily reason about.
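Returning to the deeper checks that the guards deliberately skip, a minimal sketch of validating the fields after a pattern has matched could look like this. The function name validate_fields and the tiny set of country codes are hypothetical; a real application would use a full ISO 3166 list.

```python
import datetime

# Tiny illustrative subset of country codes (an assumption, not a full list).
KNOWN_COUNTRIES = {"gb", "us", "ca"}


def validate_fields(country_code, date_text):
    """Check that values which passed the structural guards are actually valid."""
    country = country_code.lower()
    if country not in KNOWN_COUNTRIES:
        raise ValueError(f"Unknown country code: {country_code!r}")
    # date.fromisoformat raises ValueError for impossible dates
    # such as 9999-99-99, which a simple regular expression lets through.
    return country, datetime.date.fromisoformat(date_text)
```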