Game Boy Emulator: Designing the CPU

In Game Boy Emulator: Writing the Z80 Disassembler we learned how to write an instruction decoder and disassembler. It’s an important first step towards writing a Game Boy emulator. Assembly language – or the binary machine code form of it anyway – is the language of the CPU, and so we must represent, in software, a facsimile of a real CPU that can execute these machine code instructions for us.

So let’s start off with a quick overview of what exactly a CPU is and does, and how we’ll emulate it.

What is a CPU, exactly?

What’s a CPU, and what does it take to make one that is – loosely speaking, here – general enough that you can write any program you like?

Well, you need very little, as it turns out. It’s both a thought experiment and a real focus of study for Computer Scientists. It’s an interesting field, and it’s divided into many different areas, focused around the theory of computation and the model of computation. Computer systems in wide use today, including our Game Boy system, is a manifestation of these theories and models of computation.

Almost all of our computer systems, today, are practical implementations of these theoretical concepts: the von Neumann Architecture and a pragmatic fusion of these Register Machines, which themselves are offsprings of the Turing Machine.

But why is that interesting here, to us, now? Because from the wellspring of these concepts is a system that is general – to be specific, computationally universal – enough to represent and execute any program. These Turing-complete systems (or programming languages running on such a system) dictate if we can write any program we so desire.

Thus, from a series of theoretical concepts CPU makers have had a blueprint, of sorts, that provide a theoretical grounding for what must be present to allow us, the developers, to write software — even if we have a finite amount of memory and time, unlike their theoretical counterparts. This manifests itself as an instruction set – you’re familiar with them a bit now, having written a disassembler and decoder – and some other fundamental concepts I’ll talk about in a moment. But all of it culminates in layers of abstraction that we all build our own software on top of.

Once you understand how a CPU works you can apply that knowledge to most other CPUs or virtual machines. Luckily, that is not too difficult with a Z80 CPU.

If you had to design the simplest possible CPU that can compute any program, it must have some form of:

A stream of instructions to fetch, decode, and execute

There is a – possibly infinite – stream of instructions the CPU must fetch; decode; and then execute. Every execution mutates the state of the CPU or peripherals, like the memory banks.

A method of keeping track of the current instruction the CPU is executing

When an instruction is executed the CPU must proceed, somehow, to the next logical instruction it is to read. That may or may not be the instruction immediately following the current one. It could be any instruction in a potentially infinite sequence of instructions. The CPU keeps track of where it is in something called the program counter, or PC.

When an instruction is decoded, its program counter is advanced. When the instruction is executed, that instruction may in turn alter the program counter, like jumping forward or backward a relative number of positions, or to an absolute location.

Ask yourself why an instruction would want to modify the program counter directly

The ability to recall and store facts

One such fact is the Program Counter as that has to be stored somewhere. Aside from that, you may have some amount of memory available to the CPU also.

Typically, things like the PC is stored in a register with a defined size, in bits. Some registers are general purpose and can be used for anything the programmer desires. Others serve specific purposes (like the aforementioned program counter) that aid programmers in certain tasks or serves as a way for the CPU to communicate its state to the programmer. There is a finite number of registers, and each register is a scarce resource.

As the registers are so important to the operation of most CPU designs, they have instructions tailored to their specific needs to speed up execution and save on size (having to specify the registers an opcode should use would take several bytes extra of storage per instruction.)

However, you can build a CPU without registers! But that doesn’t mean the alternative is RAM as you’d recognize it from your own computer. It could just as easily be punch cards, or some other medium that you can store and recall from!

Basic programming concepts like arithmetic operations and conditional checking

Arithmetic is more often than not only addition and subtraction; and conditional checking is often implemented as a special form of subtraction: R = A - B, so that when R is zero it indicates equality. The outcome of the arithmetic or conditional operation is then stored. Where depends on the CPU architecture.

A known, fixed state when the system is first started

That means a static starting position for the program counter and any other “state” the CPU depends on.

An instruction set that serves as the language of the CPU

This is how we interact with the CPU and tell it what it needs to do. How many instructions there are varies greatly, and some concepts you can express using other instructions, like removing addition in favor of only having subtraction.

And that’s it. With these capabilities your CPU is universal enough to compute anything (limited, of course, only by your memory.)

In fact, there’s a One-instruction set computer that is universal and yet has just a single instruction that, given enough patience, can be used to build anything. And if you scrutinize the requirements – there are several ways of implementing such a computer – you will see that all of them will need exactly what you see above. Even if the theory of computation does not interest you, I suggest you have a quick look. It’s a pretty remarkable testament to how little is required for universal computation to succeed.

Let’s move on and cover what the Z80 can do.

The Z80 CPU


For starters, you have a number of registers at your disposal. As you may recall, the Z80 is an 8-bit processor with a 16-bit address bus. That means there has to be a way of operating on both 8-bit and 16-bit numbers, and the way to do both is using a clever design that lets you read some of the registers as either an 8-bit or 16-bit word depending on which register you’re reading.

If you go to pandocs’s CPU Registers and flags you’ll see a list of registers and their names:

  1. AF, consisting of A, the accumulator register; and F, an internal register that holds the flags.

  2. BC, a general purpose register

  3. DE, a general purpose register

  4. HL, a general purpose register and it has a large number of instructions that simplifies iterative code, like loops; 16-bit arithmetic; and data loading, bit manipulation, and more.

  5. SP, the stack pointer. Used for call stacks and makes it possible for the CPU to natively support function calls and return values. Has a number of specialized instructions that makes it easier to do this.

  6. PC, the Program Counter. Used to track the location of the next executing instruction. Refers to a memory address.

Of note is the concept of “high” and “low” registers. For instance, the general purpose register BC is composed of the high byte B and low byte C, as its name alludes to.

That means you can request the full 16 bit value from BC or just the upper or lower 8 bits with either B or C, respectively. That’s pretty nifty, and is a very useful way of operating selectively on part of a value instead of the whole one.

For instance, if you have the value BC = 0x1234 you can interpret it as that, or as B = 0x12 or C = 0x34.

Keep in mind that not all registers support this mode of operation, and one in particular is completely inaccessible, despite being named. That’s the F register; it’s only available for programmers indirectly through other instructions.

Separating a 16-bit word into two blocks of 8 bits is a clever mechanism that lets you calculate numbers larger than what the CPU could easily do on its own.

This is a hypothetical example as the Z80 does come with a few 16-bit instructions, but let’s pretend it is only able to reason about 8 bits of data at a time. 0 through to 255.

What if you want to loop 5000 times, well in excess of what the CPU is physically able to keep track of with a single byte? How do you work around that?

Let’s say we set BC=0 and we want to stop when we reach 5000, or when BC = 0x1388:

  1. Check if B = 0x13 and C = 0x88. If yes, we’re done and can exit the loop.

  2. Increment C by 1.

  3. If C = 0x0 (you’d check the zero flag, but that explanation is for later!) then increment B by 1.

  4. Goto 1

We need a way of storing and recalling the values of the Z80’s registers

So, we need a way of representing all these registers in Python. Luckily that is trivial for an emulator writer: we have variables. So we’ll need a variable for each register, but a method of reading and writing to the high and low parts of a 16-bit register also.


When the CPU executes instructions it has side effects. Some of those side effects may yield important information about the new state the CPU finds itself in after executing each instruction. For instance, if you ask the CPU to compare two numbers, how would it communicate the outcome of that compare instruction back to you?

The answer is the flags register. As the pandocs documentation shows, there’s four flags, each occupying a single bit on the F register portion of AF. You may note that not all eight bits are mapped to a flag; the rest are unused.

Look back to the disassembly from the snake game from the last chapter:

1;; disassembly
20150 NOP
30151 DI
40152 LD       SP, 0xfffe
50155 LD       B, 0x80
60157 LD       C, 0x0
70159 LDH      A, (0x44)
8015B CP       0x90
9015D JR       NZ, 0xfa
10015F DEC      C
110160 LD       A, C
120161 LDH      (0x42), A
130163 DEC      B
140164 JR       NZ, 0xf3
150166 XOR      A
160167 LDH      (0x40), A
170169 LD       A, 0x0

Pay attention to lines 8 and 9.

CP 0x90 is a compare instruction. It does a compare against A, so that if A - 0x90 = 0 the zero flag is set (bit 7 in F is 1); or else it is unset. JR NZ, 0xFA is a jump instruction that jumps if the zero flag is unset.

Terminology can vary slightly. “Set” means a bit is enabled, i.e., it has a value of 1. Reset or unset then means that it’s 0.

Flags, then, serve the role of informing the programmer when particular events occur following the execution of an instruction. The role each flag plays will become apparent once you start emulating the instructions.

We need a way of setting and resetting bits in a bit field

The Z80 makes use of several flags, and they’re all stored in the F register, encoded as individual bits in a bit field. You cannot access this register directly.

Instruction Decoder

Building on the decoder we wrote in the last chapter, the CPU needs a stream of instructions to fetch, decode, and execute. As we’re not quite ready to work on the memory and memory banks just yet, we’ll leave the implementation as it is, and go back and revisit it later.

Once the CPU is ready and running, it must send the PC to the decoder so it can fetch that instruction from (later memory, for now the ROM stream we read in) and return the new address for the PC and, of course, the instruction the CPU must execute:

  1. Ask the decoder for the instruction at PC

  2. The decoder fetches and decodes the instruction and returns it, along with the next logical position in the stream.

  3. The CPU executes the instruction (which may alter the PC, as it’s a read-writable register)

Building the CPU

It’s time to flesh out our skeleton classes so the core of the emulator can take shape. Let’s start with the registers.

Representing the Registers and Flags

As I mentioned earlier, you need to think of registers as nothing more than variables. However, the only two snags is the high and low concept of a 16-bit word, and that flags are bitfields stored in the low word of AF.

So if you have the 16-bit number 0xABCD and you wanted the high and low words separately, you’ll need to twiddle some bits, as the expression goes.

A Quick Primer on Bit Manipulation

I’ll quickly cover what we need, but a deeper dive is forthcoming once we look at some of the instructions in depth. Having said that, I urge you to experiment with this as an intuitive understanding is essential.

So to get or set arbitrary bits we’ll need to understand a few basic things about binary and bitwise operations. Consider this binary representation of the number 0xAB:

| 128 | 64  | 32  | 16  |  8  |  4  |  2  |  1  |
+-----+-----+-----+-----+-----+-----+-----+-----+ = 0xAB
|  1  |  0  |  1  |  0  |  1  |  0  |  1  |  1  |

Let’s say I wanted 0xA of 0xAB. I could achieve this by shifting right four bits, as I’d want the high part of 8 bits (which is 4 bits):

>>> hex(0xAB >> 4)

Because most computers count bits from right-to-left, and because shifting moves bits N positions left or right, the right shift effectively erases the bits it replaces:

| 128 | 64  | 32  | 16  |  8  |  4  |  2  |  1  |
+-----+-----+-----+-----+-----+-----+-----+-----+ = 0xA
|  0  |  0  |  0  |  0  |  1  |  0  |  1  |  0  |
Shift right 4 ->        \-----------------------/

As we’re shifting to the right, the values are instead padded with 0s and is therefore redundant, in much the same way that writing 00042 is the same as 42 in decimal.

Contrast that with a left shift which does the opposite:

>>> hex(0xA << 4)

Which shifts left and again pads in the direction you’re shifting. Note, though, the new value 0xA0. Much like multiplying by 10 in decimal adds a zero, so too does a left shift.

Left shifting is equivalent, then, to multiplication, and right shifting to division in powers of two. Therefore, an enterprising programmer with nothing more than bit shifts and add and subtract can simulate multiplication and division even if the CPU does not support such operations.

Try dividing and multiplying with numbers that’re powers of two to replicate bit shifting.

So that’s the high word. What if I want the lower word? I cannot shift, because that’ll just erase the number. Instead I need a way of extracting the bits — an approach that will serve us well going forward.

One way to get the bits you want is to ask if the bits we want are set, and then return them. That’s easily done thanks to the bitwise & operator and a concept called bit masking.

Bit Masking

Bitwise operations work much like their logical counterparts you’d use in an if statement. Bitwise operations operate on each bit and does a logical and on each pair of bits.

|  8  |  4  |  2  |  1  |  Bit
|  0  |  1  |  1  |  0  |  A
+-----+-----+-----+-----+  &
|  1  |  0  |  1  |  1  |  B
+-----+-----+-----+-----+  =
|  0  |  0  |  1  |  0  |  C
>>> 0b0110 & 0b1011

Consider what happens when you bitwise & those two numbers together. Because a logical and is applied to each bit in order, the result is the outcome of these bitwise operations.

A clever programmer can then use a mask – a number – that returns just bits they need. As you can see from the example, only bit 2 is 1 in A and B, so the result is 2,

So a mask is useful if you want to get (or set!) bits.

Masks traditionally go on the right-hand side

It’s not a hard-and-fast rule, but if you hardcode constants (as you’ll see later) it’s customary to use the right-hand side for the mask and the left-hand side for the value to mask against.

It’s fine if you do it the other way around for stylistic reasons or to mirror an algorithm you’re replicating, but be consistent about it.

Checking if one or more bits are set with bitwise AND

So, if I wanted the low bits instead, I could use the bitwise & operator to pick out the bits I wanted.

Given the number 0xAB I can get the lower four bits with a mask that has bits 1, 2, 4, and 8 set:

>>> 0b1111

Which, when I & it with 0xAB:

>>> hex(0xAB & 0b1111)

Which is the answer we want.

Setting bits with bitwise OR

Ask yourself what logical or does – it returns True if either or both of the left or right-hand side is True. This also holds for bitwise |.

Applying the concept of masks again, and you can use the same concept to set bits:

|  8  |  4  |  2  |  1  |  Bit
|  0  |  0  |  1  |  0  |  A
+-----+-----+-----+-----+  |
|  1  |  0  |  1  |  1  |  B
+-----+-----+-----+-----+  =
|  1  |  0  |  1  |  1  |  C

Given a mask we can now set bits arbitrarily with |.

>>> bin(0b0010 | 0b1011)

Where 0b1011 is indeed the bitwise | of the two inputs.

Toggling bits with bitwise XOR

XOR, or eXclusive OR, does not have a logical counterpart in Python. Its use is mostly, but not always, relegated to bitwise operations or as a specialist tool in algorithms as it possesses an interesting trait .

By the way …

That trait makes it possible to create cryptographically unbreakable one-time pads due to its unique ability to toggle bits.

Namely, that the result of an XOR (^) is only non-zero (remember, no booleans here!) if one of the left- or right-hand side is non-zero. I.e., unlike bitwise AND both sides cannot be non-zero or zero; and unlike bitwise OR, only one side can be zero, the other must be non-zero.

Applying the concept of masks again, let’s toggle some bits

|  8  |  4  |  2  |  1  |  Bit
|  0  |  0  |  1  |  0  |  A
+-----+-----+-----+-----+  ^
|  1  |  0  |  1  |  1  |  B
+-----+-----+-----+-----+  =
|  1  |  0  |  0  |  1  |  C
>>> bin(0b0010 ^ 0b1011)

Note how we toggled the bits, except for the one where both the value and mask is 0.

Bitwise operators return the result of the operation

Instead of coalescing the answer to True or False we are given the outcome of the operation instead. Very useful, that!

Bitwise operators and masks are useful tools to get, set, reset, or toggle bits

Masks are useful tools, and anything can be a mask, including another value you want to test against.

Bitwise AND checks if something is set; bitwise OR is used to set bits; bitwise XOR is used to toggle bits

You can combine bit shifting with bitwise operators

This lets you shift a single bit to the position you want to check against. Try 1 << 7 and look at its binary, decimal and hexadecimal forms.

The Registers and Flags

OK, with a primer on bits it’s time to represent the registers. You can use a simple dictionary, or explicit variables or attributes on a class — it’s up to you.

I’m going to use a modified dictionary-style notation, as it makes it easy to read and reason about. But by all means, pick a method you think makes most sense to you.

Let’s start with some mappings of the registers and flags we need, and how they relate to one another.

REGISTERS_LOW = {"F": "AF", "C": "BC", "E": "DE", "L": "HL"}
REGISTERS_HIGH = {"A": "AF", "B": "BC", "D": "DE", "H": "HL"}
REGISTERS = {"AF", "BC", "DE", "HL", "PC", "SP"}
FLAGS = {"c": 4, "h": 5, "n": 6, "z": 7}

These constants map a low register, like C, to BC. That way our code can check if a register is low, high, a 16-bit register, or a flag, with a simple chain of if statements.

from import MutableMapping

class Registers(MutableMapping):

    AF: int
    BC: int
    DE: int
    HL: int
    PC: int
    SP: int

    def values(self):
        return [self.AF, self.BC, self.DE, self.HL, self.PC, self.SP]

    def __iter__(self):
        return iter(self.values())

    def __len__(self):
        return len(self.values())

Here I’m laying out the groundwork for a custom dictionary-style class. I’m inheriting from MutableMapping, an abstract base class, that ensures I override all the right methods to provide dictionary-style access (registers["PC"] = 42.)

I also use a dataclass, and I make it so each main register is part of the constructor.

Next, it’s time to define the accessor:

def __getitem__(self, key):
    if key in REGISTERS_HIGH:
        register = REGISTERS_HIGH[key]
        return getattr(self, register) >> 8
    elif key in REGISTERS_LOW:
        register = REGISTERS_LOW[key]
        return getattr(self, register) & 0xFF
    elif key in FLAGS:
        flag_bit = FLAGS[key]
        return self.AF >> flag_bit & 1
        if key in REGISTERS:
            return getattr(self, key)
            raise KeyError(f"No such register {key}")

The code is didactic on purpose; but the intent is simple. I check each collection of registers in turn and, when I find it, I apply the correct bitwise operation as needed:

  1. High registers are bit-shifted 8 bits to the right, which yields the value we want

  2. Low registers are bitwise ANDed with the mask 0xFF as that matches the low 8 bits.

  3. Flags are instead shifted by their position in AF so the bit we care about is put in the right-most position where I can bitwise AND with 1 to check if it is set or not.

  4. Requests for 16-bit registers is a simple matter of returning that value, unmodified.

  5. Everything else is a KeyError

Setting a register is more or less the same as getting, but the bitwise operators are now |.

def __setitem__(self, key, value):
    if key in REGISTERS_HIGH:
        register = REGISTERS_HIGH[key]
        current_value = self[register]
        setattr(self, register, (current_value | (value << 8)) & 0xFFFF)
    elif key in REGISTERS_LOW:
        register = REGISTERS_LOW[key]
        current_value = self[register]
        setattr(self, register, (current_value | value) & 0xFFFF)
    elif key in FLAGS:
        assert value in (0, 1), f"{value} must be 0 or 1"
        flag_bit = FLAGS[key]
        self.AF = self.AF | (value << flag_bit)
        if key in REGISTERS:
            setattr(self, key, value & 0xFFFF)
            raise KeyError(f"No such register {key}")

def __delitem__(self, key):
    raise NotImplementedError("Register deletion is not supported")
  1. Setting a high register is a case of taking the value and shifting it to the high position 8 bits to the left. I use bitwise | to ensure we keep whatever else is in the register (current_value)

  2. For low registers I simply apply bitwise OR, as there is no shifting required.

  3. For the flags, I shift the value to the bitwise position it belongs to, and again use bitwise OR to merge.

  4. For 16-bit registers it’s a simple case of setting the value

  5. Everything else is a KeyError

You may have noticed the & 0xFFFF at the end of each assignment. It’s a safe guard against values that may overflow the 16-bit size limit of the main registers. The emulator we’re writing has 16 bits per register, but Python does not care about that. Masking against 0xFFFF ensures we never overflow.

Why does that mask work and how does it prevent overflowing? Try adding 1 to 0xFFFF and look at the binary representation before and after you mask.

Now for some tests. I’m again using hypothesis to generate strategies:

import hypothesis.strategies as st
from hypothesis import given

def make_registers():
    def make():
        return Registers(AF=0, BC=0, DE=0, HL=0, PC=0, SP=0)
    return make

    value=st.integers(min_value=0, max_value=0xFF),
def test_registers_high(make_registers, field, value):
    registers = make_registers()
    high_register, full_register = field
    registers[high_register] = value
    assert registers[full_register] == value << 8

If you’re unfamiliar with pytest fixtures, or the particular pattern you see, have a look at the pytest factory fixture pattern.

The test is simple. I create a Registers class and test that the value we set – which is randomly drawn by Hypothesis – is what we expect by manually shifting it and checking the 16-bit-sized register. (This is why it pays to keep the registers separate and easy to reuse in a test.)

Testing the low and full-sized registers is easy enough now that you’re familiar with the structure of the test and how to validate your results. Try writing them yourself.

Testing the flags is similarly simple:

    starting_value=st.integers(min_value=0, max_value=0xFF),
    value=st.sampled_from((0, 1)),
def test_flags(make_registers, starting_value, field, value):
    flag, bit = field
    registers = make_registers()
    registers[flag] = value
    assert registers["F"] == (value << bit)

The end result is a simple dict-style class that lets you query low, high, flags and full registers using their name.


The CPU is the, well, central part of the emulator, so it must be extensible, as we’ll be adding to it over time; it must be easy to swap components in or out, to facilitate testing; and it should encapsulate the state of the CPU, which will come to include a number of things we’ve not yet talked about.

You can do this with nothing but functions, but in keeping with the style so far, I’ll use a dataclass to represent the CPU.

class InstructionError(Exception):

class CPU:

    registers: Registers
    decoder: Decoder

    def execute(self, instruction: Instruction):
        # I'm using 3.10's pattern matching, but you can use a
        # dictionary to dispatch to functions instead, or a series of
        # if statements.
        match instruction:
            case Instruction(mnemonic="NOP"):
            case _:
                raise InstructionError(f"Cannot execute {instruction}")

    def run(self):
        while True:
            address = self.registers["PC"]
                next_address, instruction = self.decoder.decode(address)
            except IndexError:
            self.registers["PC"] = next_address

Pretty simple. I’m again using the dependency injection pattern. That means the CPU shouldn’t know how to create or instantiate extant classes that it depends on. To use it, you must pass the CPU constructor instances of the Decoder (that you created in the last chapter) and Registers.

I like this pattern because it separates the concerns so each class only knows exactly what it needs to know, and nothing more!. So we could, if we wanted to, pass in FakeRegisters or a FakeDecoder that mimics certain behavior we want to test, or even alternative implementations.

But as you can see, the implementation is fairly sparse. And yet this is enough to start implementing instructions. There’s no memory bank controller yet, and no display or any other fancy features we need, but that’s fine. We’ll get to them later. But because we fastidiously implemented one piece at a time, it’s now a case of snapping together the pieces we’ve built so far:

Instruction fetching and decoding is (mostly) done

Thanks to a single decode method we can now fetch an instruction from a stream of bytes; decode the instruction; advance the stream to the next position; and return the decoded instruction.

The CPU is now in a prime position to make use of that information. We know the address of the next logical instruction (next_address) and can update the PC registers accordingly.

Instruction Execution is the next big step

I’m going to use Python 3.10’s match and case keywords but don’t fret if you don’t have that version or don’t want to use that feature. You can get by with many, many if statements or use a dictionary to dispatch to a function that does the execution.

Consider how you’re storing the instructions you loaded from the opcodes file. You now need a way of calling out to a function that can handle that instruction.

I’ve added a single instruction, NOP, that fittingly doesn’t do anything (it means No Operation)

The run() method has an infinite while loop: once we start we don’t stop unless we run out of instructions – hence the IndexError check – or if the CPU is told to stop by an instruction or a human.

There’s not much to test, so let’s see if we can execute an arbitrary sequence of NOP instructions:

def make_cpu(make_registers, make_decoder):
    def make(data, pc=0):
        cpu = CPU(registers=make_registers(pc=pc), decoder=make_decoder(data=data))
        return cpu

    return make

@given(count=st.integers(min_value=0, max_value=100))
def test_cpu_execute_nop_and_advance(make_cpu, count):
    cpu = make_cpu(b"\x00" * count)
    assert cpu.registers["PC"] == 0
    assert cpu.registers["PC"] == count

And that’s it. Our skeleton CPU is complete. It can read and write to each register and flag as required; and it can execute instructions.


CPU architecture is similar to their theoretical counterparts in Computer Science

A little knowledge of the theory of computation is helpful. It explains why things are the way they are, and what is bare minimum required for a CPU – any CPU – to function.

Knowing your way around bitwise operations is important

It forms the basis for how much of the computation takes place in a real CPU, even though we’ve only looked at it briefly.

The value of testing is ever-more important

Hypothesis lets us focus on capturing the properties and the system of rules that everything we write must follow. Absent hypothesis, you can still test your emulator, but you now have to generate both the test cases and the answers.

Picture of a coiled snake and the main logo of Inspired Python

Liked the Article?

Why not follow us …

Be Inspired Get Python tips sent to your inbox

We'll tell you about the latest courses and articles.

Absolutely no spam. We promise!