Chickpea: Compression

[Embedded emscripten build of the game]

This is what it looks like when your decompression routine doesn’t work1.


Today, I worked on both compression and decompression of data for Chickpea, and the start of a resource system. Up until this point, all data for the game was uncompressed, and I didn’t have to think too much about it. Now, the still tiny GBA ROM is half the size, from 50 KB to 25 KB, and the .wasm file went from 606 KB to.. 605 KB?

Huh. We’ll get back to that.

But first, decompression. The Game Boy Advance has several BIOS routines dedicated to decompressing data that most games make use of, and there are two that I’m making use of here. Both decompress an LZ77 encoded stream of data, and differ only in their intended destination. One decompresses to a location in Work RAM (WRAM), and the other to Video RAM (VRAM). Their API is the same:

// chickpea/common.h
void decompress_lz77_wram(const void *restrict nonnull src,
			  void *restrict nonnull dst);

void decompress_lz77_vram(const void *restrict nonnull src,
			  void *restrict nonnull dst);

The reason for the two functions is that this compression format is naturally byte-oriented. It wants to write (and read) data a single byte at a time, but while Work RAM can be written to a byte at a time, Video RAM cannot2. So the implementation of the VRAM function is, I presume, more complicated to work around this fact, and GBATEK says it is slower as well.
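To make the restriction concrete, here's a minimal sketch (not the BIOS's actual implementation, and the function name is my own) of the kind of workaround the VRAM variant needs: instead of storing each decompressed byte directly, pairs of bytes get combined and stored as one 16-bit halfword, which is the smallest write VRAM supports.

```c
#include <stdint.h>

/* Hypothetical sketch: buffer pairs of decompressed bytes and store
 * them as single 16-bit halfwords, since VRAM only supports 16-bit
 * and 32-bit writes. Assumes an even-length, even-aligned run. */
static void write_bytes_as_halfwords(const uint8_t *src, uint16_t *dst,
				     uint32_t len)
{
	for (uint32_t i = 0; i + 1 < len; i += 2) {
		/* Little-endian: low byte first, high byte second. */
		dst[i / 2] = (uint16_t)(src[i] | src[i + 1] << 8);
	}
}
```

The real routine also has to handle back-references that land on odd offsets, which is where the extra complexity (and slowness) comes from.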

The Decompression Algorithm

On the desktop and WASM, though, we only need one implementation because we don’t have that hardware restriction. I’m going to run through it, but first I want to say that the original implementation comes from the emulator mGBA, and that this version is pretty much a carbon copy except without all the emulator-y-ness. You can compare with the original code here if you’d like.

// chickpea/emulated.c
static void gba_lz77_decompress(const uint8_t *restrict nonnull src,
				uint8_t *restrict nonnull dst)
{
	uint32_t remaining = src[3] << 16 | src[2] << 8 | src[1];
	src += 4;

We start by figuring out how big the uncompressed data stream is, which is stored in the 2nd, 3rd and 4th bytes in little-endian order. The first byte, src[0], holds a header marker (0x10 for LZ77), but we ignore it.
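For example, the four header bytes 0x10 0x40 0x00 0x00 (the same bytes that open the baked asset shown further down) decode to a 64-byte uncompressed size. A small helper, named here just for illustration, makes the byte order explicit:

```c
#include <stdint.h>

/* Reads the decompressed size from a GBA LZ77 header: byte 0 is the
 * type marker (0x10 for LZ77), bytes 1-3 hold the size, little-endian. */
static uint32_t lz77_decompressed_size(const uint8_t *src)
{
	return (uint32_t)(src[3] << 16 | src[2] << 8 | src[1]);
}
```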

	uint8_t block_header = 0;
	uint8_t blocks_remaining = 0;
	const uint8_t *disp = 0;
	uint8_t bytes = 0;

	while (remaining > 0) {
		if (blocks_remaining) {
			/* TODO */;
		} else {
			block_header = *src++;
			blocks_remaining = 8;
		}
	}

The compressed data stream is made up of a series of blocks, and each chunk of eight blocks is led by a single-byte header, where each bit describes what kind of block the corresponding block is. Since blocks_remaining starts as zero, reading this block_header is the first thing we do.

	while (remaining > 0) {
		if (blocks_remaining) {
			if (block_header & 0x80) {
				/* TODO */
			} else {
				/* TODO */
			}
			block_header <<= 1;
			--blocks_remaining;
		} else {
			block_header = *src++;
			blocks_remaining = 8;
		}
	}

While we have blocks_remaining we can check the top bit (0x80) of the block_header and branch on whether it is set, for the two kinds of blocks. Once we handle that block, we shift the block_header to the left by one, so that the next block’s bit will be the topmost bit, and we subtract one from blocks_remaining since we handled the block in the conditional.

So what are the two types of blocks?

	while (remaining > 0) {
		if (blocks_remaining) {
			if (block_header & 0x80) {
				/* TODO */
			} else {
				*dst++ = *src++;
				--remaining;
			}
			block_header <<= 1;
			--blocks_remaining;
		} else {
			block_header = *src++;
			blocks_remaining = 8;
		}
	}
}

The simpler of the two kinds, where the block_header bit is unset, simply means to copy one byte from the source data to the destination.

When the bit is set, however, let’s zoom in:

	if (block_header & 0x80) {
		uint16_t block = src[1] | src[0] << 8;
		src += 2;
		disp = dst - (block & 0x0FFF) - 1;
		bytes = (block >> 12) + 3;
		while (bytes--) {
			if (remaining) {
				--remaining;
			}
			*dst++ = *disp++;
		}
	}

We first read two bytes; together they contain one four-bit value and one twelve-bit value. The twelve-bit value is an offset that lets us look back at data we have already written into dst. We store a pointer to the start of this in disp. The four-bit value is how many bytes we are going to copy, stored as bytes - 3, because if we were ever going to copy fewer than three bytes this way, it would be pointless compared to just repeating them in the data stream.
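A worked example, pulled out into a standalone helper (the name is mine, not from the game's code): the pair of bytes 0x30 0x05 forms the block 0x3005, whose top nibble 0x3 means a copy length of 6, and whose low twelve bits mean "look back 6 bytes".

```c
#include <stdint.h>

/* Decodes one two-byte back-reference block: the top four bits encode
 * the copy length minus three, the low twelve bits encode the backward
 * offset minus one. */
static void decode_pointer_block(const uint8_t *src, uint32_t *offset,
				 uint32_t *length)
{
	uint16_t block = (uint16_t)(src[1] | src[0] << 8);
	*offset = (uint32_t)(block & 0x0FFF) + 1;
	*length = (uint32_t)(block >> 12) + 3;
}
```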

Once we have these two values, we run in a loop, copying from disp into dst while counting down the overall remaining total.

All together now:

static void gba_lz77_decompress(const uint8_t *restrict nonnull src,
				uint8_t *restrict nonnull dst)
{
	uint32_t remaining = src[3] << 16 | src[2] << 8 | src[1];
	src += 4;
	uint8_t block_header = 0;
	uint8_t blocks_remaining = 0;
	const uint8_t *disp = 0;
	uint8_t bytes = 0;
	while (remaining > 0) {
		if (blocks_remaining) {
			if (block_header & 0x80) {
				uint16_t block = src[1] | src[0] << 8;
				src += 2;
				disp = dst - (block & 0x0FFF) - 1;
				bytes = (block >> 12) + 3;
				while (bytes--) {
					if (remaining) {
						--remaining;
					}
					*dst++ = *disp++;
				}
			} else {
				*dst++ = *src++;
				--remaining;
			}
			block_header <<= 1;
			--blocks_remaining;
		} else {
			block_header = *src++;
			blocks_remaining = 8;
		}
	}
}

One important thing to note about this algorithm is that disp can chase dst. To repeat a bunch of zeroes, we could output a single zero, then have disp equal to dst-1, and have bytes be whatever number we need. Even though disp is continuously incrementing, it’s going to stay behind dst, and it’s going to encounter those zeroes it just wrote. This works not just with a single value; patterns of values can be repeated this way too, with a disp that is further behind.
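The chase is easy to demonstrate in isolation. In this sketch (a standalone illustration, not code from the game), the copy loop reads bytes that earlier iterations of the same loop just wrote, so one seed byte spreads across the whole run:

```c
#include <stdint.h>

/* Demonstrates disp chasing dst: with the source pointer `back` bytes
 * behind the destination, each iteration may read a byte a previous
 * iteration wrote, which is how LZ77 expresses runs and repeats. */
static void overlapping_copy(uint8_t *dst, uint32_t back, uint32_t count)
{
	const uint8_t *disp = dst - back;
	while (count--)
		*dst++ = *disp++;
}
```

With back = 1 this fills a run with one repeated byte; with back = 2 it repeats a two-byte pattern, and so on.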

Compression

How do you compress data into this format, though? Well, I can’t actually tell you, because I didn’t have to re-implement it. In my utility that bakes the game’s assets, I’m using the libflate crate. It does something very handy: it can give you a vector of ‘codes’:

pub enum Code {
    /// Literal byte.
    Literal(u8),

    /// Backward pointer to shared data.
    Pointer {
        /// Length of the shared data.
        /// The values must be limited to `MAX_LENGTH`.
        length: u16,

        /// Distance between current position and start position of the shared data.
        /// The values must be limited to `MAX_DISTANCE`.
        backward_distance: u16,
    },
}

Which are the two kinds of blocks I was talking about above. So I did have to write code to encode the Vec<Code> in the format the Game Boy Advance expects for its BIOS routines. I wasn’t entirely sure if this is actually some standard encoding for LZ77, but it was straightforward enough. libflate handles the hard part.
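My encoder is in Rust, but to sketch the idea of that encoding step in C terms (this is an illustration I wrote for this post, not the actual baking code): each group of up to eight codes gets a header byte whose bits flag the pointer blocks, followed by each block's payload.

```c
#include <stdint.h>

/* Hypothetical sketch of encoding one group of up to eight codes into
 * the GBA's LZ77 format. Literals emit one byte; pointers emit two,
 * packing (length - 3) into the top nibble and (distance - 1) into the
 * low twelve bits, high byte first. Returns bytes written. */
struct code {
	int is_pointer;
	uint8_t literal;
	uint16_t length;   /* 3..18 */
	uint16_t distance; /* 1..4096 */
};

static uint32_t encode_group(const struct code *codes, uint32_t n,
			     uint8_t *out)
{
	uint8_t *p = out;
	uint8_t *header = p++; /* Reserve the header byte up front. */
	*header = 0;
	for (uint32_t i = 0; i < n && i < 8; ++i) {
		if (codes[i].is_pointer) {
			*header |= (uint8_t)(0x80 >> i);
			uint16_t block = (uint16_t)(
				((codes[i].length - 3) << 12) |
				(codes[i].distance - 1));
			*p++ = (uint8_t)(block >> 8);
			*p++ = (uint8_t)(block & 0xFF);
		} else {
			*p++ = codes[i].literal;
		}
	}
	return (uint32_t)(p - out);
}
```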

Resource System

I had to adjust a decent portion of Chickpea’s code to handle the compressed data. Right now I have a strategy that is agnostic to whether or not the data is compressed. It works like this:

// game/resource.h
struct resource {
	const uint32_t length;
	const bool lz77;
	const void *const nonnull data;
	resource_handle allocation;
};

const void *nonnull resource_data(struct resource *nonnull resource);
void resource_copy_to_vram(const struct resource *nonnull resource,
			   void *nonnull vram);

The asset baking utility now outputs instances of this resource struct, like so:

static const uint8_t map_tile_highlight_4bpp_bytes[] 
	__attribute__((aligned(4))) = {
	0x10, 0x40, 0x00, 0x00, 0x6A, 0x00, 0xF0, 0x00, 0x10, 0x00, 0x55, 0x00, 
	0x02, 0x66, 0x00, 0x02, 0x45, 0xF5, 0x00, 0x0B, 0x10, 0x00, 0x60, 0x0F, 
	0x00, 0x02, 0x44, 0x00, 0x02, 0x33, 0x00, 0x02, 0x40, 0x22, 0x00, 0x02, 
	0x11, 0x00, 0x00, 0x00, 0x00, 0x00, 
};

struct resource map_tile_highlight_4bpp = {
	.length = 64,
	.lz77 = true,
	.data = map_tile_highlight_4bpp_bytes,
};

There’s the length of the uncompressed data, a flag for whether it’s compressed in the first place, and a pointer to the (possibly compressed) bytes. While the raw bytes are constant so they can go into ROM, the struct resource itself is not. My idea was, if we were going to decompress into memory somewhere, we could record the fact that we did that in the resource itself.

Anywhere I would have referred to the raw bytes before, I now call the resource_data() function. It looks like this, slightly modified:

const void *resource_data(struct resource *nonnull resource)
{
	if (!resource->lz77) {
		return resource->data;
	}
	if (alloc_exists(resource->allocation)) {
		// Look up the decompressed data using the handle
		return allocs[resource->allocation.index].data;
	}

	void *dst = arena_alloc(resource->length);
	decompress_lz77_wram(resource->data, dst);

	resource_handle new_resource_handle = /* Allocate one */;	
	allocs[new_resource_handle.index].data = dst;

	resource->allocation = new_resource_handle;
	return dst;
}

Essentially, if the resource isn’t compressed, we can immediately return the pointer to the uncompressed data that is already in the resource. If it is compressed, we check whether we have a valid resource_handle stored; if we do, we have already decompressed the data for this resource, and we can return a pointer to it. allocs is an array of book-keeping data, with pointers back to the resource and to where the data was decompressed in WRAM.

If we don’t have a valid resource_handle then we need to decompress it. First we allocate a chunk of WRAM big enough to hold everything using this arena_alloc() function. Then we do the decompression, make a new resource_handle and store a pointer to our new allocation in allocs, so we will remember for the next time resource_data() is called and alloc_exists(resource->allocation) returns true.

resource_copy_to_vram() is much simpler:

void resource_copy_to_vram(const struct resource *nonnull resource,
			   void *nonnull vram)
{
	if (!resource->lz77) {
		cpu_fast_set(resource->data, vram, resource->length / 4);
		return;
	}
	decompress_lz77_vram(resource->data, vram);
}

Essentially, if we’re not compressed, then we call cpu_fast_set(), another GBA BIOS routine that’s like a fast memcpy(), except it deals in 32-bit words, thus the resource->length / 4. If the data is compressed, we call decompress_lz77_vram() and don’t deal with any of the handle stuff.

As far as freeing this decompressed data from resource_data(), I’m not entirely sure how that is going to work yet. Unshown here, I’m starting with the simplest thing, where there’s a function that resets everything. The kind of thing you might call in between different levels or modes of the game. If it ever needs to be more fine-grained, I could see adding tags, or multiple arenas, and freeing based on those.
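The reset-everything strategy pairs naturally with a bump arena. As a sketch (my own illustration of the idea, not Chickpea's actual allocator), allocation just advances an offset into a fixed buffer, and "freeing" means rewinding that offset, invalidating every allocation at once:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bump arena: alloc advances an offset; reset rewinds it,
 * freeing everything at once (e.g. between levels or game modes). */
static uint8_t arena_buffer[4096];
static size_t arena_offset;

static void *arena_alloc_sketch(size_t size)
{
	/* Round up to 4-byte alignment, which decompression wants. */
	size = (size + 3) & ~(size_t)3;
	if (arena_offset + size > sizeof(arena_buffer))
		return NULL;
	void *ptr = &arena_buffer[arena_offset];
	arena_offset += size;
	return ptr;
}

static void arena_reset_all(void)
{
	arena_offset = 0;
}
```

Any resource_handle taken out before the reset would of course have to be invalidated too, which is part of what the book-keeping in allocs is for.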

606 KB to 605 KB?

Oh right! So, if the GBA ROM shrank from 50 KB to 25 KB, why did the .wasm file only shrink by 1 KB? Unlike a ROM, the .wasm file is made up of multiple sections, like an executable or library. One of these sections is the data section, which contains the static data from your program.

We can view some basic information about the different sections from a .wasm file using the wasm-objdump program from the WebAssembly Binary Toolkit:

$ wasm-objdump -h 2020-08-13_chickpea.wasm

2020-08-13_chickpea.wasm:	file format wasm 0x1

Sections:

     Type start=0x0000000b end=0x00000231 (size=0x00000226) count: 72
   Import start=0x00000234 end=0x0000214f (size=0x00001f1b) count: 237
 Function start=0x00002152 end=0x000023e2 (size=0x00000290) count: 654
   Global start=0x000023e4 end=0x000023f4 (size=0x00000010) count: 2
   Export start=0x000023f7 end=0x000026fa (size=0x00000303) count: 47
     Elem start=0x000026fd end=0x00002b45 (size=0x00000448) count: 1
     Code start=0x00002b49 end=0x000875b5 (size=0x00084a6c) count: 654
     Data start=0x000875b9 end=0x00097662 (size=0x000100a9) count: 109

We can see here, in today’s build, our data section is 0x100a9 bytes long and contains 109 different elements. The WebAssembly spec says that the initial contents of memory should be all zeroes, except for the values copied into it from the data section. Each item in the data section specifies a place to copy data to, and the data to copy there. A realization occurred to me and I decided to use the same utility on a previous build:

$ wasm-objdump -h 2020-08-12_chickpea.wasm

2020-08-12_chickpea.wasm:	file format wasm 0x1

Sections:

     Type start=0x0000000b end=0x00000231 (size=0x00000226) count: 72
   Import start=0x00000234 end=0x0000214f (size=0x00001f1b) count: 237
 Function start=0x00002152 end=0x000023dd (size=0x0000028b) count: 649
   Global start=0x000023df end=0x000023ef (size=0x00000010) count: 2
   Export start=0x000023f2 end=0x000026f5 (size=0x00000303) count: 47
     Elem start=0x000026f8 end=0x00002b40 (size=0x00000448) count: 1
     Code start=0x00002b44 end=0x00086377 (size=0x00083833) count: 649
     Data start=0x0008637b end=0x00097a82 (size=0x00011707) count: 356

The size is 0x11707 bytes, a bit over a kilobyte more, and.. it has 356 elements?

Essentially, some part of the WASM toolchain was smart enough to break up any single data element that had long spans of zeroes into multiple elements, without those unnecessary zeroes specified. Neat. Though it suggests that at the moment, with the data I have, LZ77 isn’t doing much more than copying zeroes around for me.


  1. Actually, at the point of this build, the decompression routine worked, but it wasn’t being triggered in some cases, so you are seeing a mix of compressed and decompressed data, interpreted as if it were all already decompressed. ↩︎

  2. Only 16-bit and 32-bit writes are intended. If you try to write a single byte to VRAM, the hardware doesn’t stop you, but it trashes the neighboring byte as if you actually wrote two. ↩︎