Database Disk Management

Petro is a storage engine that I am building to learn about databases, and this article explores its disk management layer. Disk management greatly affects database performance, and I had two design goals to make Petro's disk reads and writes efficient: provide a non-blocking interface to higher-level modules, and utilize the disk's internal parallelism. The disk management component is composed of a DiskManager and a DiskScheduler.

    block-beta
    block
        columns 1
        block
            DiskScheduler
        end
        space
        block
            DiskManager
        end
    end

    DiskScheduler --"Read/Write Page"--> DiskManager

The DiskManager makes it possible for the database to work with more data than the available RAM. For example, if a database holds 20GB of data and the machine it runs on has 8GB of RAM, it can't load all of that data into memory. Instead, the disk manager loads fixed-size chunks of data called pages; Petro uses 4KB pages, while other databases use 8KB or 16KB pages. The DiskManager exposes two methods, readPage and writePage.

    block-beta
    columns 1
    DB ["DiskManager"]
    space
    block
        A ["Page A (4KB)"]
        B ["Page B (4KB)"]
        C ["Page C (4KB)"]
    end

    DB --"read"--> A
    DB --"write"--> B
    DB --"read"--> C

The DiskScheduler exposes a Schedule method, which the BufferpoolManager calls to read and write pages by submitting a DiskRequest.

    block-beta
    A ["BufferpoolManager"]
    space:2
    B ["DiskScheduler"]

    A --"Schedule(DiskRequest)"--> B

A DiskRequest has a response channel as one of its properties, and the DiskScheduler writes to that channel when it has fulfilled the request. After submitting a DiskRequest, the BufferpoolManager can do other work and read from the DiskRequest's response channel whenever it is ready. This fulfills my first design goal, a non-blocking interface, because the BufferpoolManager is not blocked waiting for the disk request to complete.

    classDiagram
    class DiskRequest {
        PageId: int
        IsWrite: bool
        Data: []byte
        RespCh: chan DiskResp
    }

The DiskScheduler maintains a queue of DiskRequests to which it adds new requests. It spawns multiple goroutines to handle them: if there are five write DiskRequests in the queue, the DiskScheduler spawns five goroutines that write to the disk simultaneously, achieving my second design goal of utilizing the disk's parallelism. But there is a catch: this means that DiskRequests for the same page may be fulfilled out of order.

For example, if there are three write requests for the page with id 1, R1 writing 1, R2 writing 0, and R3 writing 5, they may be handled in any order, so the final value in the page is not guaranteed to be 5. To solve this, the DiskScheduler maintains a per-page queue instead of a single queue. Requests for a particular page are handled in order by a single goroutine, while requests for two different pages are handled in parallel by two goroutines.