Database Disk Management

Petro is a storage engine that I am building to learn about databases, and this article explores its disk management layer. Disk management greatly affects database performance, and I had two design goals to make Petro's disk reads and writes efficient: provide a non-blocking interface to higher-level modules, and utilize the disk's internal parallelism. The disk management component is composed of a DiskManager and a DiskScheduler.

    block-beta
    block
        columns 1
        block
            DiskScheduler
        end
        space
        block
            DiskManager
        end
    end

    DiskScheduler --"Read/Write Page"--> DiskManager

The DiskManager makes it possible for the database to work with more data than the available RAM. For example, if a database holds 20GB of data and the machine it runs on has 8GB of RAM, it can't load all of that data into memory. Instead, the disk manager loads fixed-size chunks of data called pages; Petro uses 4KB pages, while other databases use 8KB or 16KB pages. The DiskManager exposes two methods, readPage and writePage.

    block-beta
    columns 1
    DB ["DiskManager"]
    space
    block
        A ["Page A (4KB)"]
        B ["Page B (4KB)"]
        C ["Page C (4KB)"]
    end

    DB --"read"--> A
    DB --"write"--> B
    DB --"read"--> C

The DiskScheduler exposes a Schedule method, which the BufferpoolManager calls to read and write pages by submitting a DiskRequest.

    block-beta
    A ["BufferpoolManager"]
    space:2
    B ["DiskScheduler"]

    A --"Schedule(DiskRequest)"--> B

A DiskRequest has a response channel as one of its properties, and the DiskScheduler writes to that channel when it has fulfilled the request. After submitting a DiskRequest, the BufferpoolManager can do other work and read from the DiskRequest's response channel whenever it is ready. This fulfills my first design goal, a non-blocking interface, because the BufferpoolManager is not blocked waiting for the disk request to complete.

    classDiagram
    class DiskRequest {
        PageId: int
        IsWrite: bool
        Data: []byte
        RespCh: chan DiskResp
    }

The DiskScheduler maintains a queue of DiskRequests to which it adds new requests. It spawns multiple goroutines to handle them: if there are five write DiskRequests in the queue, the DiskScheduler spawns five goroutines that write to the disk simultaneously, achieving my second design goal of utilizing the disk's parallelism. But there is a catch: this means that DiskRequests for the same page may be fulfilled out of order.

For example, if there are three write requests for the page with id 1, R1 writing 1, R2 writing 0, and R3 writing 5, they may be handled in any order, so the final value in the page is not guaranteed to be 5. To solve this, the DiskScheduler maintains a per-page queue instead of a single queue. Requests for a particular page are handled in order by a single goroutine, while requests for two different pages are handled in parallel by two goroutines.