Database Disk Management
Petro is a storage engine that I am building to learn about databases and this article
explores the disk management layer. Disk managment is an important part that greatly affects database
performance and I had two design goals to make Petro’s disk writes and reads efficient; provide a non-blocking
interface to higher level modules and utilize the disk’s internal parallelism. The disk managment component is composed of a DiskManager
and a DiskScheduler
.
The DiskManager
makes it possible for the database to work with more data than the available RAM. For example, if a
database has 20GB of data and the machine it runs on has 8GB of RAM it can’t load all that data onto the RAM,
instead the disk manager loads chunks of data called pages, Petro uses 4kb
pages but
other databases use 8kb
or 16kb
pages. The DiskManager
exposes two methods readPage
and writePage
The DiskScheduler
exposes a Schedule
which the BufferpoolManager calls
to read and write pages using a DiskRequest
.
A DiskRequest
has a response channel as one of its properties and the DiskScheduler
writes to that channel when it has fulfilled the
request. After submitting a DiskRequest
the BufferpoolManager
can do other tasks and read from the DiskRequest
’s response channel whenever it is ready. This fulfills my first design goal, a non-blocking interface because the BufferpoolManager is not blocked waiting for the disk request to complete.
The DiskScheduler
maintains a DiskRequest
queue to which it adds new requests. It spawns multiple go routines
to
handle the requests, that is to say, if we have five write DiskRequest
s in the queue, the DiskScheduler
will spawn
five go routines to write to the disk simultaneously, effectively achieving my second design goal of utilizing the
disk’s parallelism. But there’s a catch, this means that DiskRequests
for a page maybe fulfilled out of order.
For example, if there’s three disk requests for page with id 1, R1->1
, R2->0
, R3->5
they maybe handled in any order and
the final value in the page is not guaranteed to be 5
. To solve for this, the DiskScheduler
maintains a per page
queue instead of a single queue. So requests for a particular page are handled in order by a single go routine but requests
for two different pages are handled in parallel by two go routines.