Database Disk Management
Petro is a storage engine that I am building to learn about databases, and this article
explores its disk management layer. Disk management is an important part that greatly affects database
performance, and I had two design goals to make Petro’s disk reads and writes efficient: provide a non-blocking
interface to higher-level modules, and utilize the disk’s internal parallelism. The disk management component is composed of a DiskManager and a DiskScheduler.
block-beta
block
columns 1
block
DiskScheduler
end
space
block
DiskManager
end
end
DiskScheduler --"Read/Write Page"--> DiskManager
The DiskManager makes it possible for the database to work with more data than the available RAM. For example, if a
database has 20GB of data and the machine it runs on has 8GB of RAM, it can’t load all of that data into memory.
Instead, the disk manager loads chunks of data called pages; Petro uses 4KB pages, but
other databases use 8KB or 16KB pages. The DiskManager exposes two methods, readPage and writePage.
block-beta
columns 1
DB ["DiskManager"]
space
block
A ["Page A (4kb)"]
B ["Page B (4kb)"]
C ["Page C (4kb)"]
end
DB --"read"--> A
DB --"write"--> B
DB --"read"--> C
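As a rough sketch of how a DiskManager like this might work, the version below maps a page id to a file offset (`pageID * 4KB`) and reads or writes exactly one page there. The names `NewDiskManager`, `ReadPage`, and `WritePage`, and the error handling, are my assumptions for illustration, not necessarily Petro’s actual code.

```go
package main

import (
	"fmt"
	"os"
)

// PageSize is the fixed page size (4KB, matching Petro's pages).
const PageSize = 4096

// DiskManager reads and writes fixed-size pages at offsets
// derived from the page id.
type DiskManager struct {
	file *os.File
}

// NewDiskManager opens (or creates) the database file.
func NewDiskManager(path string) (*DiskManager, error) {
	f, err := os.OpenFile(path, os.O_RDWR|os.O_CREATE, 0o644)
	if err != nil {
		return nil, err
	}
	return &DiskManager{file: f}, nil
}

// WritePage writes one page at offset pageID * PageSize.
func (d *DiskManager) WritePage(pageID int, data []byte) error {
	if len(data) != PageSize {
		return fmt.Errorf("page must be exactly %d bytes, got %d", PageSize, len(data))
	}
	_, err := d.file.WriteAt(data, int64(pageID)*PageSize)
	return err
}

// ReadPage reads one page at offset pageID * PageSize.
func (d *DiskManager) ReadPage(pageID int) ([]byte, error) {
	buf := make([]byte, PageSize)
	_, err := d.file.ReadAt(buf, int64(pageID)*PageSize)
	return buf, err
}

func main() {
	dm, err := NewDiskManager("petro.db")
	if err != nil {
		panic(err)
	}
	page := make([]byte, PageSize)
	copy(page, "hello")
	if err := dm.WritePage(0, page); err != nil {
		panic(err)
	}
	got, _ := dm.ReadPage(0)
	fmt.Println(string(got[:5])) // prints "hello"
}
```

Computing the offset from the page id is what lets the database pull in only the pages it needs instead of the whole file.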
The DiskScheduler exposes a Schedule method, which the BufferpoolManager calls
to read and write pages using a DiskRequest.
block-beta
A ["BufferpoolManager"]
space:2
B ["DiskScheduler"]
A--"Schedule(DiskRequest)"-->B
A DiskRequest has a response channel as one of its properties, and the DiskScheduler writes to that channel when it has fulfilled the
request. After submitting a DiskRequest, the BufferpoolManager can do other tasks and read from the DiskRequest’s response channel whenever it is ready. This fulfills my first design goal of a non-blocking interface: the BufferpoolManager is not blocked waiting for the disk request to complete.
classDiagram
class DiskRequest {
PageId: int
IsWrite: bool
Data: []byte
RespCh: chan DiskResp
}
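To make the non-blocking flow concrete, here is a small sketch of submitting a DiskRequest and reading its response channel later. `Schedule` is written as a stand-alone function that fulfills the request on another goroutine, and the `DiskResp` fields are my assumptions; the real DiskScheduler does actual disk I/O.

```go
package main

import "fmt"

// DiskResp signals completion of a DiskRequest; its fields here
// are assumed for illustration.
type DiskResp struct {
	Data []byte
	Err  error
}

// DiskRequest mirrors the struct described above.
type DiskRequest struct {
	PageId  int
	IsWrite bool
	Data    []byte
	RespCh  chan DiskResp
}

// Schedule is a stand-in for the DiskScheduler's method: it
// fulfills the request on a separate goroutine and writes the
// result to the request's response channel.
func Schedule(req DiskRequest) {
	go func() {
		// ... the real scheduler would perform disk I/O here ...
		req.RespCh <- DiskResp{Data: req.Data}
	}()
}

func main() {
	req := DiskRequest{
		PageId:  1,
		IsWrite: true,
		Data:    []byte("page data"),
		RespCh:  make(chan DiskResp, 1),
	}
	Schedule(req)

	// The caller is free to do other work while the request is in flight...
	fmt.Println("doing other work")

	// ...and blocks only when it actually needs the result.
	resp := <-req.RespCh
	fmt.Println("request fulfilled, err:", resp.Err)
}
```

The caller only blocks on `<-req.RespCh` when it needs the result, which is exactly what makes the interface non-blocking.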
The DiskScheduler maintains a queue of DiskRequests to which it adds new requests. It spawns multiple goroutines to
handle the requests; that is to say, if there are five write DiskRequests in the queue, the DiskScheduler will spawn
five goroutines to write to the disk simultaneously, effectively achieving my second design goal of utilizing the
disk’s parallelism. But there’s a catch: DiskRequests for the same page may be fulfilled out of order.
For example, if there are three write requests for the page with id 1 (R1 writes 1, R2 writes 0, R3 writes 5), they may be handled in any order, and
the final value in the page is not guaranteed to be 5. To solve this, the DiskScheduler maintains a per-page
queue instead of a single queue. Requests for a particular page are handled in order by a single goroutine, but requests
for two different pages are handled in parallel by two goroutines.
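The per-page queue idea can be sketched as a map from page id to a channel, with one goroutine draining each channel in order. Spawning a worker lazily on the first request for a page, and the queue capacity, are my assumptions here, not necessarily how Petro does it.

```go
package main

import (
	"fmt"
	"sync"
)

// DiskResp signals completion; its fields are assumed for illustration.
type DiskResp struct{ Err error }

type DiskRequest struct {
	PageId  int
	IsWrite bool
	Data    []byte
	RespCh  chan DiskResp
}

// DiskScheduler keeps one queue per page id: requests for the same
// page are drained in submission order by a single goroutine, while
// requests for different pages run in parallel.
type DiskScheduler struct {
	mu     sync.Mutex
	queues map[int]chan DiskRequest
}

func NewDiskScheduler() *DiskScheduler {
	return &DiskScheduler{queues: make(map[int]chan DiskRequest)}
}

// Schedule routes the request to its page's queue, lazily spawning
// one worker goroutine per page.
func (s *DiskScheduler) Schedule(req DiskRequest) {
	s.mu.Lock()
	q, ok := s.queues[req.PageId]
	if !ok {
		q = make(chan DiskRequest, 16)
		s.queues[req.PageId] = q
		go func() {
			// One worker per page drains its queue sequentially,
			// so same-page writes apply in submission order.
			for r := range q {
				// ... perform the disk I/O for r here ...
				r.RespCh <- DiskResp{}
			}
		}()
	}
	s.mu.Unlock()
	q <- req
}

func main() {
	s := NewDiskScheduler()
	resp := make(chan DiskResp, 3)
	// Three writes to page 1 are handled in order by one goroutine.
	for _, v := range []byte{1, 0, 5} {
		s.Schedule(DiskRequest{PageId: 1, IsWrite: true, Data: []byte{v}, RespCh: resp})
	}
	for i := 0; i < 3; i++ {
		<-resp
	}
	fmt.Println("all requests for page 1 fulfilled in order")
}
```

With this layout, the R1, R2, R3 example above lands on a single channel and is applied in order, so the page ends at 5, while a request for page 2 would get its own worker and proceed concurrently.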