Until the introduction of asynchronous IO postgres relied on the operating system to hide the cost of synchronous IO from postgres. While this worked surprisingly well in a lot of workloads, it does not do as good a job on prefetching and controlled writeback as we would like.
There are important expensive operations like fdatasync() where the operating system cannot hide the storage latency. This is particularly important for WAL writes, where the ability to asynchronously issue fdatasync() or O_DSYNC writes can yield significantly higher throughput.
The main reasons to want to use Direct IO are:
The main reasons not to use Direct IO are:
In many cases code that can benefit from AIO does not directly have to interact with the AIO interface, but can use AIO via higher-level abstractions. See Helpers.
In this example, a buffer will be read into shared buffers.
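A minimal sketch of what that can look like, pieced together from the APIs described in the rest of this document; the exact signatures, callback ID and status values (pgaio_io_acquire(), pgaio_io_get_wref(), pgaio_io_register_callbacks(), pgaio_io_set_handle_data_32(), smgrstartreadv(), pgaio_wref_wait(), pgaio_result_report(), PGAIO_HCB_SHARED_BUFFER_READV, PGAIO_RS_ERROR) should be verified against aio.h and smgr.h:

    /*
     * Assumed context: a pinned shared buffer marked as IO-in-progress, plus
     * the SMgrRelation, fork and block number to read; headers such as
     * storage/aio.h, storage/bufmgr.h and storage/smgr.h are assumed.
     */
    PgAioReturn ioret;        /* backend-local result, see "AIO Result" */
    PgAioWaitRef iow;         /* wait reference, see "AIO Wait References" */
    PgAioHandle *ioh;
    void       *page = BufferGetBlock(buffer);

    /* acquire a handle; this always succeeds (or PANICs), see "AIO Handles" */
    ioh = pgaio_io_acquire(CurrentResourceOwner, &ioret);

    /* remember how to wait for the IO, even after the handle gets reused */
    pgaio_io_get_wref(ioh, &iow);

    /* let bufmgr.c verify the page and update the BufferDesc on completion */
    pgaio_io_register_callbacks(ioh, PGAIO_HCB_SHARED_BUFFER_READV, 0);

    /* tell that callback which buffer the IO is for */
    pgaio_io_set_handle_data_32(ioh, (uint32 *) &buffer, 1);

    /*
     * Define the IO: the handle is passed down through smgr.c and md.c and is
     * finally defined in fd.c via one of the pgaio_io_start_*() functions.
     */
    smgrstartreadv(ioh, smgr, forknum, blocknum, &page, 1);

    /* ... ideally do other useful work while the IO is in flight ... */

    /* wait for the IO and, on failure, raise an error in this backend */
    pgaio_wref_wait(&iow);
    if (ioret.result.status == PGAIO_RS_ERROR)
        pgaio_result_report(ioret.result, &ioret.target_data, ERROR);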
Using AIO in a naive way can easily lead to deadlocks in an environment where the source/target of AIO are shared resources, like pages in postgres' shared_buffers.
Consider one backend performing readahead on a table, initiating IO for a number of buffers ahead of the current "scan position". If that backend then performs some operation that blocks, or even just is slow, the IO completion for the asynchronously initiated read may not be processed.
This AIO implementation solves this problem by requiring that AIO methods either allow AIO completions to be processed by any backend in the system (e.g. io_uring), or guarantee that AIO processing will happen even when the issuing backend is blocked (e.g. worker mode, which offloads completion processing to the AIO workers).
Using AIO for WAL writes can reduce the overhead of WAL logging substantially:
The need to be able to execute IO in critical sections has substantial design implications for the AIO subsystem, mainly because completing IOs (see the prior section) needs to be possible within a critical section, even if the to-be-completed IO itself was not issued in one. Consider, e.g., a backend that first starts a number of writes from shared buffers and then starts to flush the WAL. Because only a limited amount of IO can be in progress at the same time, initiating IO to flush the WAL may require first completing IO that was started earlier.
Because postgres uses a process model, and because AIOs need to be completable by any backend, much of the state of the AIO subsystem needs to live in shared memory.
In an EXEC_BACKEND build, a backend's executable code and other process-local state are not necessarily mapped to the same addresses in each process due to ASLR. This means that shared memory cannot contain pointers to callbacks.
To achieve portability and performance, multiple methods of performing AIO are implemented and others are likely worth adding in the future.
io_method=sync does not actually perform AIO, but allows the AIO API to be used while performing synchronous IO. This can be useful for debugging. The code for the synchronous mode is also used as a fallback, e.g. the worker mode uses it to execute IO that cannot be executed by workers.
io_method=worker is available on every platform postgres runs on, and implements asynchronous IO - from the view of the issuing process - by dispatching the IO to one of several worker processes performing the IO in a synchronous manner.
io_method=io_uring is available on Linux 5.1+. In contrast to worker mode, it dispatches all IO from within the process, lowering context switch rate / latency.
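The method is selected with the io_method server setting; a sketch of the relevant postgresql.conf lines (GUC names as in current master; check the documentation for the defaults in your version):

    # postgresql.conf (sketch)
    io_method = worker        # one of: sync, worker, io_uring
    io_workers = 3            # number of IO worker processes, used by io_method=worker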
The central piece of postgres' AIO API is the AIO Handle. To execute an IO one first has to acquire an IO handle (pgaio_io_acquire()) and then "define" it, i.e. associate an IO operation with the handle.
Often AIO handles are acquired on a higher level and then passed to a lower level to be fully defined. E.g., for IO to/from shared buffers, bufmgr.c routines acquire the handle, which is then passed through smgr.c and md.c, to finally be fully defined in fd.c.
The functions used at the lowest level to define the operation are pgaio_io_start_*().
Because acquisition of an IO handle must always succeed and the number of AIO Handles has to be limited, AIO handles are reused as soon as the IO they were used for has completed. Obviously code needs to be able to react to IO completion: state can be updated using AIO Completion callbacks, and the issuing backend can provide a backend-local variable to receive the result of the IO, as described in AIO Result. An IO can be waited for, by both the issuing and any other backend, using AIO Wait References.
Because an AIO Handle is not executable just after being acquired, and because pgaio_io_acquire() needs to always succeed (absent a PANIC), a backend may only hold a single AIO Handle that has been acquired (i.e. returned by pgaio_io_acquire()) but not yet defined (by, potentially indirectly, causing pgaio_io_start_*() to be called). Otherwise a backend could trivially self-deadlock by using up all AIO Handles without the ability to wait for some of the IOs to complete.
If it turns out that an AIO Handle is not needed, e.g., because the handle was acquired before holding a contended lock, it can be released without being defined using pgaio_io_release().
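A sketch of that pattern, with the lock and the recheck purely illustrative:

    PgAioReturn ioret;      /* backend-local result, as elsewhere in this document */
    bool        io_not_needed = false;

    /* acquire the handle before taking a possibly contended lock */
    PgAioHandle *ioh = pgaio_io_acquire(CurrentResourceOwner, &ioret);

    LWLockAcquire(some_lock, LW_EXCLUSIVE);    /* hypothetical contended lock */

    /* recheck under the lock, e.g. another backend may already have done the read */
    io_not_needed = recheck_whether_io_is_still_needed();   /* hypothetical */

    if (io_not_needed)
    {
        /* the handle was never defined, so it can simply be given back */
        pgaio_io_release(ioh);
    }
    else
    {
        /* define the IO here, i.e. (indirectly) cause pgaio_io_start_*() to be called */
    }

    LWLockRelease(some_lock);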
Commonly several layers need to react to completion of an IO. E.g. for a read, md.c needs to check whether the IO outright failed or was shorter than needed, and bufmgr.c needs to verify that the page looks valid and update the BufferDesc to reflect the buffer's new state.
The fact that several layers / subsystems need to react to IO completion poses a few challenges:
The "solution" to this is the ability to associate multiple completion callbacks with a handle. E.g. bufmgr.c can have a callback to update the BufferDesc state and to verify the page and md.c can have another callback to check if the IO operation was successful.
As mentioned, shared memory currently cannot contain function pointers. Because of that, completion callbacks are not directly identified by function pointers but by IDs (PgAioHandleCallbackID). A substantial added benefit is that this allows callbacks to be identified by a much smaller amount of memory (a single byte currently).
In addition to completion, AIO callbacks are also called to "stage" an IO. This is used, e.g., to increase buffer reference counts to account for the AIO subsystem referencing the buffer, which is required to handle the case where the issuing backend errors out and releases its own pins while the IO is still ongoing.
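For the shared-buffer read example above, this could look roughly like the following while the IO is being defined (the callback IDs are assumptions based on the layers named above; the final argument is a small amount of callback-specific data):

    /*
     * bufmgr.c: verify the page and update the BufferDesc on completion,
     * and (via its stage callback) account for the AIO subsystem's pin.
     */
    pgaio_io_register_callbacks(ioh, PGAIO_HCB_SHARED_BUFFER_READV, 0);

    /* md.c (reached via smgrstartreadv()): check for failed or short reads */
    pgaio_io_register_callbacks(ioh, PGAIO_HCB_MD_READV, 0);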
As explained earlier, IO completions need to be safe to execute in critical sections. To allow the backend that issued the IO to error out in case of failure, AIO Result can be used.
In addition to the completion callbacks described above, each AIO Handle has exactly one "target". Each target has some space inside an AIO Handle with information specific to the target, and can provide callbacks to reopen the underlying file (required for worker mode) and to describe the IO operation (used for debug logging and error messages).
I.e., if two different uses of AIO can describe the identity of the file being operated on the same way, it likely makes sense to use the same target. E.g. different smgr implementations can describe IO with RelFileLocator, ForkNumber and BlockNumber and can thus share a target. In contrast, IO for a WAL file would be described with TimeLineID and XLogRecPtr and it would not make sense to use the same target for smgr and WAL.
As described above, AIO Handles can be reused immediately after completion and therefore cannot themselves be used to wait for completion of the IO. Waiting is enabled using AIO wait references, which do not just identify an AIO Handle but also include the handle's "generation".
A reference to an AIO Handle can be acquired using pgaio_io_get_wref() and then waited upon using pgaio_wref_wait().
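Sketch (pgaio_wref_check_done() is assumed to be the non-blocking check):

    PgAioWaitRef iow;

    /* in the issuing backend, while the handle is still valid */
    pgaio_io_get_wref(ioh, &iow);

    /* later, possibly in a different backend that can see the wait reference */
    if (!pgaio_wref_check_done(&iow))
        pgaio_wref_wait(&iow);  /* blocks until that generation of the handle completed */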
As AIO completion callbacks are executed in critical sections and may be executed by any backend, completion callbacks cannot be used to, e.g., make the query that triggered an IO ERROR out.
To allow reacting to failing IOs, the issuing backend can pass a pointer to a PgAioReturn in backend-local memory. Before an AIO Handle is reused, the PgAioReturn is filled with information about the IO. This includes information about whether the IO was successful (as a value of PgAioResultStatus) and enough information to raise an error in case of a failure (via pgaio_result_report(), with the error details encoded in PgAioResult).
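Sketch of the issuing backend consuming that result after waiting for the IO (PGAIO_RS_WARNING is an assumed status value for non-fatal problems):

    /* outside a critical section, in the backend that passed &ioret earlier */
    if (ioret.result.status == PGAIO_RS_ERROR)
        pgaio_result_report(ioret.result, &ioret.target_data, ERROR);
    else if (ioret.result.status == PGAIO_RS_WARNING)
        pgaio_result_report(ioret.result, &ioret.target_data, WARNING);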
It would be very convenient to have shared completion callbacks encode the details of errors as an ErrorData that could be raised at a later time. Unfortunately doing so would require allocating memory. While elog.c can guarantee (well, kinda) that logging a message will not run out of memory, that only works because a very limited number of messages are in the process of being logged at any time. With AIO, a large number of concurrently issued AIOs might fail.
To avoid the need for preallocating a potentially large amount of memory (in shared memory no less!), completion callbacks instead have to encode errors in a more compact format that can be converted into an error message.
Using the low-level AIO API all over the tree would introduce too much complexity. Most uses of AIO should be done via reusable, higher-level helpers.
A common and very beneficial use of AIO are reads where a substantial number of to-be-read locations are known ahead of time. E.g., for a sequential scan the set of blocks that need to be read can be determined solely by knowing the current position and checking the buffer mapping table.
The Read Stream interface makes it comparatively easy to use AIO for such use cases.
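A sketch of reading a contiguous range of blocks through a read stream, using the block-range callback shipped with read_stream.c (names taken from read_stream.h; verify the exact signatures there):

    /* Sketch: stream blocks [0, nblocks) of a relation's main fork. */
    BlockRangeReadStreamPrivate priv;
    ReadStream *stream;
    Buffer      buf;

    priv.current_blocknum = 0;
    priv.last_exclusive = nblocks;

    stream = read_stream_begin_relation(READ_STREAM_FULL,
                                        NULL,       /* no buffer access strategy */
                                        rel,
                                        MAIN_FORKNUM,
                                        block_range_read_stream_cb,
                                        &priv,
                                        0);         /* no per-buffer data */

    /* the stream issues AIO for blocks ahead of the one currently returned */
    while ((buf = read_stream_next_buffer(stream, NULL)) != InvalidBuffer)
    {
        /* ... process the buffer ... */
        ReleaseBuffer(buf);
    }

    read_stream_end(stream);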