The XenServer team made a number of significant performance and scalability improvements in the XenServer 7.0 release. This is the first in a series of articles that will describe the principal improvements.
Our first topic is storage I/O performance. A performance improvement has been achieved through the adoption of a polling technique in tapdisk3, the component of XenServer responsible for handling I/O on virtual storage devices. Measurements of one particular use-case demonstrate a 50% increase in performance from 15,000 IOPS to 22,500 IOPS.
What is polling?
Normally, tapdisk3 operates in an event-driven manner. Here is a summary of the first few steps required when a VM wants to do some storage I/O:
- The VM's paravirtualised storage driver (called blkfront in Linux or xenvbd in Windows) puts a request in the ring it shares with dom0.
- It sends tapdisk3 a notification via an event-channel.
- This notification is delivered to domain 0 by Xen as an interrupt. If Domain 0 is not running, it will need to be scheduled in order to receive the interrupt.
- When it receives the interrupt, the domain 0 kernel schedules the corresponding backend process to run, tapdisk3.
- When tapdisk3 runs, it looks at the contents of the shared-memory ring.
- Finally, tapdisk3 finds the request which can then be transformed into a physical I/O request.
Polling is an alternative to this approach in which tapdisk3 repeatedly looks in the ring, speculatively checking for new requests. This means that steps 2–4 can be skipped: there's no need to wait for an event-channel interrupt, nor to wait for the tapdisk3 process to be scheduled: it's already running. This enables tapdisk3 to pick up the request much more promptly as it avoids these delays inherent to the event-driven approach.
The following diagram contrasts the timelines of these alternative approaches, showing how polling reduces the time until the request is picked up by the backend.
How does polling help improve storage I/O performance?
Polling is in established technique for reducing latency in event-driven systems. (One example of where it is used elsewhere to mitigate interrupt latency is in Linux networking drivers that use NAPI.)
Servicing I/O requests promptly is an essential part of optimising I/O performance. As I discussed in my talk at the 2015 Xen Project Developer Summit, reducing latency is the key to maintaining a low virtualisation overhead. As physical I/O devices get faster and faster, any latency incurred in the virtualisation layer becomes increasingly noticeable and translates into lower throughputs.
An I/O request from a VM has a long journey to physical storage and back again. Polling in tapdisk3 optimises one section of that journey.
Isn't polling really CPU intensive, and thus harmful?
Yes it is, so we need to handle it carefully. If left unchecked, polling could easily eat huge quantities of domain 0 CPU time, starving other processes and causing overall system performance to drop.
We have chosen to do two things to avoid consuming too much CPU time:
- Poll the ring only when there's a good chance of a request appearing. Of course, guest behaviour is totally unpredictable in general, but there are some principles that can increase our chances of polling at the right time. For example, one assumption we adopt is that it's worth polling for a short time after the guest issues an I/O request. It has issued one request, so there's a good chance that it will issue another soon after. And if this guess turns out to be correct, keep on polling for a bit longer in case any more turn up. If there are none for a while, stop polling and temporarily fall back to the event-based approach.
- Don't poll if domain 0 is already very busy. Since polling is expensive in terms of CPU cycles, we only enter the polling loop if we are sure that it won't starve other processes of CPU time they may need.
How much faster does it go?
The benefit you will get from polling depends primarily on the latency of your physical storage device. If you are using an old mechanical hard-drive or an NFS share on a server on the other side of the planet, shaving a few microseconds off the journey through the virtualisation layer isn't going to make much of a difference. But on modern devices and low-latency network-based storage, polling can make a sizeable difference. This is especially true for smaller request sizes since these are most latency-sensitive.
For example, the following graph shows an improvement of 50% in single-threaded sequential read I/O for small request sizes – from 15,000 IOPS to 22,500 IOPS. These measurements were made with iometer in a 32-bit Windows 7 SP1 VM on a Dell PowerEdge R730xd with an Intel P3700 NVMe drive.
How was polling implemented?
The code to add polling to tapdisk3 can be found in the following set of commits: https://github.com/xapi-project/blktap/pull/179/commits.