Virtualization Blog

Discussions and observations on virtualization.

XenServer 7.0 performance improvements part 4: Aggregate I/O throughput improvements

The XenServer team has made a number of significant performance and scalability improvements in the XenServer 7.0 release. This is the fourth in a series of articles that will describe the principal improvements. For the previous ones, see:

  1. http://xenserver.org/blog/entry/dundee-tapdisk3-polling.html
  2. http://xenserver.org/blog/entry/dundee-networking-multi-queue.html
  3. http://xenserver.org/blog/entry/dundee-parallel-vbd-operations.html

In this article we return to the theme of I/O throughput. Specifically, we focus on improvements to the total throughput achieved by a number of VMs performing I/O concurrently. Measurements show that XenServer 7.0 achieves aggregate network throughput more than three times that of XenServer 6.5, and also improves aggregate storage throughput.

What limits aggregate I/O throughput?

When a number of VMs are performing I/O concurrently, the total throughput that can be achieved is often limited by dom0 becoming fully busy, meaning it cannot do any additional work per unit time. The I/O backends (netback for network I/O and tapdisk3 for storage I/O) together consume 100% of available dom0 CPU time.
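
One way to observe this bottleneck is to watch CPU consumption from inside dom0 while the VMs are driving I/O. A minimal sketch, run in dom0 on a loaded host (exact process and thread names vary between releases):

    # xentop reports per-domain CPU usage; dom0 pinned near 100% x (number of
    # dom0 vCPUs) suggests the I/O backends are the limiting factor.
    xentop

    # top shows which dom0 processes and threads are consuming that time --
    # look for the netback kernel threads (vif*) and the tapdisk3 processes.
    top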

How can this limit be overcome?

Whenever there is a CPU bottleneck like this, there are two possible approaches to improving the performance:

  1. Reduce the amount of CPU time required to perform I/O.
  2. Increase the processing capacity of dom0, by giving it more vCPUs.

Surely approach 2 is easy and will give a quick win...? Intuitively, we might expect the total throughput to increase proportionally with the number of dom0 vCPUs.

Unfortunately it's not as straightforward as that. The following graph shows what happens to the aggregate network throughput on XenServer 6.5 when the number of dom0 vCPUs is artificially increased. (In this case, we are measuring the total network throughput of 40 VMs communicating amongst themselves on a single Dell R730 host.)

[Graph: aggregate network throughput of the 40 VMs on XenServer 6.5 as the number of dom0 vCPUs increases]

Counter-intuitively, the aggregate throughput decreases as we add more processing power to dom0! (This explains why the default was at most 8 vCPUs in XenServer 6.5.)

So is there no hope for giving dom0 more processing power...?

The explanation for the degradation in performance is that certain operations run more slowly when there are more vCPUs present. In order to make dom0 work better with more vCPUs, we needed to understand what those operations are, and whether they can be made to scale better.

Three such areas of poor scalability were discovered deep in the innards of Xen by Malcolm Crossley and David Vrabel, and improvements were made for each:

  1. Maptrack lock contention – improved by http://xenbits.xen.org/gitweb/?p=xen.git;a=commit;h=dff515dfeac4c1c13422a128c558ac21ddc6c8db
  2. Grant-table lock contention – improved by http://xenbits.xen.org/gitweb/?p=xen.git;a=commitdiff;h=b4650e9a96d78b87ccf7deb4f74733ccfcc64db5
  3. TLB flush on grant-unmap – improved by https://github.com/xenserver/xen-4.6.pg/blob/master/master/avoid-gnt-unmap-tlb-flush-if-not-accessed.patch

The result of improving these areas is dramatic – see the green line in the following graph:

[Graph: aggregate network throughput against number of dom0 vCPUs, before and after the Xen scalability improvements; the improved (green) line scales with additional vCPUs]

Now, throughput scales very well as the number of vCPUs increases. This means that, for the first time, it is now beneficial to allocate many vCPUs to dom0 – so that when there is demand, dom0 can deliver. Hence we have given XenServer 7.0 a higher default number of dom0 vCPUs.

How many vCPUs are now allocated to dom0 by default?

Most hosts will now get 16 vCPUs by default, but the exact number depends on the number of CPU cores on the host. The following graph summarises how the default number of dom0 vCPUs is calculated from the number of CPU cores on various current and historic XenServer releases:

[Graph: default number of dom0 vCPUs as a function of host CPU core count, across current and historic XenServer releases]
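
If you want to confirm how many vCPUs dom0 has on a particular host, you can check from inside dom0 itself; two equivalent ways (the xl toolstack is present on recent releases):

    # Count the CPUs visible to the dom0 kernel
    grep -c ^processor /proc/cpuinfo

    # Or ask the hypervisor via xl (one row per dom0 vCPU)
    xl vcpu-list Domain-0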

Summary of improvements

I will conclude with some aggregate I/O measurements comparing XenServer 6.5 and 7.0 under default settings (no dom0 configuration changes) on a Dell R730xd.

  1. Aggregate network throughput – twenty pairs of 32-bit Debian 6.0 VMs sending and receiving traffic generated with iperf 2.0.5.
    [Graph: aggregate intra-host network throughput on the Dell R730xd, XenServer 6.5 vs 7.0]
  2. Aggregate storage IOPS – twenty 32-bit Windows 7 SP1 VMs each doing single-threaded, serial, sequential 4KB reads with fio to a virtual disk on an Intel P3700 NVMe drive. (Rough sketches of both workloads' command lines follow this list.)
    [Graph: aggregate storage IOPS on the Intel P3700 NVMe drive, XenServer 6.5 vs 7.0]
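
The following approximates the two workloads for readers who want to run something comparable. These are not the exact invocations used for the measurements above; the address and file name are placeholders, and the fio options are shown in their generic form even though the storage test ran inside Windows 7 guests.

    # Network: iperf 2.0.5, one TCP stream per VM pair
    iperf -s                          # in each receiving VM
    iperf -c <receiver-ip> -t 60      # in each sending VM

    # Storage: single-threaded, serial, sequential 4KB reads with fio
    fio --name=seqread --rw=read --bs=4k --iodepth=1 --numjobs=1 \
        --direct=1 --size=4g --filename=<path-to-test-file>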

XenServer 7.0 performance improvements part 2: Parallelised networking datapath

The XenServer team has made a number of significant performance and scalability improvements in the XenServer 7.0 release. This is the second in a series of articles that will describe the principal improvements. For the first, see http://xenserver.org/blog/entry/dundee-tapdisk3-polling.html.

The topic of this post is network I/O performance. XenServer 7.0 achieves significant performance improvements through support for multi-queue paravirtualised network interfaces. Measurements of one particular use-case show an improvement from 17 Gb/s to 41 Gb/s.

A bit of background about the PV network datapath

In order to perform network-based communications, a VM employs a paravirtualised network driver (netfront in Linux or xennet in Windows) in conjunction with netback in the control domain, dom0.

[Diagram: the single-queue PV network datapath – netfront in the guest connected to netback in dom0]

To the guest OS, the netfront driver feels just like a physical network device. When a guest wants to transmit data:

  • Netfront puts references to the page(s) containing that data into a "Transmit" ring buffer it shares with dom0.
  • Netback in dom0 picks up these references and maps the actual data from the guest's memory so it appears in dom0's address space.
  • Netback then hands the packet to the dom0 kernel, which uses normal routing rules to determine that it should go to an Open vSwitch device and then on to either a physical interface or the netback device for another guest on the same host.

When dom0 has a network packet it needs to send to the guest, the reverse procedure applies, using a separate "Receive" ring.
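
These rings are advertised through xenstore, so a VIF's setup can be inspected from dom0. A hedged example – the domain ID 5 and device number 0 are placeholders, and the exact key names vary by version (a single-queue frontend typically exposes keys such as tx-ring-ref and rx-ring-ref directly, while a multi-queue frontend, described below, nests them under per-queue sub-keys):

    # Run in dom0: show the xenstore entries for the first VIF of domain 5
    xenstore-ls /local/domain/5/device/vif/0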

Amongst the factors that can limit network throughput are:

  1. the ring becoming full, causing netfront to have to wait before more data can be sent, and
  2. the netback process fully consuming an entire dom0 vCPU, meaning it cannot go any faster.

Multi-queue alleviates both of these potential bottlenecks.

What is multi-queue?

Rather than having a single Transmit and Receive ring per virtual interface (VIF), multi-queue means having multiple Transmit and Receive rings per VIF, and one netback thread for each:

[Diagram: the multi-queue PV network datapath – multiple Transmit and Receive rings per VIF, each pair served by its own netback thread]

Now, each TCP stream has the opportunity to be driven through a different Transmit or Receive ring. The particular ring chosen for each stream is determined by a hash of the packet's header fields (MAC, IP and port number of both the source and destination).

Crucially, this means that separate netback threads can work on each TCP stream in parallel. So where we were previously limited by the capacity of a single dom0 vCPU to process packets, now we can exploit several dom0 vCPUs. And where the capacity of a single Transmit ring limited the total amount of data in-flight, the system can now support a larger amount.
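
This parallelism is visible from dom0: each queue is served by its own netback kernel thread(s). A rough illustration – the exact thread names depend on the dom0 kernel version, but they typically include the VIF name (vif<domid>.<devid>) and a queue index:

    # Run in dom0: list the netback kernel threads; with multi-queue you would
    # expect one set of threads per queue, named along the lines of vif1.0-q0,
    # vif1.0-q1, ...
    ps -e | grep vif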

Which use-cases can take advantage of multi-queue?

Anything involving multiple TCP streams. For example, any kind of server VM that handles connections from more than one client at the same time.

Which guests can use multi-queue?

Since frontend changes are needed, the version of the guest's netfront driver matters. Although dom0 is geared up to support multi-queue, guests with old versions of netfront that lack multi-queue support are limited to single Transmit and Receive rings.

  • For Windows, the XenServer 7.0 xennet PV driver supports multi-queue.
  • For Linux, multi-queue support was added in Linux 3.16. This means that Debian 8.0 (Jessie) and Ubuntu 14.10 (or later) support multi-queue with their stock kernels. Over time, more and more distributions will pick up the relevant netfront changes. A quick way to check a given guest is shown below.
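
A minimal check inside a Linux guest, assuming the interface is named eth0 (adjust as needed):

    # Multi-queue netfront needs kernel 3.16 or later
    uname -r

    # When multi-queue is active, the VIF exposes more than one tx-N/rx-N
    # directory pair in sysfs
    ls /sys/class/net/eth0/queues/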

How does the throughput scale with an increasing number of rings?

The following graph shows some measurements I made using iperf 2.0.5 between a pair of Debian 8.0 VMs both on a Dell R730xd host. The VMs each had 8 vCPUs, and iperf employed 8 threads each generating a separate TCP stream. The graph reports the sum of the 8 threads' throughputs, varying the number of queues configured on the guests' VIFs.

[Graph: aggregate throughput of the 8 iperf streams against the number of queues per VIF]

We can make several observations from this graph:

  • The throughput scales well up to four queues, with four queues achieving more than double the throughput possible with a single queue.
  • The blip at five queues is probably a measurement artefact: the hashing algorithm happened not to spread the eight TCP streams evenly across the five queues. With different TCP port numbers, this might not have happened.
  • While the throughput generally increases with an increasing number of queues, the throughput is not proportional to the number of rings. Ideally, the throughput would double when you double the number of rings. This doesn't happen in practice because the processing is not perfectly parallelisable: netfront needs to demultiplex the streams onto the rings, and there are some overheads due to locking and synchronisation between queues.

This graph also highlights the substantial improvement over XenServer 6.5, in which only one queue per VIF was supported. In this use-case of eight TCP streams, XenServer 7.0 achieves 41 Gb/s out-of-the-box where XenServer 6.5 could manage only 17 Gb/s – an improvement of 140%.
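
For reference, the client side of such an eight-stream measurement with iperf 2.0.5 looks roughly like this (the server VM's IP address is a placeholder):

    # In the receiving VM
    iperf -s

    # In the sending VM: 8 parallel TCP streams (-P) for 60 seconds
    iperf -c <server-vm-ip> -P 8 -t 60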

How many rings do I get by default?

By default the number of queues is limited by (a) the number of vCPUs the guest has and (b) the number of vCPUs dom0 has. A guest with four vCPUs will get four queues per VIF.

This is a sensible default, but if you want to manually override it, you can do so in the guest. In a Linux guest, add the parameter xen_netfront.max_queues=n, for some n, to the kernel command-line.
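
As a concrete illustration on a Debian-style guest – the value 4 is just an example (pick an n no larger than the guest's vCPU count), and your bootloader configuration may differ:

    # Add the parameter to the kernel command line, e.g. in /etc/default/grub:
    #   GRUB_CMDLINE_LINUX="... xen_netfront.max_queues=4"
    # then regenerate the bootloader configuration and reboot the guest:
    update-grub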

