XenServer 7.0 performance improvements part 2: Parallelised networking datapath

The XenServer team has made a number of significant performance and scalability improvements in the XenServer 7.0 release. This is the second in a series of articles that will describe the principal improvements. For the first, see http://xenserver.org/blog/entry/dundee-tapdisk3-polling.html.

The topic of this post is network I/O performance. XenServer 7.0 achieves significant performance improvements through its support for multi-queue paravirtualised network interfaces. Measurements of one particular use-case show an improvement from 17 Gb/s to 41 Gb/s.

A bit of background about the PV network datapath

In order to perform network-based communications, a VM employs a paravirtualised network driver (netfront in Linux or xennet in Windows) in conjunction with netback in the control domain, dom0.

[Figure: the single-queue PV network datapath, with netfront in the guest connected to netback in dom0]

To the guest OS, the netfront driver feels just like a physical network device. When a guest wants to transmit data:

  • Netfront puts references to the page(s) containing that data into a "Transmit" ring buffer it shares with dom0.
  • Netback in dom0 picks up these references and maps the actual data from the guest's memory so it appears in dom0's address space.
  • Netback then hands the packet to the dom0 kernel, which uses normal routing rules to determine that it should go to an Open vSwitch device and then on to either a physical interface or the netback device for another guest on the same host.

When dom0 has a network packet it needs to send to the guest, the reverse procedure applies, using a separate "Receive" ring.
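
To make this more concrete, below is a deliberately simplified C sketch of the ring idea. It is not the real Xen netif ring protocol (which is defined in the Xen public headers and involves grant references, event channels and careful memory barriers); it only shows how a frontend publishes requests into a fixed-size ring shared with a backend, and why a full ring forces the frontend to wait.

    /* Simplified illustration only; not the real Xen netif ring protocol. */
    #include <stdint.h>

    #define RING_SIZE 256u                 /* entries; a power of two */

    /* One request: a grant reference to the guest page holding the data,
     * plus the offset and length of the data within that page. */
    struct tx_request {
        uint32_t gref;
        uint16_t offset;
        uint16_t size;
    };

    /* The ring lives in a page shared between the guest and dom0. */
    struct tx_ring {
        volatile uint32_t req_prod;        /* advanced by netfront (producer) */
        volatile uint32_t req_cons;        /* advanced by netback (consumer)  */
        struct tx_request ring[RING_SIZE];
    };

    /* Frontend side: returns 0 when the ring is full, i.e. the guest must
     * wait for dom0 to catch up before it can send any more data. */
    int tx_enqueue(struct tx_ring *r, struct tx_request req)
    {
        if (r->req_prod - r->req_cons == RING_SIZE)
            return 0;
        r->ring[r->req_prod % RING_SIZE] = req;
        __sync_synchronize();              /* publish the entry before the index */
        r->req_prod++;
        return 1;                          /* caller then notifies dom0 via an event channel */
    }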

Amongst the factors that can limit network throughput are:

  1. the ring becoming full, causing netfront to have to wait before more data can be sent, and
  2. the netback process fully consuming an entire dom0 vCPU, meaning it cannot go any faster.

Multi-queue alleviates both of these potential bottlenecks.

What is multi-queue?

Rather than having a single Transmit and Receive ring per virtual interface (VIF), multi-queue means having multiple Transmit and Receive rings per VIF, and one netback thread for each:

[Figure: the multi-queue PV network datapath, with multiple ring pairs and netback threads per VIF]

Now, each TCP stream has the opportunity to be driven through a different Transmit or Receive ring. The particular ring chosen for each stream is determined by a flow hash over the packet headers (the source and destination addresses and port numbers).
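
As a hypothetical illustration of the principle (the real selection in a Linux guest is made by netfront from the kernel's flow hash, not by this toy function), the following standalone C sketch maps a 4-tuple to a queue index. Every packet of a given stream lands on the same queue, while different streams tend to spread across queues; though, as noted under the graph below, an unlucky set of streams can still hash unevenly.

    /* Toy illustration of hash-based queue selection; the real choice is
     * made by netfront using the kernel's flow hash of the packet. */
    #include <stdint.h>
    #include <stdio.h>

    struct flow {                          /* hypothetical 4-tuple */
        uint32_t src_ip, dst_ip;
        uint16_t src_port, dst_port;
    };

    static unsigned pick_queue(struct flow f, unsigned num_queues)
    {
        uint32_t h = f.src_ip ^ f.dst_ip ^ ((uint32_t)f.src_port << 16) ^ f.dst_port;
        h ^= h >> 16;                      /* cheap integer mixing */
        h *= 0x45d9f3bu;
        h ^= h >> 16;
        return h % num_queues;
    }

    int main(void)
    {
        /* Eight streams that differ only in client port (as "iperf -P 8"
         * would create), spread over four queues. */
        for (uint16_t port = 50000; port < 50008; port++) {
            struct flow f = { 0x0a000001u, 0x0a000002u, port, 5001 };
            printf("client port %u -> queue %u\n", (unsigned)port, pick_queue(f, 4));
        }
        return 0;
    }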

Crucially, this means that separate netback threads can work on each TCP stream in parallel. So where we were previously limited by the capacity of a single dom0 vCPU to process packets, now we can exploit several dom0 vCPUs. And where the capacity of a single Transmit ring limited the total amount of data in-flight, the system can now support a larger amount.

Which use-cases can take advantage of multi-queue?

Anything involving multiple TCP streams. For example, any kind of server VM that handles connections from more than one client at the same time.

Which guests can use multi-queue?

Since frontend changes are needed, the version of the guest's netfront driver matters. Although dom0 is geared up to support multi-queue, guests with old versions of netfront that lack multi-queue support are limited to a single Transmit and Receive ring per VIF.

  • For Windows, the XenServer 7.0 xennet PV driver supports multi-queue.
  • For Linux, multi-queue support was added in Linux 3.16. This means that Debian Jessie 8.0 and Ubuntu 14.10 (or later) support multi-queue with their stock kernels. Over time, more and more distributions will pick up the relevant netfront changes.
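
A quick way to confirm what a Linux guest is actually using (assuming a standard sysfs layout and an interface named eth0) is to count the queue directories the kernel exposes:

    # inside the guest: one rx-N/tx-N pair per queue, so with four queues
    # this lists rx-0..rx-3 and tx-0..tx-3
    ls /sys/class/net/eth0/queues/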

How does the throughput scale with an increasing number of rings?

The following graph shows some measurements I made using iperf 2.0.5 between a pair of Debian 8.0 VMs both on a Dell R730xd host. The VMs each had 8 vCPUs, and iperf employed 8 threads each generating a separate TCP stream. The graph reports the sum of the 8 threads' throughputs, varying the number of queues configured on the guests' VIFs.
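
For reference, reproducing this kind of measurement needs nothing exotic; with iperf 2 the server VM runs a plain listener and the client VM opens eight parallel TCP streams, roughly as follows (the address is a placeholder):

    # on the server VM
    iperf -s

    # on the client VM: 8 parallel TCP streams, report in Gbits/s
    iperf -c <server-VM-IP> -P 8 -f g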

[Figure: aggregate iperf throughput against the number of queues configured per VIF]

We can make several observations from this graph:

  • The throughput scales well up to four queues, with four queues achieving more than double the throughput possible with a single queue.
  • The blip at five queues probably arose when the hashing algorithm failed to spread the eight TCP streams evenly across the queues, and is thus a measurement artefact. With different TCP port numbers, this may not have happened.
  • While the throughput generally increases with an increasing number of queues, the throughput is not proportional to the number of rings. Ideally, the throughput would double when you double the number of rings. This doesn't happen in practice because the processing is not perfectly parallelisable: netfront needs to demultiplex the streams onto the rings, and there are some overheads due to locking and synchronisation between queues.

This graph also highlights the substantial improvement over XenServer 6.5, in which only one queue per VIF was supported. In this use-case of eight TCP streams, XenServer 7.0 achieves 41 Gb/s out-of-the-box where XenServer 6.5 could manage only 17 Gb/s – an improvement of 140%.

How many rings do I get by default?

By default the number of queues is limited by (a) the number of vCPUs the guest has and (b) the number of vCPUs dom0 has. For example, a guest with four vCPUs will get four queues per VIF (assuming dom0 also has at least four vCPUs).

This is a sensible default, but if you want to manually override it, you can do so in the guest. In a Linux guest, add the parameter xen_netfront.max_queues=n, for some n, to the kernel command-line.
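
For example, on a guest that boots via GRUB, the override can be made persistent roughly as follows (the value 4 is just an illustration):

    # /etc/default/grub inside the guest
    GRUB_CMDLINE_LINUX="... xen_netfront.max_queues=4"

    # then regenerate the GRUB configuration and reboot
    # (on Debian or Ubuntu: update-grub)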

Comments (5)

Tobias Kreidl on Tuesday, 21 June 2016 04:54

Hi, Jonathan:

Thanks for the insightful pair of articles. It's interesting how what appear to be nuances can make large performance differences in the end.

I wondered about optimization of the queue polling, as well. It'd be interesting to build up a DB of hit/miss polling events to use as a self-learning option to optimize how to guess whether or not to poll. Seems this could be one of these self-learning sorts of things that could be set up and allowed to build up on its own for each configuration and hence adapt to each configuration on its own.

Another thought that came to mind was to dedicate a single VCPU for just polling, though that may be wasteful unless you have eight or more VCPUs dedicated to dom0. Alternatively, maybe one VCPU could be reserved if and only when the I/O load got so high that it'd be worth dedicating it.

Parallel NFS (pNFS) could also help here with multiple queues getting processed concurrently; I suggested this be looked into a year ago and submitted it to the https://bugs.xenserver.org list as a suggestion.

Finally, has RDMA support ever been looked at for networking? I also suggested this on the bugs list as a feature request nearly a year ago, as it seems like another possible way to improve packet transfer efficiency.

Best regards,
-=Tobias

Jonathan Davies on Wednesday, 22 June 2016 08:40

Thanks for sharing your thoughts, Tobias.

You ask about queue polling. In fact, netback already does this! It achieves this by using the NAPI interface for Linux network drivers. This article gives a good introduction to how it works: http://www.makelinux.net/ldd3/chp-17-sect-8

Regarding pNFS and RDMA, these are indeed on the radar.

Sam McLeod on Tuesday, 21 June 2016 13:01

Interesting post Jonathan,

I've tried adjusting `xen_netfront.max_queues` amongst other similar values on both guests and hosts, and it's had very little effect for us.

We keep getting stuck between 20-35K random 4k read IOP/s in all my tests.
Each time, a single tapdisk process is maxing out a single core on the host-OS.

I'm also aware that even as of XenServer 7 there is still no working trim / discard passthrough; it's not even enabled in lvm.conf (seriously!) and I haven't looked into the tapdisk blkbk code to see what's wrong there.

For comparison, if I present the same iSCSI LUNs to a host running KVM, I can easily achieve over 150,000 random 4k read or write IOP/s, and if I present the iSCSI LUN directly to a server, I'm able to pull a full 450,000 random 4K read IOP/s from our storage arrays, which is what is expected.

Tobias Kreidl on Wednesday, 22 June 2016 07:15

There is no question that native Linux seems able to achieve much higher IOPS, and the lack of TRIM support that Sam mentions is a real impediment to better support for SSD storage. So while any improvement is always welcome, the theoretically achievable I/O rates still fall far behind what should be possible, even taking a fair amount of overhead into account. With some storage arrays, the buffer size can be raised to a larger number, which can help some, but Windows continues to depend on 4k buffers, so a lot hinges on being able to optimize for that buffer size.

Jonathan Davies on Wednesday, 22 June 2016 08:40

Sam, I think there's a bit of confusion between paravirtualised networking and paravirtualised storage. Sorry for not explaining more clearly! It sounds like your testing is doing storage I/O in the VM. This will be using the paravirtualised storage datapath. If you want to see the effect of increasing xen_netfront.max_queues, you need to be measuring network I/O throughput (e.g. using iperf, netperf, or similar). A VM still uses the storage datapath even if its disks are on an iSCSI SR.

Regardless, it does sound like your storage throughput is not as it should be. Please raise an XSO ticket on https://bugs.xenserver.org/ so we can investigate, giving details of the hardware, guest OS and how you make the measurements.

