Virtualization Blog

Discussions and observations on virtualization.

Project Karcygwins and Virtualised Storage Performance

Introduction

Over the last few years we have witnessed a revolution in storage solutions. Devices capable of millions of Input/Output Operations per Second (IOPS) are now available off the shelf. At the same time, Central Processing Unit (CPU) speeds have remained largely constant. This means that the overhead of processing storage requests now visibly affects the delivered throughput. In a world of virtualisation, where extra processing is required to securely pass requests from virtual machines (VMs) to storage domains, this overhead becomes even more evident.

This is the first time that such an overhead has become a concern. Until recently, the time spent within I/O devices was much longer than the time spent processing a request on the CPU. Kernel and driver developers were mainly worried about: (1) not blocking while waiting for devices to complete; and (2) sending requests optimised for specific device types. While the former was addressed by techniques such as Direct Memory Access (DMA), the latter was solved by elevator algorithms such as Completely Fair Queueing (CFQ).

Today, with the wide adoption of Solid-State Drives (SSDs) and the further development of low-latency storage solutions such as those built on top of PCI Express (PCIe) and Non-Volatile Memory (NVM) technologies, the main concern lies in not wasting any time when processing requests. Within the Xen Project community, development has already started to allow scalable storage traffic from several VMs. Linux kernel maintainers and storage manufacturers are working on similar issues. In the meantime, XenServer Engineering delivered Project Karcygwins, which provided a better understanding of the current bottlenecks, when they become evident and what can be done to overcome them.

Project Karcygwins

Karcygwins was originally intended as three separate projects (Karthes, Cygni and Twins). Because their topics were closely related, they were merged. All three projects were proposed to investigate factors believed to affect virtualised storage throughput.

Project Karthes aimed at assessing and mitigating the cost of mapping (and unmapping) memory between domains. When a VM issues an I/O request, the storage driver domain (dom0 in XenServer) requires access to certain memory areas in the guest domain. After the request is served, these areas need to be released (or unmapped). Unmapping is also an expensive operation due to the flushes required in different cache tables (notably the TLB). Karthes was proposed to investigate the cost of these operations, how they impact the delivered throughput and what can be done to mitigate them.
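To make these costs concrete, below is a simplified sketch of the per-request map/unmap cycle, written against Xen's public grant table hypercall interface. The wrapper names are hypothetical, batching and error handling are elided, and a real back end maps many segments per hypercall.

    #include <xen/interface/grant_table.h>  /* gnttab_* structs, GNTMAP_* */
    /* HYPERVISOR_grant_table_op() comes from the arch hypercall headers. */

    static int16_t map_guest_page(grant_ref_t ref, domid_t guest,
                                  void *vaddr, grant_handle_t *handle)
    {
        struct gnttab_map_grant_ref op = {
            .host_addr = (uint64_t)(unsigned long)vaddr, /* where to map it */
            .flags     = GNTMAP_host_map,   /* map for CPU access in dom0    */
            .ref       = ref,               /* grant ref taken from the ring */
            .dom       = guest,             /* domain that granted the page  */
        };

        HYPERVISOR_grant_table_op(GNTTABOP_map_grant_ref, &op, 1);
        *handle = op.handle;
        return op.status;                   /* GNTST_okay on success */
    }

    static void unmap_guest_page(void *vaddr, grant_handle_t handle)
    {
        struct gnttab_unmap_grant_ref op = {
            .host_addr = (uint64_t)(unsigned long)vaddr,
            .handle    = handle,
        };

        /* The unmap is the costly half: it triggers the flushes above. */
        HYPERVISOR_grant_table_op(GNTTABOP_unmap_grant_ref, &op, 1);
    }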

Project Cygni aimed at allowing requests larger than 44 KiB to be passed between a guest and a storage driver domain. Until recently, Xen's blkif protocol defined a fixed array of data segments per request. This array had room for 11 segments, each corresponding to one 4 KiB memory page (hence the 44 KiB). The protocol has since been updated to support indirect I/O operations, where the segments in a request point to pages that themselves contain further segment descriptors. This change allows for much larger requests at a small additional cost.
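The arithmetic behind both limits can be made explicit. The snippet below uses abridged definitions from the blkif protocol (the descriptor layout pads to 8 bytes); the indirect-page figure assumes a single 4 KiB page of descriptors.

    #include <stdint.h>

    typedef uint32_t grant_ref_t;

    /* Abridged segment descriptor from the blkif protocol: one grant
     * reference addressing up to one 4 KiB page (8 bytes with padding). */
    struct blkif_request_segment {
        grant_ref_t gref;        /* grant for the data page      */
        uint8_t     first_sect;  /* first 512-byte sector used   */
        uint8_t     last_sect;   /* last 512-byte sector used    */
    };

    #define XEN_PAGE_SIZE                   4096
    #define BLKIF_MAX_SEGMENTS_PER_REQUEST    11

    /* Direct requests: 11 segments x 4 KiB = 45056 bytes = 44 KiB. */
    enum { DIRECT_LIMIT = BLKIF_MAX_SEGMENTS_PER_REQUEST * XEN_PAGE_SIZE };

    /* Indirect requests fill granted pages with descriptors instead of
     * data: one 4 KiB page holds 4096 / 8 = 512 descriptors, so even a
     * single indirect page lifts the limit to 512 x 4 KiB = 2 MiB. */
    enum { SEGS_PER_INDIRECT_PAGE =
               XEN_PAGE_SIZE / sizeof(struct blkif_request_segment) };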

Project Twins aimed at evaluating the benefits of using two communication rings between dom0 and a VM. Currently, a single ring is used both for requests from the guest and for responses from the back end. With two rings, requests and responses can each be kept in their own ring. This strategy allows for more in-flight data and better use of caching.

Due to initial findings, the main focus of Karcygwins stayed on Project Karthes. The code allowing requests larger than 44 KiB, however, was consistently included in the measurements to address the goals proposed for Project Cygni. The idea of using split rings (Project Twins) was postponed and will be investigated at a later stage.

Visualising the Overhead

When a user installs a virtualisation platform, one of the first questions raised is: "what is the performance overhead?". When it comes to storage performance, a straightforward way to quantify this overhead is to measure I/O throughput on a bare-metal Linux installation and repeat the measurement (on the same hardware) from a Linux VM. This can promptly be done with a generic tool such as dd for a variety of block sizes. It is a simple test that does not cover concurrent workloads or greater I/O depths.
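For illustration, the program below performs this kind of measurement for one block size: sequential reads with the page cache bypassed, reporting MB/s. It is a minimal sketch (device path, block size and read count are placeholders), roughly equivalent to dd if=/dev/sdb of=/dev/null bs=16k count=8192 iflag=direct.

    #define _GNU_SOURCE              /* for O_DIRECT on Linux */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    #include <unistd.h>

    int main(void)
    {
        const char  *dev   = "/dev/sdb";   /* hypothetical test disk    */
        const size_t bs    = 16 * 1024;    /* one block size per run    */
        const size_t count = 8192;         /* 8192 x 16 KiB = 128 MiB   */
        struct timespec t0, t1;
        void *buf;

        int fd = open(dev, O_RDONLY | O_DIRECT);   /* bypass page cache */
        if (fd < 0 || posix_memalign(&buf, 4096, bs) != 0)
            return 1;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (size_t i = 0; i < count; i++)         /* sequential reads  */
            if (read(fd, buf, bs) != (ssize_t)bs)
                return 1;
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double secs = (t1.tv_sec - t0.tv_sec) +
                      (t1.tv_nsec - t0.tv_nsec) / 1e9;
        printf("%zu KiB blocks: %.1f MB/s\n", bs / 1024,
               bs * count / secs / 1e6);
        close(fd);
        free(buf);
        return 0;
    }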

[Figure: karcyg-fig0.png — read throughput by request size on a SATA WD5000AAKX: bare metal vs. VM data paths]

Looking at the plot above we can see that, on a 7.2k RPM SATA Western Digital Blue WD5000AAKX, read requests of 16 KiB are already enough to reach the maximum disk throughput of just over 120 MB/s (red line). When repeating the same test from a VM (green and blue lines), however, the throughput for small requests is much lower. The VM paths eventually reach the same 120 MB/s mark, but only with larger requests.

The green line represents the data path where blkback is plugged directly into the back-end storage. This is the kernel module that receives requests from the VM. While this is the fastest virtualisation path in the Xen world, it lacks certain software-level features such as thin provisioning, cloning, snapshotting and the ability to migrate guests without centralised storage.

The blue line represents the data path where requests go through tapdisk2. This is a user space application that runs in dom0 and implements the VHD format. It also has an NBD plugin for migrating guests without centralised storage, and it allows for thin provisioning, cloning and snapshotting of Virtual Disk Images (VDIs). Because requests traverse more components before reaching the disk, this path is understandably slower.

Using Solid-State Drives and Fast RAID Arrays

The shape of the plot above is not the same for all types of disks, though. Modern disk setups can achieve considerably higher data rates before their throughput flattens out.

[Figure: karcyg-fig1.png — dom0 read throughput by request size: Intel DC S3700 SSD RAID0 vs. Seagate SAS RAID0]

Looking at the plot above, we can see a similar test executed from dom0 on two different back-end types. The red line represents the throughput obtained from a RAID0 array of two SSDs (Intel DC S3700). The blue line represents the throughput obtained from a RAID0 array of two SAS disks (Seagate ST). Both arrays were measured independently and are connected to the host through a PERC H700 controller. While the Seagate SAS array reaches its maximum throughput of around 370 MB/s with 48 KiB requests, the Intel SSD array continues to gain throughput even with requests as large as 4 MiB. Focusing on each array separately, it is possible to compare these dom0 measurements with measurements obtained from a VM. The plot below isolates the Seagate SAS array.

[Figure: karcyg-fig2.png — Seagate SAS array: dom0 vs. VM data paths]

Similar to what was observed on the single Western Digital disk, the throughput measured from a VM is lower than that of dom0 when requests are not big enough. In this case, the blkback data path (the pink line) allows the VM to reach the throughput offered by the array (370 MB/s) with requests larger than 116 KiB. The other data paths (orange, cyan and brown lines) represent user space alternatives that hit different bottlenecks and, even with large requests, cannot match the throughput measured from dom0.

It is interesting to observe that the user space implementations vary considerably in terms of performance. When using qdisk as the back end together with the blkfront driver from Linux kernel 3.11.0 (the orange line), the throughput is higher for request sizes such as 256 KiB (when compared to the other user space alternatives; the blkback data path remains faster). The main difference in this particular setup is the support for persistent grants. This technique, implemented in 3.11.0, reuses memory grants and drastically reduces the number of map and unmap operations. It requires, however, an additional copy operation within the guest. The trade-off may have different implications depending on factors such as hardware architecture and workload type. More on that in the next section.
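To illustrate that trade-off, the fragment below sketches what the front end does differently under persistent grants. It is illustrative only: the structure and function are hypothetical (the real logic lives in blkfront).

    #include <stdint.h>
    #include <string.h>

    /* Hypothetical pool entry: a page granted once at ring setup and
     * reused for every request, so it is never unmapped per request. */
    struct persistent_page {
        uint32_t gref;  /* grant reference, established once       */
        void    *page;  /* 4 KiB buffer shared with the back end   */
    };

    /*
     * Classic path:    grant the I/O buffer itself for each request; the
     *                  back end maps it before the I/O and unmaps it
     *                  (with the associated flushes) afterwards.
     * Persistent path: no per-request (un)map, but the guest pays one
     *                  extra copy per 4 KiB segment, as below.
     */
    static void queue_segment(struct persistent_page *slot,
                              const void *io_buf, size_t len)
    {
        memcpy(slot->page, io_buf, len);  /* the extra copy in the guest */
        /* ...then place slot->gref in the ring request as usual...      */
    }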

[Figure: karcyg-fig3.png — Intel SSD array: dom0 vs. VM data paths]

When repeating these measurements on the Intel SSD array, a new issue came to light. Because the array delivers ever higher throughput as larger requests are issued, none of the virtualisation technologies is capable of matching the throughput measured from dom0. While this behaviour will probably differ with other workloads, it is what has been observed when using a single I/O thread with a queue depth of one. In a nutshell, 2 MiB read requests from dom0 achieve 900 MB/s of throughput, while a similar measurement from a VM reaches only 300 MB/s when using user space back ends. This is a pathological example chosen for this particular hardware architecture to show how bad things can get.

Understanding the Overhead

In order to understand why the overhead is so evident in some cases, it is necessary to take a step back. The measurements taken on slower disks show that all virtualisation technologies are somewhat slower than what is observed in dom0. On such disks, this difference disappears as requests grow in size. What happens at that point is that the disk itself becomes "maxed out" and cannot respond faster no matter the request size. At the same time, much of the work done in the virtualisation layers does not get slower in proportion to the amount of data associated with a request. For example, inter-domain notifications are unlikely to take longer simply because requests are bigger. This is exactly why there is no visible overhead with large enough requests on certain disks.

However, the question remains: what is consuming CPU time and causing such a visible overhead in the example presented above? There are mainly two techniques that can be used to answer that question: profiling and tracing. Profiling allows instruction pointer samples to be collected every so many events; the analysis of millions of such samples reveals the code in hot paths where time is being spent. Tracing, on the other hand, measures the exact time elapsed between two events.

For this particular analysis, the tracing technique and the blkback data path were chosen. To measure the amount of time spent between events, the code was modified and several RDTSC instructions were inserted. These instructions read the Time Stamp Counter (TSC) and are relatively cheap while providing very accurate data. On modern hardware, TSCs are constant and consistent across the cores of a host. This means that measurements from different domains (i.e. dom0 and guests) can be matched to obtain the time elapsed, for example, between blkfront kicking blkback and blkback picking up the request. The diagram below shows where the trace points have been inserted.

[Figure: karcyg-fig4.png — trace points along the blkfront/blkback request path]
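The trace points themselves can be as simple as the sketch below, which relies on the TSC properties described above; the instrumented stage is hypothetical.

    #include <stdint.h>

    /* Read the Time Stamp Counter: cheap and cycle-accurate on x86. */
    static inline uint64_t rdtsc(void)
    {
        uint32_t lo, hi;
        __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
        return ((uint64_t)hi << 32) | lo;
    }

    /* Hypothetical trace point bracketing one stage of the data path.
     * Because TSCs are consistent across cores, a reading taken in a
     * guest can be compared directly with one taken in dom0. */
    static uint64_t traced_stage(void (*stage)(void))
    {
        uint64_t t0 = rdtsc();
        stage();                 /* e.g. map the request's grants */
        uint64_t t1 = rdtsc();
        return t1 - t0;          /* cycles spent in this stage */
    }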

In order to gather meaningful results, 100 requests were issued in succession. Domains were pinned to the same NUMA node on the host and turbo capabilities were disabled. The TSC readings were collected for each request and analysed both individually and as an average. The individual analysis revealed interesting findings, such as a "warm up" period where the first requests are always slower; this was attributed to caching and scheduling effects. It also showed that some requests were randomly faster than others in certain parts of the path, which was attributed to CPU affinity. For the average analysis, the 20 fastest and the 20 slowest requests were discarded first; this produced more stable and reproducible results. The plots below show these results.

[Figure: karcyg-fig5.png — average time per request across trace points, without persistent grants]

[Figure: karcyg-fig6.png — average time per request across trace points, with persistent grants]
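For reference, the averaging procedure described above amounts to a trimmed mean; a minimal sketch, assuming 100 cycle counts with 20 discarded from each end:

    #include <stdint.h>
    #include <stdlib.h>

    static int cmp_u64(const void *a, const void *b)
    {
        uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
        return (x > y) - (x < y);  /* avoids overflow of a subtraction */
    }

    /* Average the middle 60 of 100 per-request cycle counts, discarding
     * the 20 fastest and the 20 slowest as described above. */
    static double trimmed_mean(uint64_t samples[100])
    {
        double sum = 0.0;

        qsort(samples, 100, sizeof samples[0], cmp_u64);
        for (int i = 20; i < 80; i++)
            sum += (double)samples[i];
        return sum / 60.0;
    }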

Without persistent grants, the cost of mapping and unmapping memory across domains is clearly a significant factor as requests grow in size. With persistent grants, the extra copy on the front end adds up and results in a slower overall path. Roger Pau Monne, however, showed that persistent grants can improve the aggregate throughput from multiple VMs, as they reduce contention on the grant tables. Matt Wilson, following on from discussions at the Xen Developer Summit 2013, produced patches that should also help with grant table contention.

Conclusions and Next Steps

In summary, Project Karcygwins led to a better understanding of several key elements of storage performance for both Xen and XenServer:

  • The time spent in processing requests (in CPU) definitely matters as disks get faster
  • Throughput is visibly affected for single-threaded I/O on low-latency storage
  • Kernel-only data paths can be significantly faster
  • The cost of mapping (and unmapping) grants is the most significant bottleneck at this time

It also brought these issues to the attention of the Linux and Xen Project communities, with the results shared over a series of presentations and discussions.

Next, new research projects are scheduled (or already underway) to:

  • Look into new ideas for low-latency virtualised storage
  • Investigate bottlenecks and alternatives for aggregate workloads
  • Reduce the overall CPU utilisation of processing requests in user space

Have a happy 2014 and thanks for reading!


Comments (12)

Lorscheider Santiago on Thursday, 02 January 2014 13:16

Congratulations on the excellent article and on your work. It is interesting to know that even with SSD storage there is a performance limitation under Xen, while with the SAS storage arrays it was possible to extract all the performance from the hardware. The XenServer roadmap does not yet mention including kernel 3.11.0 with persistent grants, which would already make it possible to extract more performance today. Project Karcygwins is essential for us to exploit the full performance of SSD storage on Xen.

In your opinion, would an all-SSD storage array be an investment with a low return, given this major performance limitation?

Felipe Franciosi on Tuesday, 07 January 2014 20:11

Hi Santiago. SSDs can be great. The tests I wrote about are all using a single VM and this is the hardest case to deliver near bare-metal performance. I am yet to write about workloads involving multiple guests. We have also been focusing on aggregate throughput and working on maxing out what the disks can deliver.

And don't worry about 3.11 (or newer kernels). In XenServer, we are in control of the features we compile in our kernel. This means we can always import persistent grants or other features if we think it will help (even if we are using 3.10, for example).

Lorscheider Santiago on Wednesday, 08 January 2014 00:30

Hi Felipe.

Thanks for the reply. Indeed, you have great work ahead. It is very good to know that those features can be brought into kernel 3.10. I follow the roadmap and was already worried :) I look forward to the post about performance with multiple guests. Thank you!

Tobias Kreidl on Friday, 03 January 2014 18:49

See also Felipe's detailed article and presentation as referenced in http://xenserver.org/discuss-virtualization/virtualization-blog/entry/xenserver-at-the-xen-project-developer-summit.html

The current efforts to address bottlenecks in storage I/O under XenServer are long overdue and it is encouraging to see this issue being taken seriously and getting the research it needs and deserves. Kudos to you for your efforts to date and it will be great to see how things develop as studies continue.

Guest - Florian Heigl on Friday, 03 January 2014 20:43

Please also see the 2006 Xen Summit presentation from Ian (Pratt?), where an additional ring was used for blktap to boost performance.
Since none of the newer approaches can match blkback, it is still highly relevant.

srinivas j on Monday, 06 January 2014 17:37

Cheers on a good, detailed analysis of storage performance in the current XenServer architecture! Just curious how the results would look if the blktap3 driver were used in the tests?

Do the results and analysis from this project get the attention of XS product planning / roadmaps?

blktap3 is described as follows:
"blktap3 (also referred to as tapdisk3) aims to remove the blktap kernel driver (not upstreamable) by enabling direct tapdisk-domU IO request passing (accessing the inter-domain IO ring directly rather than delegating that function to the blkback driver). Moving to tapdisk3 also avoids the blktap2 kernel driver, which is not included in the upstream kernel."
Link: http://www.xenserver.org/overview-xenserver-open-source-virtualization/project-roadmap/2-uncategorised/116-q4-2013-project-activities.html

Felipe Franciosi on Tuesday, 07 January 2014 20:16

We are still working on blktap3 and constantly focusing on its performance. I will make sure to share some data on these results when they become available. All the results from our studies and technology development projects are taken into consideration for product planning and roadmaps.

Guest - james on Tuesday, 17 June 2014 14:06

@Felipe - ping, for the performance results with blktap3!

Tobias Kreidl on Wednesday, 29 January 2014 18:47

I find it curious that I have not seen anything about the possible integration of something like VAAI (vStorage API for Array Integration) or support for T10 SCSI commands into XenServer, especially given that it has been supported in VMware for at least a couple of years now and that quite a number of contemporary storage arrays support this feature. The gains that can be realised are substantial.
-=Tobias

Guest - james on Tuesday, 18 March 2014 17:36

It has been posted on the blog that blktap3 integration is almost complete with a few defects remaining. Can an update be expected in the XS Q1'2014 activities?
Are there any plans to fully utilize qemu in place of blktap in the near future?

Guest - TK on Sunday, 29 June 2014 02:57

I would be most interested to see new test results with the XenServer Creedence release!

Guest - tjkreidl on Sunday, 29 June 2014 02:59

@James: XenServer Creedence uses blktap3, so performance is definitely better.

